I’m wondering is there a way in AWS Athena to “merge” 2 parquet files into a one single table in Athena just leveraging the columnar model of parquet, meaning without do any joins or post-processing?
For example:
File 1:
id | first_name |
---|---|
1 | Jonh |
2 | Joe |
File 2:
id | last_name | status |
---|---|---|
1 | Doe | 1 |
2 | Smith | 0 |
Athena Final Table result:
id | first_name | last_name | status |
---|---|---|---|
1 | Jonh | Doe | 1 |
2 | Joe | Smith | 0 |
I know I can do a join in both files, but I was wondering since I’m using Parquet can this be achieved without a sql command for join?
Thanks!
Advertisement
Answer
Athena is basically just a modified version of Presto/Trino, which is a pure SQL interface that goes via Hive (or something like Glue/Iceberg). It doesn’t really care what the underlying storage is aside from having a reader for it. So, this is doubtful. It would need to scan each file and join on the keys as it treats parquet, orc, csv, etc all simiarly.
Parquet is also a pretty complex format. Even if those two files had the same columns, they could be laid out internally or sorted very differently. It’s not like they’re both a simple sorted CSVs where you can “just grab and merge everything from both files on line 12” or something like that.
So, I doubt you’ll find anything like this, in Athena/presto or outside of them. It doesn’t sound viable. Anything doing this would have to basically do a join anyway, even if you didn’t call it that.