Skip to content
Advertisement

Join 2 parquet files with different columns but common key(id) column in Athena

I’m wondering is there a way in AWS Athena to “merge” 2 parquet files into a one single table in Athena just leveraging the columnar model of parquet, meaning without do any joins or post-processing?

For example:

File 1:

id first_name
1 Jonh
2 Joe

File 2:

id last_name status
1 Doe 1
2 Smith 0

Athena Final Table result:

id first_name last_name status
1 Jonh Doe 1
2 Joe Smith 0

I know I can do a join in both files, but I was wondering since I’m using Parquet can this be achieved without a sql command for join?

Thanks!

Advertisement

Answer

Athena is basically just a modified version of Presto/Trino, which is a pure SQL interface that goes via Hive (or something like Glue/Iceberg). It doesn’t really care what the underlying storage is aside from having a reader for it. So, this is doubtful. It would need to scan each file and join on the keys as it treats parquet, orc, csv, etc all simiarly.

Parquet is also a pretty complex format. Even if those two files had the same columns, they could be laid out internally or sorted very differently. It’s not like they’re both a simple sorted CSVs where you can “just grab and merge everything from both files on line 12” or something like that.

So, I doubt you’ll find anything like this, in Athena/presto or outside of them. It doesn’t sound viable. Anything doing this would have to basically do a join anyway, even if you didn’t call it that.

User contributions licensed under: CC BY-SA
6 People found this is helpful
Advertisement