PrestoDB/AWS Athena: Retrieve a large SELECT in chunks

I am trying to query a table in the AWS Athena console. The table reads Parquet files from an S3 bucket, and I need to select more than 1.9 billion rows from it.

When I run this query:

SELECT * FROM ids WHERE org = 'abcd' AND idkey = 'email-md5';

My query times out, because 1.9 billion rows are returned (confirmed by running a COUNT on it).

I tried OFFSET along with LIMIT, but it doesn't seem to work in AWS Athena.

I also tried something along the lines of:

SELECT * FROM ids WHERE org = 'abcd' AND idkey = 'email-md5' LIMIT 0,500;

This doesn't work either.

I'm not sure how to chunk such a large dataset with SELECT.

The aim here is to be able to query the entire dataset without having the query time out.

I ran a COUNT:

SELECT COUNT(*) FROM ids WHERE org = 'abcd' AND idkey = 'email-md5';

The COUNT returned 1.9 billion, as mentioned above. I need to pull all 1.9 billion rows so that I can download the results and do further analysis.


Answer

It appears that your situation is:

  • A daily ETL process delivers new Parquet files each day
  • One table has 1.9 billion rows
  • Queries are timing out in Athena

It would appear that your issue is related to Athena having to scan so much data. Some ways to improve the efficiency (and cost) of Athena are:

  • Use columnar-format files (you are using Parquet, so that is great!)
  • Compress the files (less data to read from disk means faster, cheaper queries)
  • Partition the files (which allows Athena to totally skip files that aren’t relevant)

The simplest one for your situation would probably be to start partitioning the data by putting the daily files into separate directories based upon something that is normally included in the WHERE clause. This would normally be dates, which are easy to partition (e.g. a different directory per day or month), but that might not be relevant given your filtering on org and idkey.
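For example, here is a minimal sketch of partitioning on org, since that column appears in your WHERE clause. The table name, column list, and the s3://my-bucket/ids/ location are assumptions for illustration, not your actual schema:

-- Hypothetical partitioned table; adjust columns and location to match your data
CREATE EXTERNAL TABLE ids_by_org (
  idkey STRING,
  idvalue STRING
)
PARTITIONED BY (org STRING)
STORED AS PARQUET
LOCATION 's3://my-bucket/ids/';

-- Register Hive-style prefixes such as s3://my-bucket/ids/org=abcd/
MSCK REPAIR TABLE ids_by_org;

-- Athena now reads only the org=abcd prefix instead of the whole bucket
SELECT * FROM ids_by_org WHERE org = 'abcd' AND idkey = 'email-md5';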

Another option would be to transform the incoming files into a new table containing only the relevant data. For example, you could create a table with a summary of the rows, such as a table containing org, idkey and a count of those rows. Thus, multiple rows would be reduced to a single row within the file. This requires better knowledge of the content of the files and how you intend to query them, but it would optimize those queries. Basically, you would process each day's new files into the computed table, then run queries against the computed table rather than the raw data. (Commonly known as an ETL process.)
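As a sketch, Athena's CREATE TABLE AS SELECT (CTAS) can build such a summary table directly; the ids_summary name and the external_location below are assumptions:

-- Hypothetical CTAS summary table; the output location is a placeholder
CREATE TABLE ids_summary
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/ids-summary/'
) AS
SELECT org, idkey, COUNT(*) AS row_count
FROM ids
GROUP BY org, idkey;

-- Subsequent queries scan the small summary table, not the raw 1.9 billion rows
SELECT row_count FROM ids_summary WHERE org = 'abcd' AND idkey = 'email-md5';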

A final suggestion would be to import the data into Amazon Redshift. It can handle billions of rows quite easily and stores the data in a compressed, optimized manner. This is only worthwhile if you run lots of queries against the data; if you only run a few queries a day, Athena would be the better choice.
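For reference, a minimal sketch of loading the Parquet files into Redshift with the COPY command; the table definition, bucket path, and IAM role ARN are all placeholders:

-- Hypothetical Redshift table; column types are guesses based on your query
CREATE TABLE ids (
  org      VARCHAR(64),
  idkey    VARCHAR(64),
  idvalue  VARCHAR(256)
);

-- Bulk-load the Parquet files from S3 using a role that can read the bucket
COPY ids
FROM 's3://my-bucket/ids/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS PARQUET;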
