PrestoDB/AWS Athena: Retrieve a large SELECT in chunks

I am trying to query a table in the AWS Athena console. The table reads Parquet files from an S3 bucket, and I need to select more than 1.9 billion rows from it.

When I run this query:

SELECT * FROM ids WHERE org = 'abcd' AND idkey = 'email-md5';

My query times out, because 1.9 billion rows are returned (confirmed by running a COUNT on it).

I tried OFFSET along with LIMIT, but it doesn't seem to work in AWS Athena.

I also tried something along the lines of:

SELECT * FROM ids WHERE org = 'abcd' AND idkey = 'email-md5' LIMIT 0,500;

This doesn't work either.

I'm not sure how to chunk such a large dataset with SELECT.

The aim here is to be able to query the entire dataset without having the query time out.

I ran a COUNT:

SELECT COUNT(*) FROM ids WHERE org = 'abcd' AND idkey = 'email-md5';

The COUNT returned 1.9 billion, as mentioned above. I need to pull all 1.9 billion rows so that I can download the results and do further analysis.


Answer

It appears that your situation is:

  • A daily ETL process delivers new Parquet files each day
  • One table has 1.9 billion rows
  • Queries are timing out in Athena

It would appear that your issue is related to Athena having to scan so much data. Some ways to improve the efficiency (and cost) of Athena are:

  • Use columnar-format files (you are using Parquet, so that is great!)
  • Compress the files (less data to read from disk means faster, cheaper queries)
  • Partition the files (which allows Athena to totally skip files that aren’t relevant)

The simplest one for your situation would probably be to start partitioning the data by putting the daily files into separate directories based upon something that is normally included in the WHERE clause. This would normally be dates, which are easy to partition (e.g. a different directory per day or month), but that might not be relevant given your filtering on org and idkey.
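For example, here is a minimal sketch of partitioning on org, since that column appears in your WHERE clause. The table name, column list, and the s3://my-bucket/ids/ location are assumptions for illustration, not your actual schema:

-- Hypothetical partitioned table; adjust columns and location to match your data
CREATE EXTERNAL TABLE ids_by_org (
  idkey STRING,
  idvalue STRING
)
PARTITIONED BY (org STRING)
STORED AS PARQUET
LOCATION 's3://my-bucket/ids/';

-- Register Hive-style prefixes such as s3://my-bucket/ids/org=abcd/
MSCK REPAIR TABLE ids_by_org;

-- Athena now reads only the org=abcd prefix instead of the whole bucket
SELECT * FROM ids_by_org WHERE org = 'abcd' AND idkey = 'email-md5';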

Another option would be to transform the incoming files into a new table containing only the relevant data. For example, you could create a table with a summary of the rows, such as a table containing org, idkey and a count of those rows. Thus, multiple rows would be reduced to a single row within the file. This requires better knowledge of the content of the files and how you intend to query them, but it would optimize those queries. Basically, you would process each day's new files into the computed table, then run queries against the computed table rather than the raw data. (Commonly known as an ETL process.)
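As a sketch, Athena's CREATE TABLE AS SELECT (CTAS) can build such a summary table directly; the ids_summary name and the external_location below are assumptions:

-- Hypothetical CTAS summary table; the output location is a placeholder
CREATE TABLE ids_summary
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/ids-summary/'
) AS
SELECT org, idkey, COUNT(*) AS row_count
FROM ids
GROUP BY org, idkey;

-- Subsequent queries scan the small summary table, not the raw 1.9 billion rows
SELECT row_count FROM ids_summary WHERE org = 'abcd' AND idkey = 'email-md5';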

A final suggestion would be to import the data into Amazon Redshift. It can handle billions of rows quite easily and stores the data in a compressed, optimized manner. This is only worthwhile if you run lots of queries against the data; if you only run a few queries a day, Athena would be the better choice.
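For reference, a minimal sketch of loading the Parquet files into Redshift with the COPY command; the table definition, bucket path, and IAM role ARN are all placeholders:

-- Hypothetical Redshift table; column types are guesses based on your query
CREATE TABLE ids (
  org      VARCHAR(64),
  idkey    VARCHAR(64),
  idvalue  VARCHAR(256)
);

-- Bulk-load the Parquet files from S3 using a role that can read the bucket
COPY ids
FROM 's3://my-bucket/ids/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS PARQUET;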
