
How to relationalize, join and aggregate multiple files from S3

I have a bucket in S3 containing hundreds of folders, each of which contains files with the same structure: CSV representations of relational database tables. The folders differ in the content of their data, but overlaps may occur.

In each folder, I want to join 3 tables and store the output in a dedicated table. The dedicated table should eventually hold the joined data from all the folders. Duplicates may occur between folders, but the records have a unique key that can help with the aggregation.

The total size of all files in a single folder can reach 5 GB of disk space. Two of the files contain hundreds of thousands of records each; the third can reach up to 20 million records.

The result should be stored in AWS RDS, on a PostgreSQL instance. However, I am considering switching to Redshift. Would it be better for this scale of data?

The 3 tables are:

  • Providers
  • Consumers
  • Transactions

All of them are indexed by the key used in the join. My approach is to iterate over the S3 bucket and, for each folder, load the 3 files into the database. Then I create the joined table for the 3 tables using SQL, and finally add the joined data to the aggregated table that should contain the data from all folders.
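For the load step, I do something along these lines per file (the table and column definitions below are just placeholders; on RDS, the aws_s3 extension could be used instead of a client-side copy):

CREATE UNLOGGED TABLE staging_providers (
    provider_id   bigint PRIMARY KEY,
    provider_name text
);

-- run from psql; loads one CSV from the folder into the staging table
\copy staging_providers FROM 'providers.csv' WITH (FORMAT csv, HEADER true)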

I am currently trying to handle 1 folder, in order to better understand how to make the process optimal, both in time and space. After loading, I noticed the database uses around 2X the disk space I expected. Why does joining cost so much in disk space? Is there a way of loading and joining with minimal cost? The data loaded initially for each folder is used as a staging table until I drop the duplicates and load it into the aggregated table, so its lifespan will be relatively short. I tried using CREATE UNLOGGED TABLE but it didn't have much effect.

CREATE UNLOGGED TABLE agg_data AS
SELECT *
FROM transactions t
INNER JOIN consumers c USING (consumer_id)
INNER JOIN providers p USING (provider_id);

This works OK for 1 folder, time-wise, but it does take a lot more disk space than I assumed it would. How will this work at mass scale, for hundreds of folders? How will the aggregation behave over time, as I will need to search for duplicate records in a continuously growing table?

To summarize my questions:

  1. How should I choose between RDS and Redshift? My concerns are the tens of millions of records in the target table and the need to drop duplicates while adding new data to it.
  2. Why does joining the data take so much database storage? Is there a way to minimize it for data that is only temporary?
  3. What is an efficient way of inserting new data into the destination table while dropping duplicates?
  4. Would it be better to join and store the files in S3 using AWS Glue, and then load them into the target database? Currently that does not seem like an option, as Glue takes forever to join the data.


Answer

I would recommend using Amazon Athena to join the files and produce the desired output.

First, each directory needs to be recognised as a table. This can be done by manually running a CREATE EXTERNAL TABLE command in Athena and pointing it at the folder. All files within the folder will be treated as containing data for the table, and they should all be in the same format.
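For example, a table definition might look something like the following; the column list, S3 path, and CSV options are assumptions that you would adjust to match your files:

CREATE EXTERNAL TABLE transactions (
  transaction_id bigint,
  consumer_id    bigint,
  provider_id    bigint,
  amount         double
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/folder-001/transactions/'
TBLPROPERTIES ('skip.header.line.count' = '1');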

If desired, an AWS Glue crawler can instead be used to create the table definition. Create a crawler and point it to the folder. Glue will create the table definition in the AWS Glue Data Catalog, which is accessible to Athena.

Once the three input tables have been defined, you can run a query in Amazon Athena that joins the three tables and produces an output table using CREATE TABLE AS.

See: Creating a Table from Query Results (CTAS) – Amazon Athena
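A CTAS query along these lines would write the joined result back to S3; the table names, join keys, output format, and output location here are assumptions:

CREATE TABLE joined_folder_001
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/output/folder-001/'
) AS
SELECT t.transaction_id, t.amount, c.consumer_id, p.provider_id
FROM transactions t
JOIN consumers c ON t.consumer_id = c.consumer_id
JOIN providers p ON t.provider_id = p.provider_id;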

Glue can also be used to run ETL scripts in Python (see Program AWS Glue ETL Scripts in Python – AWS Glue), but I haven't tried this, so I can't offer advice on it. However, I have used AWS Glue crawlers to create tables that I then query via Amazon Athena.

Once you have the output data, you can then load it into the database of your choice. Which database you choose depends upon your use-case. I would suggest starting with Amazon RDS for PostgreSQL, since it is a traditional database and you seem to be comfortable with it. If you later need improved performance (e.g. billions of rows instead of millions), you can move to Amazon Redshift.
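For the deduplication concern, a common PostgreSQL pattern is an insert that skips rows whose key already exists. A minimal sketch, assuming the aggregated table is called agg_all, the per-folder join output is agg_data, and there is a unique constraint on transaction_id:

-- insert a folder's joined output, silently skipping keys already present
INSERT INTO agg_all
SELECT *
FROM agg_data
ON CONFLICT (transaction_id) DO NOTHING;

This keeps deduplication inside the database and relies on the unique index, so it should continue to work as the aggregated table grows.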

General comment: It is rather strange that you wish to join those 3 tables, since the result will presumably contain a lot of duplicated data (it is very denormalized). You could instead simply load those tables into your desired database and then do the joins in the database, possibly being selective about which columns you wish to include.
