
Process several billion records from Redshift using custom logic

I want to apply custom logic over a dataset stored in Redshift. Example of input data:

This means something like:

But it should look like:

File 120 was interrupted once file 150 was started.

Keep in mind that there are many different users here and many different files.

Cleaned data should be:

It shouldn't be possible to have multiple concurrent files at once for the same user, so after the second one has started, events from the first one should no longer appear in the cleaned dataset.

The code is simple, but it is in Python. It would scale easily on Google Dataflow, for example, but moving 100GB+ from AWS to GCP is not a good idea.

Question #1: Is it possible to do this in SQL (using Postgres/Redshift-specific features), or is it better to use Spark? (I'm not sure how I would implement it there.)

Question #2: Any suggestions on whether AWS Batch (or something else) would be a better fit? With Apache Beam it's easy and pretty much obvious, but how AWS Batch works and how to divide the dataset into chunks (e.g. grouped per user) is a big question. My idea is to unload the data from Redshift into an S3 bucket, split so that each file corresponds to one user; then, if AWS Batch supports this, just feed it the bucket and each file would be processed concurrently on already-provisioned instances. Not sure if this makes sense.
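For the unload-per-user idea, Redshift's UNLOAD can partition its output by a column, which would give one S3 prefix per user. A rough sketch, assuming a hypothetical events table with a userid column and placeholder S3 path and IAM role:

```sql
-- Sketch only: the table/column names, the S3 path and the IAM role are placeholders.
-- PARTITION BY (userid) writes each user's rows under their own prefix
-- (s3://my-bucket/events/userid=<value>/...), which a downstream job such as an
-- AWS Batch array job could then pick up and process user by user.
UNLOAD ('SELECT userid, fileid, event, ts FROM events')
TO 's3://my-bucket/events/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-unload-role'
PARTITION BY (userid) INCLUDE
FORMAT AS PARQUET;
```

One thing to keep in mind: partitioning by a high-cardinality column like userid produces one prefix per user, so with millions of users this becomes a very large number of small files.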


Answer

If you want to remove rows where the fileid does not match the most recent start for the user, you can use lag(ignore nulls):
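A minimal sketch of that approach, assuming hypothetical names for the table and columns (events, userid, fileid, event, ts):

```sql
-- Sketch only: the table and column names are assumptions, not from the original post.
-- Carry the fileid of the most recent 'start' event forward within each user's timeline,
-- then keep only the rows that belong to that most recently started file.
SELECT userid, fileid, event, ts
FROM (
    SELECT e.*,
           LAG(CASE WHEN event = 'start' THEN fileid END) IGNORE NULLS
               OVER (PARTITION BY userid ORDER BY ts) AS last_started_fileid
    FROM events e
) t
WHERE event = 'start'                 -- every start row opens a new file
   OR fileid = last_started_fileid;   -- other rows must belong to the latest started file
```

Because LAG ... IGNORE NULLS skips over the non-start rows, last_started_fileid always holds the fileid of the user's latest start, so any event still tagged with an older fileid after a newer start is dropped.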
