I want to apply custom logic over a dataset stored in Redshift. Example of input data:
userid, event, fileid, timestamp, ....
100000, start, 120, 2018-09-17 19:11:40
100000, done, 120, 2018-09-17 19:12:40
100000, done, 120, 2018-09-17 19:13:40
100000, start, 500, 2018-09-17 19:13:50
100000, done, 120, 2018-09-17 19:14:40
100000, done, 500, 2018-09-17 19:14:50
100000, done, 120, 2018-09-17 19:15:40
This means something like:
file 120: start-----done-----done-----done-----done
file 500: start-----done
time    : 11:40----12:40----13:40-----14:40-----15:40

But it should look like:

file 120: start-----done-----done
file 500: start-----done
time    : 11:40----12:40----13:40-----14:40-----15:40

File 120 was interrupted once file 500 was started.
Keep in mind that there are a lot of different users and many different files here.
Cleaned data should be:
userid, event, fileid, timestamp, ....
100000, start, 120, 2018-09-17 19:11:40
100000, done, 120, 2018-09-17 19:12:40
100000, done, 120, 2018-09-17 19:13:40
100000, start, 500, 2018-09-17 19:13:50
100000, done, 500, 2018-09-17 19:14:50
A user shouldn't have multiple files in progress at once, so after the second file has started, events from the first one should be removed from the dataset.

The code is simple, but it's in Python. It scales easily on Google Dataflow, for example, but moving 100GB+ from AWS to GCP is not a good idea.

Question #1: Is it possible to do this in SQL (using Postgres/Redshift-specific features), or is it better to use Spark? (I'm not sure how to implement it there.)

Question #2: Any suggestions on whether it would be better to use AWS Batch or something else? With Apache Beam it's easy and fairly obvious, but how AWS Batch works and how to divide the dataset into chunks (e.g. one group per user) is a big question for me. My suggestion is to somehow unload the data from Redshift into an S3 bucket, split so that each user gets a separate file; then, if AWS Batch supports this, just feed it the bucket and each file would be processed concurrently on already-created instances. Not sure if this makes sense.
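The unload part of that suggestion could look roughly like this. It is just a sketch: the cluster, bucket, and IAM role names are placeholders, it assumes the psycopg2 driver, and it relies on Redshift's UNLOAD ... PARTITION BY to write one S3 prefix per user.

import psycopg2  # assumption: any Redshift-compatible Postgres driver would do

# Placeholder connection details.
conn = psycopg2.connect(
    host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="etl_user", password="***")

# PARTITION BY (userid) writes each user's events under their own S3 prefix,
# so every user ends up in a separate file (or set of files).
unload_sql = """
    UNLOAD ('SELECT userid, event, fileid, timestamp FROM t')
    TO 's3://my-bucket/events/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload'
    FORMAT AS PARQUET
    PARTITION BY (userid);
"""

with conn, conn.cursor() as cur:
    cur.execute(unload_sql)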
Answer
If you want to remove rows where the fileid does not match the most recent start for the user, you can use lag(ignore nulls):
select t.*
from (select t.*,
             lag(case when event = 'start' then fileid end ignore nulls)
                 over (partition by userid order by timestamp) as start_fileid
      from t
     ) t
where event = 'start' or start_fileid = fileid;
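If you end up going the Spark route instead (your question #1), the same window trick carries over. Here is a minimal PySpark sketch, assuming the events sit in a DataFrame read from placeholder S3 paths; last(..., ignorenulls=True) over a running window plays the role of lag(ignore nulls):

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.appName("clean-file-events").getOrCreate()

# Placeholder input path; any source with userid, event, fileid, timestamp works.
events = spark.read.parquet("s3://my-bucket/events/")

# Running window per user: everything from the first event up to the current row.
w = (Window.partitionBy("userid")
           .orderBy("timestamp")
           .rowsBetween(Window.unboundedPreceding, Window.currentRow))

cleaned = (events
    # fileid of the most recent 'start' event seen so far for this user
    .withColumn("start_fileid",
                F.last(F.when(F.col("event") == "start", F.col("fileid")),
                       ignorenulls=True).over(w))
    # keep 'start' rows and rows belonging to the currently active file
    .filter((F.col("event") == "start") | (F.col("start_fileid") == F.col("fileid")))
    .drop("start_fileid"))

cleaned.write.parquet("s3://my-bucket/events-cleaned/")  # placeholder output path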