I want to apply custom logic over a dataset stored in Redshift. Example of input data:
userid, event, fileid, timestamp, ....
100000, start, 120, 2018-09-17 19:11:40
100000, done, 120, 2018-09-17 19:12:40
100000, done, 120, 2018-09-17 19:13:40
100000, start, 500, 2018-09-17 19:13:50
100000, done, 120, 2018-09-17 19:14:40
100000, done, 500, 2018-09-17 19:14:50
100000, done, 120, 2018-09-17 19:15:40
This means something like:
file 120: start-----done-----done-----done-----done
file 500: start-----done
time    : 11:40----12:40----13:40-----14:40-----15:40

But it should look like:

file 120: start-----done-----done
file 500: start-----done
time    : 11:40----12:40----13:40-----14:40-----15:40

File 120 was interrupted once file 500 was started.
Keep in mind that there are a lot of different users and many different files here.
Cleaned data should be:
userid, event, fileid, timestamp, ....
100000, start, 120, 2018-09-17 19:11:40
100000, done, 120, 2018-09-17 19:12:40
100000, done, 120, 2018-09-17 19:13:40
100000, start, 500, 2018-09-17 19:13:50
100000, done, 500, 2018-09-17 19:14:50
A user shouldn't have multiple files in progress at once, so after the second file has started, events from the first one should be removed from the dataset.

The code is simple, but it's in Python. It scales easily on Google Dataflow, for example, but moving 100GB+ from AWS to GCP is not a good idea.

Question #1: Is it possible to do this in SQL (using Postgres/Redshift-specific features), or is it better to use Spark? (I'm not sure how to implement it there.)

Question #2: Any suggestions on whether it would be better to use AWS Batch or something else? With Apache Beam it's easy and fairly obvious, but how AWS Batch works and how to divide the dataset into chunks (e.g. one group per user) is a big question for me. My suggestion is to somehow unload the data from Redshift into an S3 bucket, split so that each user gets a separate file; then, if AWS Batch supports this, just feed it the bucket and each file would be processed concurrently on already-created instances. Not sure if this makes sense.
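The unload part of that suggestion could look roughly like this. It is just a sketch: the cluster, bucket, and IAM role names are placeholders, it assumes the psycopg2 driver, and it relies on Redshift's UNLOAD ... PARTITION BY to write one S3 prefix per user.

import psycopg2  # assumption: any Redshift-compatible Postgres driver would do

# Placeholder connection details.
conn = psycopg2.connect(
    host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="etl_user", password="***")

# PARTITION BY (userid) writes each user's events under their own S3 prefix,
# so every user ends up in a separate file (or set of files).
unload_sql = """
    UNLOAD ('SELECT userid, event, fileid, timestamp FROM t')
    TO 's3://my-bucket/events/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload'
    FORMAT AS PARQUET
    PARTITION BY (userid);
"""

with conn, conn.cursor() as cur:
    cur.execute(unload_sql)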
Answer
If you want to remove rows where the fileid does not match the most recent start for the user, you can use lag(ignore nulls):
select t.*
from (select t.*,
             lag(case when event = 'start' then fileid end ignore nulls)
                 over (partition by userid order by timestamp) as start_fileid
      from t
     ) t
where event = 'start' or start_fileid = fileid;
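If you end up going the Spark route instead (your question #1), the same window trick carries over. Here is a minimal PySpark sketch, assuming the events sit in a DataFrame read from placeholder S3 paths; last(..., ignorenulls=True) over a running window plays the role of lag(ignore nulls):

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.appName("clean-file-events").getOrCreate()

# Placeholder input path; any source with userid, event, fileid, timestamp works.
events = spark.read.parquet("s3://my-bucket/events/")

# Running window per user: everything from the first event up to the current row.
w = (Window.partitionBy("userid")
           .orderBy("timestamp")
           .rowsBetween(Window.unboundedPreceding, Window.currentRow))

cleaned = (events
    # fileid of the most recent 'start' event seen so far for this user
    .withColumn("start_fileid",
                F.last(F.when(F.col("event") == "start", F.col("fileid")),
                       ignorenulls=True).over(w))
    # keep 'start' rows and rows belonging to the currently active file
    .filter((F.col("event") == "start") | (F.col("start_fileid") == F.col("fileid")))
    .drop("start_fileid"))

cleaned.write.parquet("s3://my-bucket/events-cleaned/")  # placeholder output path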