I have a dataset in BigQuery where I want to count all non-duplicate actions. A duplicate action is one that occurs within a certain timeframe or ‘countdown’ from prior actions. Once the countdown reaches 0, the timer resets and the very next action is no longer considered to be duplicate.
In my example, the countdown timer starts at 10 seconds.
- Sample code: http://sqlfiddle.com/#!17/2dd69/1
- HitTime: Timestamp (seconds) for each row
- DesiredOutput: What I am trying to create. Does not exist in my dataset
- DesiredOutputExplation: An explanation to help explain the logic. Does not exist in my dataset
Advertisement
Answer
So I took the advice from shawnt00 and produced the following output in BigQuery.
DECLARE counter INT64 DEFAULT 1; DECLARE timerStart INT64 DEFAULT 10; DECLARE maxRows INT64; DECLARE countdown INT64; CREATE TEMP TABLE myData (HitTime INT64); INSERT INTO myData VALUES (0),(21),(24),(30),(32); CREATE TEMP TABLE myData2 AS SELECT *, ROW_NUMBER() OVER(ORDER BY HitTime) AS rowNum, HitTime - LAG(HitTime, 1) OVER(ORDER BY HitTime) AS timeBetween, 0 AS Output FROM myData; SET maxRows = (SELECT COUNT(*) FROM myData); SET countdown = timerStart; WHILE counter<=maxRows DO IF (SELECT timeBetween FROM myData2 WHERE rowNum = counter) IS NULL OR (SELECT timeBetween FROM myData2 WHERE rowNum = counter) >= countdown THEN UPDATE myData2 SET Output = 1 WHERE rowNum = counter; SET countdown = timerStart; ELSE UPDATE myData2 SET Output = 0 WHERE rowNum = counter; SET countdown = countdown - (SELECT timeBetween FROM myData2 WHERE rowNum = counter); END IF; SET counter=counter+1; END WHILE; SELECT HitTime, Output FROM myData2 ORDER BY HitTime;
While this does produce my desired output, I am unsure if there is a more effective way to do this (I’m pretty new to T-SQL). Especially considering my production dataset has ~800M rows in it.