I have a working query and looking for ideas to optimize it.
Query explanation: Within each ID group (visitor_id
), look for row where c_id != 0
. From that row, show all consecutive rows within that ID group.
select t2.* from (select *, row_number() OVER (PARTITION BY visitor_id ORDER BY date) as row_number from "DB"."schema"."table" where visitor_id in (select distinct visitor_id from (select * from "DB"."schema"."table" where date >= '2021-08-01' and date <= '2021-08-30') where c_id in ('101') ) ) as t2 inner join (select visitor_id, min(rn) as row_number from (select *, row_number() OVER (PARTITION BY visitor_id ORDER BY date) as rn from "DB"."schema"."table" where visitor_id in (select distinct visitor_id from (select * from "DB"."schema"."table" where date >= '2021-08-01' and date <= '2021-08-30') where c_id in ('101') ) ) as filtered_table where c_id != 0 group by visitor_id) as t1 on t2.visitor_id = t1.visitor_id and t2.row_number >= t1.row_number
Advertisement
Answer
so you have a common sub expression
select distinct visitor_id from (select * from "DB"."schema"."table" where date >= '2021-08-01' and date <= '2021-08-30') where c_id in ('101')
so that can be moved to a CTE and run just once. like
WITH distinct_visitors AS ( SELECT DISTINCT visitor_id FROM (SELECT * FROM "DB"."schema"."table" WHERE date >= '2021-08-01' and date <= '2021-08-30') where c_id in ('101') )
but the sub clause filter is equally valid as a top level filter, and given it’s a value inclusive range filter BETWEEN will give better performance.
WITH distinct_visitors AS ( SELECT DISTINCT visitor_id FROM "DB"."schema"."table" WHERE date BETWEEN '2021-08-01' AND'2021-08-30' AND c_id IN ('101') )
then both uses of that CTE do the same ROW_NUMBER operation so that can be a CTE
and simplified as such
WITH rw_rows AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY visitor_id ORDER BY date) AS row_number FROM "DB"."schema"."table" WHERE visitor_id IN ( SELECT DISTINCT visitor_id FROM "DB"."schema"."table" WHERE date BETWEEN '2021-08-01' AND '2021-08-30' AND c_id in ('101') ) ) SELECT t2.* FROM rw_rows AS t2 JOIN ( SELECT visitor_id, min(rn) AS row_number FROM rw_rows AS filtered_table WHERE c_id != 0 GROUP BY visitor_id ) AS t1 ON t2.visitor_id = t1.visitor_id AND t2.row_number >= t1.row_number
So we are want to keep all rows that come after the first non-zero c_id
which a QUALIFY should be able to solve like:
WITH rw_rows AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY visitor_id ORDER BY date) AS row_number FROM "DB"."schema"."table" WHERE visitor_id IN ( SELECT DISTINCT visitor_id FROM "DB"."schema"."table" WHERE date BETWEEN '2021-08-01' AND '2021-08-30' AND c_id in ('101') ) ) SELECT t2.*, MIN(IFF(c_id != 0, row_number, NULL )) OVER (PARTITION BY visitor_id) as min_rn FROM rw_rows AS t2 QUALIFY t2.row_number >= min_rn
which without have run feels like the MIN also should be able to be moved to the QUALIFY like:
WITH rw_rows AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY visitor_id ORDER BY date) AS row_number FROM "DB"."schema"."table" WHERE visitor_id IN ( SELECT DISTINCT visitor_id FROM "DB"."schema"."table" WHERE date BETWEEN '2021-08-01' AND '2021-08-30' AND c_id in ('101') ) ) SELECT t2.* FROM rw_rows AS t2 QUALIFY t2.row_number >= MIN(IFF(c_id != 0, row_number, NULL )) OVER (PARTITION BY visitor_id)
At which point the CTE is not needed, as it’s just used once, so could be moved back in, or not as they are the same.