Skip to content
Advertisement

Optimizing SQL query – finding a group with in a group

I have a working query and looking for ideas to optimize it.

Query explanation: Within each ID group (visitor_id), look for row where c_id != 0. From that row, show all consecutive rows within that ID group.

select t2.*
from (select *, row_number() OVER (PARTITION BY visitor_id ORDER BY date) as row_number
  from "DB"."schema"."table"
  where visitor_id in 
     (select distinct visitor_id 
     from (select * from "DB"."schema"."table" where date >= '2021-08-01' and date <= '2021-08-30')
     where c_id in ('101')
     ) 
   ) as t2
inner join
(select visitor_id, min(rn) as row_number
from
  (select *, row_number() OVER (PARTITION BY visitor_id ORDER BY date) as rn
  from "DB"."schema"."table"
  where visitor_id in 
     (select distinct visitor_id 
     from (select * from "DB"."schema"."table" where date >= '2021-08-01' and date <= '2021-08-30')
     where c_id in ('101')
     ) 
   ) as filtered_table
where c_id != 0
group by visitor_id) as t1

on t2.visitor_id = t1.visitor_id
and t2.row_number >= t1.row_number

Advertisement

Answer

so you have a common sub expression

select distinct visitor_id 
     from (select * from "DB"."schema"."table" where date >= '2021-08-01' and date <= '2021-08-30')
     where c_id in ('101')

so that can be moved to a CTE and run just once. like

WITH distinct_visitors AS (
    SELECT DISTINCT visitor_id 
     FROM (SELECT * FROM "DB"."schema"."table" WHERE date >= '2021-08-01' and date <= '2021-08-30')
     where c_id in ('101')
)

but the sub clause filter is equally valid as a top level filter, and given it’s a value inclusive range filter BETWEEN will give better performance.

WITH distinct_visitors AS (
   SELECT DISTINCT visitor_id 
    FROM "DB"."schema"."table" 
    WHERE date BETWEEN '2021-08-01' AND'2021-08-30'
        AND c_id IN ('101')
)

then both uses of that CTE do the same ROW_NUMBER operation so that can be a CTE

and simplified as such

WITH rw_rows AS (
    SELECT *, 
        ROW_NUMBER() OVER (PARTITION BY visitor_id ORDER BY date) AS row_number
    FROM "DB"."schema"."table"
    WHERE visitor_id IN (
        SELECT DISTINCT visitor_id 
        FROM "DB"."schema"."table" 
        WHERE date BETWEEN '2021-08-01' AND '2021-08-30'
            AND  c_id in ('101')
    )
)
SELECT t2.*
FROM rw_rows AS t2
JOIN (
    SELECT visitor_id, 
        min(rn) AS row_number
    FROM rw_rows AS filtered_table
    WHERE c_id != 0
    GROUP BY visitor_id
) AS t1
    ON t2.visitor_id = t1.visitor_id
        AND t2.row_number >= t1.row_number

So we are want to keep all rows that come after the first non-zero c_id which a QUALIFY should be able to solve like:

WITH rw_rows AS (
    SELECT *, 
        ROW_NUMBER() OVER (PARTITION BY visitor_id ORDER BY date) AS row_number
    FROM "DB"."schema"."table"
    WHERE visitor_id IN (
        SELECT DISTINCT visitor_id 
        FROM "DB"."schema"."table" 
        WHERE date BETWEEN '2021-08-01' AND '2021-08-30'
            AND  c_id in ('101')
    )
)
SELECT t2.*,
    MIN(IFF(c_id != 0, row_number, NULL )) OVER (PARTITION BY visitor_id) as min_rn
FROM rw_rows AS t2
QUALIFY t2.row_number >= min_rn

which without have run feels like the MIN also should be able to be moved to the QUALIFY like:

WITH rw_rows AS (
    SELECT *, 
        ROW_NUMBER() OVER (PARTITION BY visitor_id ORDER BY date) AS row_number
    FROM "DB"."schema"."table"
    WHERE visitor_id IN (
        SELECT DISTINCT visitor_id 
        FROM "DB"."schema"."table" 
        WHERE date BETWEEN '2021-08-01' AND '2021-08-30'
            AND  c_id in ('101')
    )
)
SELECT t2.*
FROM rw_rows AS t2
QUALIFY t2.row_number >= MIN(IFF(c_id != 0, row_number, NULL )) OVER (PARTITION BY visitor_id)

At which point the CTE is not needed, as it’s just used once, so could be moved back in, or not as they are the same.

User contributions licensed under: CC BY-SA
6 People found this is helpful
Advertisement