Skip to content
Advertisement

troubleshooting a SQL code for cohort analysis

Having a problem with a code I wrote to perform a cohort analysis for repeat purchases. Can someone help me? My code was able to identify repeat purchases and assigning cohort for the most part BUT it is unable to assign a proper label for the last purchase of that customer. It’s being labeled as “new to brand” highlighted in red here. It should be labeled as “0-15 days.” Any suggestion? I realized that my code doesn’t differentiate between cases when this is the user’s first purchase or their final one. I’m not sure how to troubleshoot it. Any suggestion appreciated. Code and pic attached.

OLD RESULT: https://i.stack.imgur.com/BZvpy.png

NEW RESULT WITH NEW CODE: https://i.stack.imgur.com/NUndb.png

SELECT DISTINCT "T1"."amazon-order-id", 
"T1"."BRAND",
"T1"."buyer-email", 
"T1"."purchase-date", 
MIN("T2"."purchase-date") AS "Next Purchase Date",
MAX("T2"."purchase-date") AS "Last Purchase Date",
DATEDIFF(Day,"T1"."purchase-date", MIN("T2"."purchase-date")) AS "Days Difference",
CASE
WHEN "Days Difference" < 15 THEN '0-15 days'
WHEN "Days Difference" < 30 THEN '15-30 days'
WHEN "Days Difference" < 45 THEN '30-45 days'
WHEN "Days Difference" < 60 THEN '45-60 days'
WHEN "Days Difference" > 60 THEN '60+ days'
WHEN "T1"."purchase-date" = "Next Purchase Date" THEN 'NTB'
END "Cohort",
sum("T1"."shipped-quantity") AS "Units Sold", 
sum("T1"."item-price") AS "Sales"
    
FROM "DATA_WAREHOUSE"."PUBLIC"."AMAZON_FULFILLED_SHIPMENT_REPORT" AS T1
LEFT JOIN
    "DATA_WAREHOUSE"."PUBLIC"."AMAZON_FULFILLED_SHIPMENT_REPORT" AS T2
    ON "T1"."buyer-email" = "T2"."buyer-email"
    AND "T2"."purchase-date" > "T1"."purchase-date"
GROUP BY "T1"."amazon-order-id", 
"T1"."buyer-email", 
"T1"."BRAND",
"T1"."purchase-date"
ORDER BY 
"T1"."BRAND",
"T1"."buyer-email", 
"T1"."purchase-date",
"T1"."amazon-order-id"

Advertisement

Answer

The problem is with your case statement. You are Left Joining T2 with purchases GREATER than the T1 Purchase Date, which for the last purchase would mean the T2 table would not return any rows.

To get Latest purchases to be labeled in the 0-15 day category, you would need to include an “Is Null” criteria or add a coalesce statement into your DATEDIFF if MIN(T2.Purchase_Date) is null to pull the T1 date.

Edit: Something like this may work if you only have one purchase per date. You may not need the second left join to get the Last Purchase Date if you don’t need that field in the query:

SELECT "T1"."amazon-order-id"
    ,"T1"."BRAND"
    ,"T1"."buyer-email"
    ,"T1"."purchase-date"
    ,"T2"."purchase-date" AS "Next Purchase Date"
    ,"T3"."purchase-date" AS "Last Purchase Date"
    ,DATEDIFF(Day, "T1"."purchase-date", "T2"."purchase-date") AS "Days Difference"
    ,CASE 
        WHEN LAG("T1"."purchase-date") OVER (
                PARTITION BY "T1"."buyer-email" ORDER BY "T1"."purchase-date"
                ) IS NULL
            THEN 'NTB'
        WHEN "T2"."purchase-date" IS NULL
            THEN 'Last_Purchase'
        WHEN DATEDIFF(Day, "T1"."purchase-date", "T2"."purchase-date") < 15
            THEN '0-15 days'
        WHEN DATEDIFF(Day, "T1"."purchase-date", "T2"."purchase-date") < 30
            THEN '15-30 days'
        WHEN DATEDIFF(Day, "T1"."purchase-date", "T2"."purchase-date") < 45
            THEN '30-45 days'
        WHEN DATEDIFF(Day, "T1"."purchase-date", "T2"."purchase-date") < 60
            THEN '45-60 days'
        WHEN DATEDIFF(Day, "T1"."purchase-date", "T2"."purchase-date") > 60
            THEN '60+ days'
        END "Cohort"
    ,"T1"."shipped-quantity" AS "Units Sold"
    ,"T1"."item-price" AS "Sales"
FROM "DATA_WAREHOUSE"."PUBLIC"."AMAZON_FULFILLED_SHIPMENT_REPORT" AS T1
LEFT JOIN "DATA_WAREHOUSE"."PUBLIC"."AMAZON_FULFILLED_SHIPMENT_REPORT" AS T2 ON "T1"."buyer-email" = "T2"."buyer-email"
    AND "T2"."purchase-date" = (
        SELECT Min(T2E."purchase-date")
        FROM "DATA_WAREHOUSE"."PUBLIC"."AMAZON_FULFILLED_SHIPMENT_REPORT" AS T2E
        WHERE "T1"."buyer-email" = T2E ."buyer-email"
                                and "T1"."purchase-DATE" < "T2E"."purchase-DATE")
LEFT JOIN
    "DATA_WAREHOUSE"."PUBLIC"."AMAZON_FULFILLED_SHIPMENT_REPORT" AS T3
    ON "T1"."buyer-email" = "T3"."buyer-email"
    AND "T3"."purchase-DATE" = (select max(T2E."purchase-DATE") 
                                from "DATA_WAREHOUSE"."PUBLIC"."AMAZON_FULFILLED_SHIPMENT_REPORT" AS T2E 
                                where "T1"."buyer-email"=T2E."buyer-email"
            AND "T1"."purchase-date" < "T2E"."purchase-date"
        )
ORDER BY "T1"."BRAND"
    ,"T1"."buyer-email"
    ,"T1"."purchase-date"
    ,"T1"."amazon-order-id"

In SQL Server, I would have normally just done this with a windows function within an Outer/Cross Apply to reduce the number of times referencing the DATEDIFF, but it’s not available in snowflake AFAIK.

User contributions licensed under: CC BY-SA
9 People found this is helpful
Advertisement