I am trying to compute DAU WAU MAU ratios.
- DAU: active users on the day
- WAU: active users of the past 7 days
- MAU: active users of the past 30 days
The DAU:WAU, DAU:MAU, and WAU:MAU ratios compute stickiness of user engagement in terms of %.
I have a table called my_table
that entails
datetime_field
user_id
This table lists all user_id
that were active on a given datetime_field
.
For the previous 6 months (excluding current month), I want to see the three above mentioned ratios (DAU:WAU, DAU:MAU, WAU:MAU) for every day. In other words for every day, I need the daily active user (a distinct count on users on that day), the active users of the past week (distinct count on users for the last 7 days of that day), and the active users of the past month (distinct count on users for the last 30 days of that day).
When running the following query (with variable names modified):
WITH s AS ( SELECT DATE_TRUNC('day', datetime_field) AS dt, COUNT(DISTINCT user_id)::float AS dau FROM my_table WHERE datetime_field <= current_date - INTERVAL '1 month' AND datetime_field > current_date - INTERVAL '7 months' GROUP BY dt ) SELECT dt, dau, (SELECT COUNT(DISTINCT user_id) FROM my_table WHERE datetime_field BETWEEN dt - 7 * INTERVAL '1 day' AND dt) AS wau, (SELECT COUNT(DISTINCT user_id) FROM my_table WHERE datetime_field BETWEEN dt - 30 * INTERVAL '1 day' AND dt) AS mau, dau / NULLIF(wau, 0) AS dau_wau, wau / NULLIF(mau, 0) AS wau_mau FROM s ORDER BY dt
I get the following error:
[Amazon](500310) Invalid operation: This type of correlated subquery pattern is not supported due to internal error;
When I run it with only the DAU:WAU or only the DAU:MAU computation it works. But I also need WAU:MAU, so I have to compute all three metrics in one query. Is there a way to make it work with all metrics?
Advertisement
Answer
Correlated subqueries are exceptionally expensive to compute and Redshift limits support for these to situations where the optimizer can rewrite the logic to be processed efficiently in parallel. See this “Death by correlated subqueries” blog post for an explanation of why they are expensive.
When Redshift gives the “correlated subquery pattern is not supported” message you can typically rewrite the query to something that will run and be much faster. The following rewrite does the comparison over calendar week/month rather than using rolling date windows.
WITH data_set AS ( SELECT DATE_TRUNC('day', datetime_field) AS dt , user_id FROM my_table --May want to pin this range to calendar months WHERE datetime_field <= current_date - INTERVAL '1 month' AND datetime_field > current_date - INTERVAL '7 months' GROUP BY dt ), daily_count AS ( SELECT dt , DATE_TRUNC('week', dt) AS wk , DATE_TRUNC('month', dt) AS mth , COUNT(DISTINCT user_id) AS dau FROM data_set GROUP BY dt --Per calendar week (not rolling) ), weekly_count AS ( SELECT DATE_TRUNC('week', dt) AS wk , COUNT(DISTINCT user_id) AS wau FROM data_set GROUP BY wk --Per calendar month (not rolling) ), monthly_count AS ( SELECT DATE_TRUNC('month', dt) AS mth , COUNT(DISTINCT user_id) AS mau FROM data_set GROUP BY mth ) SELECT dt , dau , dau / NULLIF(wau, 0) AS dau_wau , wau / NULLIF(mau, 0) AS wau_mau FROM daily_count JOIN weekly_count USING (wk) JOIN monthly_count USING (mth) ORDER BY dt
Worth noting that the multiple COUNT(DISTINCT x)
here are still quite expensive. If you intend to run this analysis frequently and/or “slice and dice” the distinct counts by many other facets then I recommend using of Redshift’s HyperLogLog
functions which allow you to calculate approximate distinct counts very cheaply.