How to count monthly retained users in BigQuery?

Question

I have raw data as below. Each line is the record of a transaction by a user, together with the month in which they made it. What I want is to calculate the number of users who placed an order in a month and the number of repeat users (RETENTION) from the previous month, so that I can work out what percentage of users were retained.

Accepted Answer

One way to do it is through a self-join of the table with itself, shifted by one month. That way, we match each user and month combination with the same user's previous month to see whether it's a returning user. For example, using the 2M row public table bigquery-public-data.hacker_news.stories and a particular user, prev_month is null (we used LEFT OUTER JOIN) for 2014-02-01 because the user was not active during 2014-01-01.

We are joining on author and lagged months with:

    FROM authors AS a
    LEFT OUTER JOIN authors AS b
    ON a.author = b.author
    AND a.month = DATE_ADD(b.month, INTERVAL 1 MONTH)

Then we count a user as repeating if the previous month was not null:

    COUNT(a.author) AS num_users,
    COUNTIF(b.month IS NOT NULL) AS num_returning_users

Note that we do not use DISTINCT here, as we already grouped by author and month combinations when defining the authors CTE. You might need to take this into account for other examples.

Full query:

    WITH
      authors AS (
      SELECT
        author,
        DATE_TRUNC(DATE(time_ts), MONTH) AS month
      FROM
        `bigquery-public-data.hacker_news.stories`
      WHERE
        author IS NOT NULL
      GROUP BY 1,2)
    SELECT
      *,
      ROUND(100*SAFE_DIVIDE(num_returning_users,
          num_users),2) AS retention
    FROM (
      SELECT
        a.month,
        COUNT(a.author) AS num_users,
        COUNTIF(b.month IS NOT NULL) AS num_returning_users
      FROM
        authors AS a
      LEFT OUTER JOIN
        authors AS b
      ON
        a.author = b.author
        AND a.month = DATE_ADD(b.month, INTERVAL 1 MONTH)
      GROUP BY 1
      ORDER BY 1
      LIMIT 100)

The results were checked against a single month (2007-03-01) and are correct.

Performance is not great, but in this case we select only the fields needed for the aggregation, so the amount of scanned data is low and the execution time is not too high (~5s). An alternative is to use EXISTS() inside COUNTIF() instead of the join, but it takes longer for me (~7s).
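To check the join logic locally without a BigQuery project, here is a minimal sketch that runs a SQLite port of the inner query from Python. Everything in it is an assumption for illustration: the toy rows, the `monthly_retention` helper, and the substitutions `date(b.month, '+1 month')` for `DATE_ADD` and `SUM(b.month IS NOT NULL)` for `COUNTIF`. It is not the BigQuery query itself, only the same self-join idea on data that is already one row per user and month.

```python
import sqlite3

# Hypothetical toy data: one row per (user, first day of an active month),
# i.e. what the authors CTE would produce after GROUP BY 1,2.
ROWS = [
    ("alice", "2014-01-01"),
    ("alice", "2014-02-01"),
    ("bob", "2014-01-01"),
    ("bob", "2014-03-01"),
    ("carol", "2014-02-01"),
    ("carol", "2014-03-01"),
]

# SQLite port of the self-join. date(b.month, '+1 month') stands in for
# BigQuery's DATE_ADD, and SUM over a 0/1 comparison stands in for COUNTIF.
QUERY = """
SELECT
  a.month,
  COUNT(a.author) AS num_users,
  SUM(b.month IS NOT NULL) AS num_returning_users
FROM authors AS a
LEFT OUTER JOIN authors AS b
  ON a.author = b.author
  AND a.month = date(b.month, '+1 month')
GROUP BY a.month
ORDER BY a.month
"""


def monthly_retention(rows):
    """Return (month, num_users, num_returning_users) tuples for the rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE authors (author TEXT, month TEXT)")
    conn.executemany("INSERT INTO authors VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for month, num_users, num_returning in monthly_retention(ROWS):
        retention = round(100 * num_returning / num_users, 2)
        print(month, num_users, num_returning, retention)
```

In the toy data, alice is active in January and February, so she counts as returning in February; bob skips February, so he does not count as returning in March even though he was active in January.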

Answer