I am new to SQL, and I have a dataset with date values and domain column. Domain column only contains values ‘personal’ and ‘business’. What I am trying to accomplish is to calculate weekly and monthly rolling count for each domain type.
What I thought of doing is creating 2 separate columns – is_personal and is_business – with values 1 for rows where domain_type is of appropriate value. For example, if domain_type is ‘personal’, then there would be value of 1 in is_personal column. Otherwise, 1 would be in the row of is_business. Then, I was going to calculate the rolling sum.
However, I wonder if I can avoid creating additional columns and perform weekly and monthly rolling count straight from the string column in Google Big Query.
What I tried so far is “grouping by” date column using DATE_TRUNC(CAST(created_at AS date), ISOWEEK)
to ‘roll-up’ dates by week periods. When I try any rolling functions on the domain_type column, I get many errors. Some are related to trying functions not recognizable by Google Big Query, some related to the fact that I am working with the string column, and so on.
The ultimate goal I am trying to achieve is calculate weekly and monthly rolling count of ‘business’ and ‘personal’ domain types. Please let me know if I can provide additional information that will help. Thank you!
Current look of data:
Date domain_type 2017-10-02 personal 2017-10-03 business 2017-10-04 personal 2017-10-05 business 2017-10-06 personal 2017-10-07 business 2017-10-08 personal 2017-10-09 business 2017-10-10 personal 2017-10-11 business 2017-10-12 personal 2017-10-13 business 2017-10-14 personal 2017-10-15 business
Let’s say that during the week of 2017-10-02, a total of 10 users signed up with personal email address and a total of 20 signed up with business email address. During the week of 2017-10-09, a total of 25 signed up with personal email and 30 signed up with business email. Hence, for 2 weeks, rolling count for personal domain type is 35, and rolling count for business domain type is 50.
The output I am trying to achieve:
Date domain_type rolling_count_for_week 2017-10-02 personal 10 2017-10-02 business 20 2017-10-09 personal 35 2017-10-09 business 50
Advertisement
Answer
Below is for BigQuery Standard SQL
#standardSQL SELECT Date, domain_type, SUM(IF(domain_type = 'personal', personal, business)) AS rolling_count_for_week FROM ( SELECT Date, type AS domain_type, SUM(IF(domain_type = 'personal' AND domain_type = type, 1, 0)) OVER(ORDER BY Date) personal, SUM(IF(domain_type = 'business' AND domain_type = type, 1, 0)) OVER(ORDER BY Date) business FROM `project.dataset.table`, UNNEST(['personal', 'business']) type ) WHERE EXTRACT(DAYOFWEEK FROM Date) = 2 GROUP BY Date, domain_type
if to apply to sample data from your question – output is
Row Date domain_type rolling_count_for_week 1 2017-10-02 personal 1 2 2017-10-02 business 0 3 2017-10-09 personal 4 4 2017-10-09 business 4
What if, for one particular week, there is no data on dow=2 but there is data for the other days?
Good point, somehow I assumed that at least one entry per day will exist :o)
See version below that does not have this dependency
#standardSQL WITH calendar_type AS ( SELECT Date, type FROM ( SELECT MIN(Date) min_date, MAX(Date) max_date FROM `project.dataset.table` ), UNNEST(GENERATE_DATE_ARRAY(min_date, max_date)) Date, UNNEST(['personal', 'business']) type ) SELECT Date, domain_type, SUM(IF(domain_type = 'personal', personal, business)) AS rolling_count_for_week FROM ( SELECT c.Date, type AS domain_type, SUM(IF(domain_type = 'personal' AND domain_type = type, 1, 0)) OVER(ORDER BY c.Date) personal, SUM(IF(domain_type = 'business' AND domain_type = type, 1, 0)) OVER(ORDER BY c.Date) business FROM calendar_type c LEFT JOIN `project.dataset.table` t ON c.Date = t.Date AND c.type = t. domain_type ) WHERE EXTRACT(DAYOFWEEK FROM Date) = 2 GROUP BY Date, domain_type