Skip to content
Advertisement

How to calculate weekly and monthly appearances of distinct string values in SQL Google Big Query?

I am new to SQL, and I have a dataset with date values and domain column. Domain column only contains values ‘personal’ and ‘business’. What I am trying to accomplish is to calculate weekly and monthly rolling count for each domain type.

What I thought of doing is creating 2 separate columns – is_personal and is_business – with values 1 for rows where domain_type is of appropriate value. For example, if domain_type is ‘personal’, then there would be value of 1 in is_personal column. Otherwise, 1 would be in the row of is_business. Then, I was going to calculate the rolling sum.

However, I wonder if I can avoid creating additional columns and perform weekly and monthly rolling count straight from the string column in Google Big Query.

What I tried so far is “grouping by” date column using DATE_TRUNC(CAST(created_at AS date), ISOWEEK) to ‘roll-up’ dates by week periods. When I try any rolling functions on the domain_type column, I get many errors. Some are related to trying functions not recognizable by Google Big Query, some related to the fact that I am working with the string column, and so on.

The ultimate goal I am trying to achieve is calculate weekly and monthly rolling count of ‘business’ and ‘personal’ domain types. Please let me know if I can provide additional information that will help. Thank you!

Current look of data:

       Date          domain_type

     2017-10-02      personal
     2017-10-03      business
     2017-10-04      personal
     2017-10-05      business
     2017-10-06      personal
     2017-10-07      business
     2017-10-08      personal 
     2017-10-09      business
     2017-10-10      personal
     2017-10-11      business
     2017-10-12      personal
     2017-10-13      business
     2017-10-14      personal
     2017-10-15      business

Let’s say that during the week of 2017-10-02, a total of 10 users signed up with personal email address and a total of 20 signed up with business email address. During the week of 2017-10-09, a total of 25 signed up with personal email and 30 signed up with business email. Hence, for 2 weeks, rolling count for personal domain type is 35, and rolling count for business domain type is 50.

The output I am trying to achieve:

Date          domain_type  rolling_count_for_week
2017-10-02    personal           10
2017-10-02    business           20
2017-10-09    personal           35
2017-10-09    business           50

Advertisement

Answer

Below is for BigQuery Standard SQL

#standardSQL
SELECT Date, domain_type, 
  SUM(IF(domain_type = 'personal', personal, business)) AS rolling_count_for_week
FROM (
  SELECT Date, type AS domain_type, 
    SUM(IF(domain_type = 'personal' AND domain_type = type, 1, 0)) OVER(ORDER BY Date) personal, 
    SUM(IF(domain_type = 'business' AND domain_type = type, 1, 0)) OVER(ORDER BY Date) business
  FROM `project.dataset.table`,
  UNNEST(['personal', 'business']) type
)
WHERE EXTRACT(DAYOFWEEK FROM Date) = 2
GROUP BY Date, domain_type

if to apply to sample data from your question – output is

Row Date            domain_type rolling_count_for_week   
1   2017-10-02  personal    1    
2   2017-10-02  business    0    
3   2017-10-09  personal    4    
4   2017-10-09  business    4      

What if, for one particular week, there is no data on dow=2 but there is data for the other days?

Good point, somehow I assumed that at least one entry per day will exist :o)

See version below that does not have this dependency

#standardSQL
WITH calendar_type AS (
  SELECT Date, type
  FROM (
    SELECT MIN(Date) min_date, MAX(Date) max_date
    FROM `project.dataset.table`
  ), UNNEST(GENERATE_DATE_ARRAY(min_date, max_date)) Date,
  UNNEST(['personal', 'business']) type
)
SELECT Date, domain_type, 
  SUM(IF(domain_type = 'personal', personal, business)) AS rolling_count_for_week
FROM (
  SELECT c.Date, type AS domain_type, 
    SUM(IF(domain_type = 'personal' AND domain_type = type, 1, 0)) OVER(ORDER BY c.Date) personal, 
    SUM(IF(domain_type = 'business' AND domain_type = type, 1, 0)) OVER(ORDER BY c.Date) business
  FROM calendar_type c
  LEFT JOIN `project.dataset.table` t
  ON c.Date = t.Date AND c.type = t. domain_type 
)
WHERE EXTRACT(DAYOFWEEK FROM Date) = 2
GROUP BY Date, domain_type
User contributions licensed under: CC BY-SA
10 People found this is helpful
Advertisement