Skip to content
Advertisement

How to write a single query to run some retrospective aggregation when time window is different for every row?

I am writing some SQL queries to create a dataset for customer churn predictions based on historical service data. Some of the services date back years ago. Small percentage of them churned at some time in the past while others ended up getting renewed. Some of the attributes are based on aggregation of the services that were active when each of the services was active. For example, I want to find out how many total services were active that was under the same account when an individual account was active.

| account_no | service_no | InstallationDate | BillingStopDate | ExtractDate | Churn |
|------------|------------|------------------|-----------------|-------------|-------|
| A1         | S1         | 2018-01          | 2019-03         | 2019-02     | 1     |
| A2         | S2         | 2016-05          | 2020-04         | 2020-03     | 0     |
| A1         | S3         | 2018-07          | 2019-07         | 2019-06     | 0     |
| A1         | S4         | 2016-11          | 2021-03         | 2021-02     | 0     |
| A1         | S5         | 2018-01          | 2019-01         | 2018-12     | 1     |
| A2         | S6         | 2019-01          | 2021-08         | 2021-07     | 1     |

The total active services for service S1 and S3 under the same account A are 4 and 2 respectively because they have different ExtractDate. I am currently running SQL query with Python wrapper with account_no and extractDate as input arguments. And then loop through the whole dataset and run the queries for each row

def stats_per_account(conn, account, till_date):
    query = """select count(distinct ServiceNo) from ServicesAll 
               where AccountNo = '{0}' and 
               InstallationDate < '{1}' and
               (BillingStopDate is NULL or BillingStopDate >= '{2}')""".format(account, till_date, till_date)
    with conn.cursor() as cursor:
        cursor.execute(query)
        row = cursor.fetchone()
       
    return row

sno = []
ano = []
svc = []

for _, row in tqdm(iter_df.iterrows(), total=iter_df.shape[0]):

    cnt = stats_per_account(conn, row['AccountNo'], row['ExtractDate'])
    sno.append(row['ServiceNo'])
    ano.append(row['AccountNo'])
    svc.append(cnt[0])

Since the queries are run one after another sequentially for each row in the dataset. It is very time consuming. I wonder if there is a more efficient way of doing this. Maybe a single query for all the rows altogether?

Advertisement

Answer

Here’s a rough example you might be able to follow. I use a WITH clause to provide the set of parameters to apply, but you could also store this in a base table to use in the JOIN.

If you have an algorithmic way to generate the sets of parameters, that could be used in a derived table or CTE term to provide as many sets as you wish. We could even use recursive behavior, if your database supports it.

A list of date ranges and list of accounts could easily be used to generate sets of parameters.

The data:

SELECT * FROM ServicesAll;

+------------+------------+------------------+-----------------+-------------+-------+
| account_no | service_no | InstallationDate | BillingStopDate | ExtractDate | Churn |
+------------+------------+------------------+-----------------+-------------+-------+
| A1         | S1         | 2018-01-01       | 2019-03-01      | 2019-02-01  |     1 |
| A2         | S2         | 2016-05-01       | 2020-04-01      | 2020-03-01  |     0 |
| A1         | S3         | 2018-07-01       | 2019-07-01      | 2019-06-01  |     0 |
| A1         | S4         | 2016-11-01       | 2021-03-01      | 2021-02-01  |     0 |
| A1         | S5         | 2018-01-01       | 2019-01-01      | 2018-12-01  |     1 |
| A2         | S6         | 2019-01-01       | 2021-08-01      | 2021-07-01  |     1 |
+------------+------------+------------------+-----------------+-------------+-------+

Here’s an approach which generates lists of accounts and date ranges to derive the sets of parameters to apply, and then applies them all in the same query:

WITH RECURSIVE accounts (account_no) AS ( SELECT DISTINCT account_no FROM ServicesAll)
   , startDate (start_date)          AS ( SELECT current_date - INTERVAL '5' YEAR )
   , ranges  (start_date, end_date, n) AS (
          SELECT start_date, start_date + INTERVAL '6' MONTH, 1 FROM startDate UNION ALL
          SELECT start_date + INTERVAL '6' MONTH
               , end_date   + INTERVAL '6' MONTH, n+1 FROM ranges WHERE n < 8
     )
   , args (p0, p1, p2) AS (
          SELECT account_no, start_date, end_date
            FROM accounts, ranges
     )
SELECT p0, p1, p2, COUNT(DISTINCT service_no)
  FROM ServicesAll, args
 WHERE account_no = args.p0
   AND InstallationDate < args.p1
   AND (BillingStopDate IS NULL OR BillingStopDate >= args.p2)
 GROUP BY p0, p1, p2
;

The result:

+------+------------+------------+----------------------------+
| p0   | p1         | p2         | COUNT(DISTINCT service_no) |
+------+------------+------------+----------------------------+
| A1   | 2017-02-17 | 2017-08-17 |                          1 |
| A1   | 2017-08-17 | 2018-02-17 |                          1 |
| A1   | 2018-02-17 | 2018-08-17 |                          3 |
| A1   | 2018-08-17 | 2019-02-17 |                          3 |
| A1   | 2019-02-17 | 2019-08-17 |                          1 |
| A1   | 2019-08-17 | 2020-02-17 |                          1 |
| A1   | 2020-02-17 | 2020-08-17 |                          1 |
| A2   | 2016-08-17 | 2017-02-17 |                          1 |
| A2   | 2017-02-17 | 2017-08-17 |                          1 |
| A2   | 2017-08-17 | 2018-02-17 |                          1 |
| A2   | 2018-02-17 | 2018-08-17 |                          1 |
| A2   | 2018-08-17 | 2019-02-17 |                          1 |
| A2   | 2019-02-17 | 2019-08-17 |                          2 |
| A2   | 2019-08-17 | 2020-02-17 |                          2 |
| A2   | 2020-02-17 | 2020-08-17 |                          1 |
+------+------------+------------+----------------------------+

The following SQL provide sets of parameters in the args CTE term, and then JOINs with that CTE term in the final query expression. The key is the GROUP BY clause which produces a result row for each set of parameters.

WITH args (p0, p1, p2) AS (
          SELECT 'A1', '2018-07-02', '2018-07-02' UNION
          SELECT 'A1', '2016-11-02', '2016-11-02' UNION
          SELECT 'A2', '2016-11-02', '2016-11-02'
     )
SELECT p0, p1, p2, COUNT(DISTINCT service_no)
  FROM ServicesAll, args
 WHERE account_no = args.p0
   AND InstallationDate < args.p1
   AND (BillingStopDate IS NULL OR BillingStopDate >= args.p2)
 GROUP BY p0, p1, p2
;

Result:

+----+------------+------------+----------------------------+
| p0 | p1         | p2         | COUNT(DISTINCT service_no) |
+----+------------+------------+----------------------------+
| A1 | 2016-11-02 | 2016-11-02 |                          1 |
| A1 | 2018-07-02 | 2018-07-02 |                          4 |
| A2 | 2016-11-02 | 2016-11-02 |                          1 |
+----+------------+------------+----------------------------+
User contributions licensed under: CC BY-SA
4 People found this is helpful
Advertisement