I am writing some SQL queries to create a dataset for customer churn predictions based on historical service data. Some of the services date back years ago. Small percentage of them churned at some time in the past while others ended up getting renewed. Some of the attributes are based on aggregation of the services that were active when each of the services was active. For example, I want to find out how many total services were active that was under the same account when an individual account was active.
| account_no | service_no | InstallationDate | BillingStopDate | ExtractDate | Churn | |------------|------------|------------------|-----------------|-------------|-------| | A1 | S1 | 2018-01 | 2019-03 | 2019-02 | 1 | | A2 | S2 | 2016-05 | 2020-04 | 2020-03 | 0 | | A1 | S3 | 2018-07 | 2019-07 | 2019-06 | 0 | | A1 | S4 | 2016-11 | 2021-03 | 2021-02 | 0 | | A1 | S5 | 2018-01 | 2019-01 | 2018-12 | 1 | | A2 | S6 | 2019-01 | 2021-08 | 2021-07 | 1 |
The total active services for service S1
and S3
under the same account A
are 4 and 2 respectively because they have different ExtractDate
. I am currently running SQL query with Python wrapper with account_no
and extractDate
as input arguments. And then loop through the whole dataset and run the queries for each row
def stats_per_account(conn, account, till_date): query = """select count(distinct ServiceNo) from ServicesAll where AccountNo = '{0}' and InstallationDate < '{1}' and (BillingStopDate is NULL or BillingStopDate >= '{2}')""".format(account, till_date, till_date) with conn.cursor() as cursor: cursor.execute(query) row = cursor.fetchone() return row sno = [] ano = [] svc = [] for _, row in tqdm(iter_df.iterrows(), total=iter_df.shape[0]): cnt = stats_per_account(conn, row['AccountNo'], row['ExtractDate']) sno.append(row['ServiceNo']) ano.append(row['AccountNo']) svc.append(cnt[0])
Since the queries are run one after another sequentially for each row in the dataset. It is very time consuming. I wonder if there is a more efficient way of doing this. Maybe a single query for all the rows altogether?
Advertisement
Answer
Here’s a rough example you might be able to follow. I use a WITH clause
to provide the set of parameters to apply, but you could also store this in a base table to use in the JOIN
.
If you have an algorithmic way to generate the sets of parameters, that could be used in a derived table or CTE term to provide as many sets as you wish. We could even use recursive behavior, if your database supports it.
A list of date ranges and list of accounts could easily be used to generate sets of parameters.
The data:
SELECT * FROM ServicesAll; +------------+------------+------------------+-----------------+-------------+-------+ | account_no | service_no | InstallationDate | BillingStopDate | ExtractDate | Churn | +------------+------------+------------------+-----------------+-------------+-------+ | A1 | S1 | 2018-01-01 | 2019-03-01 | 2019-02-01 | 1 | | A2 | S2 | 2016-05-01 | 2020-04-01 | 2020-03-01 | 0 | | A1 | S3 | 2018-07-01 | 2019-07-01 | 2019-06-01 | 0 | | A1 | S4 | 2016-11-01 | 2021-03-01 | 2021-02-01 | 0 | | A1 | S5 | 2018-01-01 | 2019-01-01 | 2018-12-01 | 1 | | A2 | S6 | 2019-01-01 | 2021-08-01 | 2021-07-01 | 1 | +------------+------------+------------------+-----------------+-------------+-------+
Here’s an approach which generates lists of accounts and date ranges to derive the sets of parameters to apply, and then applies them all in the same query:
WITH RECURSIVE accounts (account_no) AS ( SELECT DISTINCT account_no FROM ServicesAll) , startDate (start_date) AS ( SELECT current_date - INTERVAL '5' YEAR ) , ranges (start_date, end_date, n) AS ( SELECT start_date, start_date + INTERVAL '6' MONTH, 1 FROM startDate UNION ALL SELECT start_date + INTERVAL '6' MONTH , end_date + INTERVAL '6' MONTH, n+1 FROM ranges WHERE n < 8 ) , args (p0, p1, p2) AS ( SELECT account_no, start_date, end_date FROM accounts, ranges ) SELECT p0, p1, p2, COUNT(DISTINCT service_no) FROM ServicesAll, args WHERE account_no = args.p0 AND InstallationDate < args.p1 AND (BillingStopDate IS NULL OR BillingStopDate >= args.p2) GROUP BY p0, p1, p2 ;
The result:
+------+------------+------------+----------------------------+ | p0 | p1 | p2 | COUNT(DISTINCT service_no) | +------+------------+------------+----------------------------+ | A1 | 2017-02-17 | 2017-08-17 | 1 | | A1 | 2017-08-17 | 2018-02-17 | 1 | | A1 | 2018-02-17 | 2018-08-17 | 3 | | A1 | 2018-08-17 | 2019-02-17 | 3 | | A1 | 2019-02-17 | 2019-08-17 | 1 | | A1 | 2019-08-17 | 2020-02-17 | 1 | | A1 | 2020-02-17 | 2020-08-17 | 1 | | A2 | 2016-08-17 | 2017-02-17 | 1 | | A2 | 2017-02-17 | 2017-08-17 | 1 | | A2 | 2017-08-17 | 2018-02-17 | 1 | | A2 | 2018-02-17 | 2018-08-17 | 1 | | A2 | 2018-08-17 | 2019-02-17 | 1 | | A2 | 2019-02-17 | 2019-08-17 | 2 | | A2 | 2019-08-17 | 2020-02-17 | 2 | | A2 | 2020-02-17 | 2020-08-17 | 1 | +------+------------+------------+----------------------------+
The following SQL provide sets of parameters in the args
CTE term, and then JOIN
s with that CTE term in the final query expression. The key is the GROUP BY
clause which produces a result row for each set of parameters.
WITH args (p0, p1, p2) AS ( SELECT 'A1', '2018-07-02', '2018-07-02' UNION SELECT 'A1', '2016-11-02', '2016-11-02' UNION SELECT 'A2', '2016-11-02', '2016-11-02' ) SELECT p0, p1, p2, COUNT(DISTINCT service_no) FROM ServicesAll, args WHERE account_no = args.p0 AND InstallationDate < args.p1 AND (BillingStopDate IS NULL OR BillingStopDate >= args.p2) GROUP BY p0, p1, p2 ;
Result:
+----+------------+------------+----------------------------+ | p0 | p1 | p2 | COUNT(DISTINCT service_no) | +----+------------+------------+----------------------------+ | A1 | 2016-11-02 | 2016-11-02 | 1 | | A1 | 2018-07-02 | 2018-07-02 | 4 | | A2 | 2016-11-02 | 2016-11-02 | 1 | +----+------------+------------+----------------------------+