I have a table of phone calls consisting of user_id, call_date, city, where city can be either A or B. It looks like this:
user_id | call_date | city |
---|---|---|
1 | 2021-01-01 | A |
1 | 2021-01-02 | B |
1 | 2021-01-03 | B |
1 | 2021-01-05 | B |
1 | 2021-01-10 | A |
1 | 2021-01-12 | B |
1 | 2021-01-16 | A |
2 | 2021-01-17 | A |
2 | 2021-01-20 | B |
2 | 2021-01-22 | B |
2 | 2021-01-23 | A |
2 | 2021-01-24 | B |
2 | 2021-01-26 | B |
2 | 2021-01-30 | A |
For this table, we need to select for each user all the periods when he was in city B. These periods are counted in days and start when the first call is made from city B, and end as soon as the next call is made from city A. So for user_id = 1 fist period starts on 2021-01-02 and ands on 2021-01-10. There can be several such periods for each user.
The result should be the following table:
user_id | period_1 | period_2 |
---|---|---|
1 | 8 | 4 |
2 | 3 | 6 |
Can you please tell me how I can limit the periods according to the condition of the problem, and then calculate the datediff within each period? Thank you
Advertisement
Answer
This is a typical gaps and islands problem. You need to group consecutive rows first, then find the first call_date
of the next group. Sample code for Postgres is below, the same may be adapted to another DBMS by applying appropriate function to calculate the difference in days.
with a (user_id, call_date, city) as ( select * from ( values ('1', date '2021-01-01', 'A'), ('1', date '2021-01-02', 'B'), ('1', date '2021-01-03', 'B'), ('1', date '2021-01-05', 'B'), ('1', date '2021-01-10', 'A'), ('1', date '2021-01-12', 'B'), ('1', date '2021-01-16', 'A'), ('2', date '2021-01-17', 'A'), ('2', date '2021-01-20', 'B'), ('2', date '2021-01-22', 'B'), ('2', date '2021-01-23', 'A'), ('2', date '2021-01-24', 'B'), ('2', date '2021-01-26', 'B'), ('2', date '2021-01-30', 'A') ) as t ) , grp as ( /*Identify groups*/ select a.*, /*This is a grouping of consecutive rows: they will have the same difference between two row_numbers while the more detailed row_number changes, which means the attribute had changed. */ dense_rank() over( partition by user_id order by call_date asc ) - dense_rank() over( partition by user_id, city order by call_date asc ) as grp, /*Get next call date*/ lead(call_date, 1, call_date) over( partition by user_id order by call_date asc ) as next_dt from a ) select user_id, city, min(call_date) as dt_from, max(next_dt) as dt_to, max(next_dt) - min(call_date) as diff from grp where city = 'B' group by user_id, grp, city order by 1, 3user_id | city | dt_from | dt_to | diff :------ | :--- | :--------- | :--------- | ---: 1 | B | 2021-01-02 | 2021-01-10 | 8 1 | B | 2021-01-12 | 2021-01-16 | 4 2 | B | 2021-01-20 | 2021-01-23 | 3 2 | B | 2021-01-24 | 2021-01-30 | 6
db<>fiddle here