I have this dataset:
product customer date value buyer_position A 123455 2020-01-01 00:01:01 100 1 A 123456 2020-01-02 00:02:01 100 2 A 523455 2020-01-02 00:02:05 100 NULL A 323455 2020-01-03 00:02:07 100 NULL A 423455 2020-01-03 00:09:01 100 3 B 100455 2020-01-01 00:03:01 100 1 B 999445 2020-01-01 00:04:01 100 NULL B 122225 2020-01-01 00:04:05 100 2 B 993848 2020-01-01 10:04:05 100 3 B 133225 2020-01-01 11:04:05 100 NULL B 144225 2020-01-01 12:04:05 100 4
The dataset has the product the company sells and the customers who saw the product. A customer can see more than one product, but the combination product + customer doesn’t have any repetition. I want to get how many people bought the product before the customer sees it.
This would be the perfect output:
product customer date value buyer_position people_before A 123455 2020-01-01 00:01:01 100 1 0 A 123456 2020-01-02 00:02:01 100 2 1 A 523455 2020-01-02 00:02:05 100 NULL 2 A 323455 2020-01-03 00:02:07 100 NULL 2 A 423455 2020-01-03 00:09:01 100 3 2 B 100455 2020-01-01 00:03:01 100 1 0 B 999445 2020-01-01 00:04:01 100 NULL 1 B 122225 2020-01-01 00:04:05 100 2 1 B 993848 2020-01-01 10:04:05 100 3 2 B 133225 2020-01-01 11:04:05 100 NULL 3 B 144225 2020-01-01 12:04:05 100 4 3
As you can see, when the customer 122225 saw the product he wanted, two people have already bought it. In the case of customer 323455, two people have already bought the product A.
I think I should use some window function, like lag(). But lag() function won’t get this “cumulative” information. So I’m kind of lost here.
Advertisement
Answer
This looks like a window count of non-null
values of buyer_position
over the preceding rows:
select t.*, coalesce(count(buyer_position) over( partition by product order by date rows between unbounded preceding and 1 preceding ), 0) as people_before from mytable t