Skip to content
Advertisement

How can I create a column which computes only the change of other column on redshift?

I have this dataset:

product   customer    date                        value     buyer_position
A         123455      2020-01-01 00:01:01         100       1
A         123456      2020-01-02 00:02:01         100       2
A         523455      2020-01-02 00:02:05         100       NULL
A         323455      2020-01-03 00:02:07         100       NULL
A         423455      2020-01-03 00:09:01         100       3
B         100455      2020-01-01 00:03:01         100       1
B         999445      2020-01-01 00:04:01         100       NULL
B         122225      2020-01-01 00:04:05         100       2
B         993848      2020-01-01 10:04:05         100       3
B         133225      2020-01-01 11:04:05         100       NULL
B         144225      2020-01-01 12:04:05         100       4

The dataset has the product the company sells and the customers who saw the product. A customer can see more than one product, but the combination product + customer doesn’t have any repetition. I want to get how many people bought the product before the customer sees it.

This would be the perfect output:

product   customer    date                        value     buyer_position     people_before
A         123455      2020-01-01 00:01:01         100       1                  0
A         123456      2020-01-02 00:02:01         100       2                  1
A         523455      2020-01-02 00:02:05         100       NULL               2
A         323455      2020-01-03 00:02:07         100       NULL               2
A         423455      2020-01-03 00:09:01         100       3                  2
B         100455      2020-01-01 00:03:01         100       1                  0
B         999445      2020-01-01 00:04:01         100       NULL               1
B         122225      2020-01-01 00:04:05         100       2                  1
B         993848      2020-01-01 10:04:05         100       3                  2
B         133225      2020-01-01 11:04:05         100       NULL               3
B         144225      2020-01-01 12:04:05         100       4                  3

As you can see, when the customer 122225 saw the product he wanted, two people have already bought it. In the case of customer 323455, two people have already bought the product A.

I think I should use some window function, like lag(). But lag() function won’t get this “cumulative” information. So I’m kind of lost here.

Advertisement

Answer

This looks like a window count of non-null values of buyer_position over the preceding rows:

select t.*,
    coalesce(count(buyer_position) over(
        partition by product
        order by date
        rows between unbounded preceding and 1 preceding
    ), 0) as people_before
from mytable t
User contributions licensed under: CC BY-SA
9 People found this is helpful
Advertisement