CREATE TABLE sales (
    id SERIAL PRIMARY KEY,
    event_date DATE,
    customer VARCHAR,
    orderID VARCHAR,
    sales_volume DECIMAL
);
INSERT INTO sales
(event_date, customer, orderID, sales_volume)
VALUES
('2020-01-08', 'Customer_A', 'Order_001', '130'),
('2020-01-12', 'Customer_A', 'Order_002', '120'),
('2020-01-22', 'Customer_A', 'Order_003', '115'),
('2020-01-22', 'Customer_C', 'Order_001', '300'),
('2020-01-23', 'Customer_C', 'Order_002', '500'),
('2020-04-08', 'Customer_B', 'Order_001', '325'),
('2020-04-12', 'Customer_B', 'Order_002', '875'),
('2020-04-15', 'Customer_B', 'Order_003', '910'),
('2020-04-20', 'Customer_B', 'Order_004', '723'),
('2020-04-30', 'Customer_C', 'Order_003', '665'),
('2020-06-01', 'Customer_B', 'Order_005', '982'),
('2020-06-15', 'Customer_B', 'Order_006', '100'),
('2020-06-19', 'Customer_C', 'Order_004', '250'),
('2020-06-20', 'Customer_C', 'Order_005', '322'),
('2020-06-25', 'Customer_A', 'Order_004', '445');
Expected Result:
customer | orderid | event_date | sales_volume
-------------|---------------|--------------------|----------------------
Customer_A | Order_001 | 2020-01-08 | 130
Customer_A | Order_003 | 2020-01-22 | 115
Customer_C | Order_002 | 2020-01-23 | 500
Customer_C | Order_001 | 2020-01-22 | 300
-------------|---------------|--------------------|------------------------
Customer_B | Order_002 | 2020-04-12 | 875
Customer_B | Order_003 | 2020-04-15 | 910
Customer_C | Order_003 | 2020-04-30 | 665
-------------|---------------|--------------------|------------------------
Customer_A | Order_004 | 2020-06-25 | 445
Customer_B | Order_005 | 2020-06-01 | 982
Customer_B | Order_006 | 2020-06-15 | 100
Customer_C | Order_005 | 2020-06-20 | 322
Customer_C | Order_004 | 2020-06-19 | 250
I have a huge database and need to extract some data from it for a case study.
The problem is that I need to extract a full year of data because I want to conduct a monthly analysis in the case study. Therefore, I cannot limit the extract by date or with LIMIT.
Thus, my idea to solve this issue is a query that randomly extracts at most two orders per customer per month.
Do you have any idea if this is possible?
If yes, how do I need to modify the below query?
SELECT
    s.customer,
    s.orderID,
    s.event_date,
    SUM(s.sales_volume) AS sales_volume
FROM sales s
GROUP BY 1,2,3
ORDER BY 1,2,3;
Answer
Thus, my idea to solve this issue is a query that randomly extracts at most two orders per customer per month.
You can use:
select s.*
from (select s.*,
             row_number() over (partition by customer, date_trunc('month', event_date)
                                order by random()) as seqnum
      from sales s
     ) s
where seqnum <= 2;
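The query above numbers raw rows, so it picks two orders only if each order occupies exactly one row in sales. If an order can span several rows (which the SUM in the question's query suggests), one possible variant is to aggregate to one row per order first and rank afterwards. This is only a sketch under that assumption; the aliases o and ranked are made up for illustration, and the final ORDER BY is there only to group the output by month as in the expected result.

```sql
-- Sketch: aggregate to one row per order, then pick at most two
-- random orders per customer per month (PostgreSQL).
SELECT customer, orderID, event_date, sales_volume
FROM (
    SELECT o.*,
           row_number() OVER (
               PARTITION BY o.customer, date_trunc('month', o.event_date)
               ORDER BY random()          -- random rank within each group
           ) AS seqnum
    FROM (
        -- one row per order, as in the question's original query
        SELECT customer, orderID, event_date,
               SUM(sales_volume) AS sales_volume
        FROM sales
        GROUP BY customer, orderID, event_date
    ) o
) ranked
WHERE seqnum <= 2
ORDER BY date_trunc('month', event_date), customer, event_date;
```

Because the ranking uses random(), each run returns a different selection, so the rows will generally not match the expected result line for line; only the limit of two orders per customer per month is guaranteed.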
To be honest, though, for analytic purposes I would prefer to take a random sample of customers, say 1% or 5%, and all of their transactions.
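A customer-level sample along those lines could look roughly like the following. It is a minimal sketch, not a tuned solution: the 5% threshold and the name sampled_customers are illustrative assumptions, and because each customer is kept independently with probability 0.05, the share of customers actually returned is only approximately 5%.

```sql
-- Sketch: keep roughly 5% of customers, with all of their rows (PostgreSQL).
WITH sampled_customers AS (
    SELECT customer
    FROM (SELECT DISTINCT customer FROM sales) c
    WHERE random() < 0.05        -- evaluated once per distinct customer
)
SELECT s.*
FROM sales s
JOIN sampled_customers sc USING (customer)
ORDER BY s.customer, s.event_date;
```

On a table as small as the sample data, with only three customers, this will often return no rows at all; it only becomes meaningful on the full database.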