CREATE TABLE sales (
id SERIAL PRIMARY KEY,
event_date DATE,
customer VARCHAR,
orderID VARCHAR,
sales_volume DECIMAL
);
INSERT INTO sales
(event_date, customer, orderID, sales_volume)
VALUES
('2020-01-08', 'Customer_A', 'Order_001', '130'),
('2020-01-12', 'Customer_A', 'Order_002', '120'),
('2020-01-22', 'Customer_A', 'Order_003', '115'),
('2020-01-22', 'Customer_C', 'Order_001', '300'),
('2020-01-23', 'Customer_C', 'Order_002', '500'),
('2020-04-08', 'Customer_B', 'Order_001', '325'),
('2020-04-12', 'Customer_B', 'Order_002', '875'),
('2020-04-15', 'Customer_B', 'Order_003', '910'),
('2020-04-20', 'Customer_B', 'Order_004', '723'),
('2020-04-30', 'Customer_C', 'Order_003', '665'),
('2020-06-01', 'Customer_B', 'Order_005', '982'),
('2020-06-15', 'Customer_B', 'Order_006', '100'),
('2020-06-19', 'Customer_C', 'Order_004', '250'),
('2020-06-20', 'Customer_C', 'Order_005', '322'),
('2020-06-25', 'Customer_A', 'Order_004', '445');
Expected Result:
customer   | orderid   | event_date | sales_volume
-----------|-----------|------------|-------------
Customer_A | Order_001 | 2020-01-08 | 130
Customer_A | Order_003 | 2020-01-22 | 115
Customer_C | Order_002 | 2020-01-23 | 500
Customer_C | Order_001 | 2020-01-22 | 300
-----------|-----------|------------|-------------
Customer_B | Order_002 | 2020-04-12 | 875
Customer_B | Order_003 | 2020-04-15 | 910
Customer_C | Order_003 | 2020-04-30 | 665
-----------|-----------|------------|-------------
Customer_A | Order_004 | 2020-06-25 | 445
Customer_B | Order_005 | 2020-06-01 | 982
Customer_B | Order_006 | 2020-06-15 | 100
Customer_C | Order_005 | 2020-06-20 | 322
Customer_C | Order_004 | 2020-06-19 | 250
I have a huge database and need to extract some data from it for a case study.
The problem is that I need the full year of data, because I want to run a monthly analysis in the case study. Therefore, I cannot restrict the extract by date range or with LIMIT.
My idea is a query that randomly picks at most two orders per customer per month.
Is this possible?
If so, how do I need to modify the query below?
SELECT s.customer,
       s.orderID,
       s.event_date,
       SUM(s.sales_volume) AS sales_volume
FROM sales s
GROUP BY 1, 2, 3
ORDER BY 1, 2, 3;
Answer
"Thus, my idea to solve this issue is a query which extracts randomly maximal two orders per customer per month."
You can use:
select s.*
from (select s.*,
             row_number() over (partition by customer, date_trunc('month', event_date)
                                order by random()) as seqnum
      from sales s
     ) s
where seqnum <= 2;
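The question's own query sums sales_volume per customer, order and date, which suggests an order can span several rows. If that is the case (an assumption, since every order in the sample data has exactly one row), the random pick should happen after the aggregation; otherwise two rows of the same order could use up both slots. A possible sketch that wraps the original GROUP BY in the same row_number() pattern (the aliases o and ranked are just illustrative names):

```sql
-- Sketch: aggregate to one row per order first, then keep at most
-- two randomly chosen orders per customer and calendar month.
select customer, orderID, event_date, sales_volume
from (
    select o.*,
           row_number() over (
               partition by customer, date_trunc('month', event_date)
               order by random()
           ) as seqnum
    from (
        -- the aggregation from the question, with explicit column names
        select customer, orderID, event_date,
               sum(sales_volume) as sales_volume
        from sales
        group by customer, orderID, event_date
    ) o
) ranked
where seqnum <= 2
order by customer, event_date;
```

Listing the columns explicitly in the outer select also keeps the helper column seqnum out of the result. Because the ordering is random, each run can return a different pair of orders for any customer and month with more than two orders.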
To be honest, though, for analytic purposes I would prefer to take a random sample of customers (say 1% or 5%) and keep all of their transactions.
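A minimal sketch of that customer-level sampling, assuming PostgreSQL and the sales table from the question; the CTE name sampled_customers and the 0.05 threshold are illustrative choices, not part of the original answer:

```sql
-- Sketch: draw roughly 5% of the distinct customers at random,
-- then return every row belonging to the sampled customers.
with sampled_customers as (
    select customer
    from (select distinct customer from sales) c
    where random() < 0.05   -- each customer is kept with probability 0.05
)
select s.*
from sales s
join sampled_customers sc on sc.customer = s.customer
order by s.customer, s.event_date;
```

Each customer is kept independently with probability 0.05, so the sample size varies from run to run and is only approximately 5%. With the three customers in the sample data the result will usually be empty; raising the threshold (for example to 0.5) makes the effect visible on a small table.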