Skip to content
Advertisement

Randomly select orders per customer from database

DB-Fiddle

CREATE TABLE sales (
    id SERIAL PRIMARY KEY,
    event_date DATE,
    customer VARCHAR,
    orderID VARCHAR,
    sales_volume DECIMAL
);

INSERT INTO sales
(event_date, customer, orderID, sales_volume)
VALUES 
('2020-01-08', 'Customer_A', 'Order_001', '130'),
('2020-01-12', 'Customer_A', 'Order_002', '120'),
('2020-01-22', 'Customer_A', 'Order_003', '115'),
('2020-01-22', 'Customer_C', 'Order_001', '300'),
('2020-01-23', 'Customer_C', 'Order_002', '500'),

('2020-04-08', 'Customer_B', 'Order_001', '325'),
('2020-04-12', 'Customer_B', 'Order_002', '875'),
('2020-04-15', 'Customer_B', 'Order_003', '910'),
('2020-04-20', 'Customer_B', 'Order_004', '723'),
('2020-04-30', 'Customer_C', 'Order_003', '665'),

('2020-06-01', 'Customer_B', 'Order_005', '982'),
('2020-06-15', 'Customer_B', 'Order_006', '100'),
('2020-06-19', 'Customer_C', 'Order_004', '250'),
('2020-06-20', 'Customer_C', 'Order_005', '322'),
('2020-06-25', 'Customer_A', 'Order_004', '445');

Exptected Result:

customer     |   orderid     |      event_date    |     sales_volume
-------------|---------------|--------------------|----------------------
Customer_A   |   Order_001   |      2020-01-08    |       130      
Customer_A   |   Order_003   |      2020-01-22    |       115
Customer_C   |   Order_002   |      2020-01-23    |       500
Customer_C   |   Order_001   |      2020-01-22    |       300
-------------|---------------|--------------------|------------------------
Customer_B   |   Order_002   |      2020-04-12    |       875
Customer_B   |   Order_003   |      2020-04-15    |       910
Customer_C   |   Order_003   |      2020-04-30    |       665
-------------|---------------|--------------------|------------------------
Customer_A   |   Order_004   |      2020-06-25    |       445
Customer_B   |   Order_005   |      2020-06-01    |       982
Customer_B   |   Order_006   |      2020-06-15    |       100
Customer_C   |   Order_005   |      2020-06-20    |       322
Customer_C   |   Order_004   |      2020-06-19    |       250

I have a huge database and need to extract some data from it for a case study.
The problem is that I need to extract the full year of the data because I want to be able to conduct a monthly analysis in the case study. Therefore, I can not limit the extract with dates or LIMIT.

Thus, my idea to solve this issue is a query which extracts randomly maximal two orders per customer per month.

Do you have any idea if this is possible?
If yes, how do I need to modify the below query?

SELECT
s.customer,
s.orderID,
s.event_date,
SUM(s.sales_volume) AS sales_volume
FROM sales s
GROUP BY 1,2,3
ORDER BY 1,2,3;

Advertisement

Answer

Thus, my idea to solve this issue is a query which extracts randomly maximal two orders per customer per month.

You can use:

select s.*
from (select s.*,
             row_number() over (partition by customer, date_trunc('month', event_date) order by random()) as seqnum
      from sales s
     ) s
where seqnum <= 2;

To be honest, though, for analytic purposes, I would prefer to take a random sample of customers — say 1% or 5% — and all the transactions for them.

User contributions licensed under: CC BY-SA
6 People found this is helpful
Advertisement