I have the following Orders table in PostgreSQL Database
Order_date | Revenue -------------------- 2020-10-01 | 10 2020-10-02 | 5 2020-10-03 | 10 2020-10-04 | 5 2020-10-05 | 10
I need to return the cumulative sum of revenue for every order_date from last 2 days including the revenue for that order date. I am using the following query
SELECT o1.order_date, Sum(o2.revenue) as Revenue_sum FROM orders o1 JOIN orders o2 ON o1.order_date BETWEEN o2.order_date AND o2.order_date + 2 GROUP BY o1.order_date ORDER BY o1.order_date
and it returns following result
Order_date | Revenue_sum ------------------------ 2020-10-01 | 10 2020-10-02 | 15 2020-10-03 | 25 2020-10-04 | 20 2020-10-05 | 25
As per logical order of the query processing following steps will be performed
- The ‘JOIN’ will first form a Cartesian product by performing a cross join, so every row from o1 will join to every row of o2.
- Then the qualifier condition in ‘ON’ clause will select only the rows that satisfy the condition
- From the selected rows, the revenue will be summed up by each group from the GROUP BY clause (o1.order_date)
According to the execution steps above, I am trying to visualize the processing steps of my query.
Step 1 would be a Cross join as shown below
Step 2 would be qualification based on the condition in ‘ON’. I am having trouble in visualizing what rows will be selected from step 1 based on the condition specified in ‘JOIN’ and how
Step 3 would then group the rows and sum the revenue
1. Cartesian Product
o1.order_date | o2.order_date | o2.revenue ------------------------------------------- 2020-10-01 | 2020-10-01 | 10 2020-10-01 | 2020-10-02 | 5 2020-10-01 | 2020-10-03 | 10 2020-10-01 | 2020-10-04 | 5 2020-10-01 | 2020-10-05 | 10 2020-10-02 | 2020-10-01 | 10 2020-10-02 | 2020-10-02 | 5 2020-10-02 | 2020-10-03 | 10 2020-10-02 | 2020-10-04 | 5 2020-10-02 | 2020-10-05 | 10 2020-10-03 | 2020-10-01 | 10 2020-10-03 | 2020-10-02 | 5 2020-10-03 | 2020-10-03 | 10 2020-10-03 | 2020-10-04 | 5 2020-10-03 | 2020-10-05 | 10 2020-10-04 | 2020-10-01 | 10 2020-10-04 | 2020-10-02 | 5 2020-10-04 | 2020-10-03 | 10 2020-10-04 | 2020-10-04 | 5 2020-10-04 | 2020-10-05 | 10 2020-10-05 | 2020-10-01 | 10 2020-10-05 | 2020-10-02 | 5 2020-10-05 | 2020-10-03 | 10 2020-10-05 | 2020-10-04 | 5 2020-10-05 | 2020-10-05 | 10
2. Qualification based on ‘ON’ condition. What rows will be selected from step 1 above?
Advertisement
Answer
Once the cartesian product has been performed, every value of r1.date will be compared with the range of r2.date defined based on the condition you have provided (o1.order_date BETWEEN o2.order_date AND o2.order_date + 2). Remember for every value of o2.order_date, this date range will be redefined.
Example: When o1.order_date=’2020-10-01′:
- It will compare if o1.order_date lies within o2.order_date range between ‘2020-10-01’ and ‘2020-10-03’, the condition evaluates to True, and this row is selected from the cartesian product.
- Next time, o2.order_date range becomes ‘2020-10-02’ and ‘2020-10-04′, now order_date=’2020-10-01′ doesn’t lie within this range and hence this condition evaluates to false. Therefore, only 1 row (mentioned in previous step) from the cartesian product is selected for o1.order_date=’2020-10-01’.
The above steps are repeated unless all the rows in your cartesian product have been evaluated, and only the ones that satisfy the given date range condition will be selected to go in the group by clause for the aggregation of revenue.
Based on the above steps, following rows will be selected to go to the group-by clause:
o1.order_date | o2.order_date | o2.revenue ------------------------------------------- 2020-10-01 | 2020-10-01 | 10 2020-10-02 | 2020-10-01 | 10 2020-10-02 | 2020-10-02 | 5 2020-10-03 | 2020-10-01 | 10 2020-10-03 | 2020-10-02 | 5 2020-10-03 | 2020-10-03 | 10 ...