I have two following tables, dim_customers
and fact_daily_customer_shipments
:
dim_customers +-------------+-----------------------+---------------------+ | customer_id | membership_start_date | membership_end_date | +-------------+-----------------------+---------------------+ | 114 | 2015-01-01 | 2015-02-15 | | 116 | 2015-02-01 | 2015-03-15 | | 120 | 2015-02-15 | 2015-04-01 | | 221 | 2015-03-15 | 2015-10-01 | | 120 | 2015-05-15 | 2015-07-01 | +-------------+-----------------------+---------------------+ fact_daily_customer_shipments +-------------+------------+-----------------------+----------+ | shipment_id | ship_date | customer_id | quantity | +-------------+------------+-----------------------+----------+ | 1 | 2015-02-13 | 114 | 2 | | 2 | 2015-03-01 | 116 | 1 | | 3 | 2015-03-01 | 120 | 6 | | 4 | 2015-03-01 | 321 | 10 | | 5 | 2015-06-01 | 116 | 1 | | 6 | 2015-10-01 | 120 | 3 | +-------------+------------+-----------------------+----------+
Join them to get a table of the following schema:
fact_shipments_by_membership_status +-----------+-------------------+----------+ | ship_date | membership_status | quantity | +-----------+-------------------+----------+
Example results:
+------------+-----------+-----+ | ship_date | is_member | sum | +------------+-----------+-----+ | 2015-02-13 | Y | 2 | | 2015-03-01 | N | 10 | | 2015-03-01 | Y | 7 | | 2015-06-01 | N | 1 | | 2015-10-01 | N | 3 | +------------+-----------+-----+
SQL I came up with,
select dc.ship_date, case when dc.ship_date between dc.membership_start_date and dc.membership_end_date then 'Y' else 'N' end as is_member, sum(fc.quantity) from dim_customers dc inner join fact_daily_customer_shipments fc on dc.customer_id = fc.customer_id
This SQL doesn’t make sense because I see duplicates in both the tables. Joining the table on key attributes customer_id
is yielding duplicates.
Any thoughts what would the correct SQL approach would be?
Advertisement
Answer
The reason you are having issues with duplication is that you have two entries in the dim_customers
table with the same customer_id
value (but different membership dates). What this means is that you need to change the JOIN
condition to include the membership_dates. By then changing to a LEFT JOIN
, we can determine whether a customer was a member at the time by whether the customer_id
value from the JOIN is NULL
. So the query you should use is:
select fc.ship_date, case when dc.customer_id is null then 'Y' else 'N' end as is_member, sum(fc.quantity) from fact_daily_customer_shipments fc left join dim_customers dc on dc.customer_id = fc.customer_id and fc.ship_date between dc.membership_start_date and dc.membership_end_date group by fc.ship_date, is_member
Output:
ship_date is_member sum(fc.quantity) 2015-02-13 N 2 2015-03-01 N 7 2015-03-01 Y 10 2015-06-01 Y 1 2015-10-01 Y 3