Skip to content
Advertisement

Is it possible to remove duplicates from the result for the data set?

I have two following tables, dim_customers and fact_daily_customer_shipments:

dim_customers 
+-------------+-----------------------+---------------------+
| customer_id | membership_start_date | membership_end_date |
+-------------+-----------------------+---------------------+
|         114 | 2015-01-01            | 2015-02-15          |
|         116 | 2015-02-01            | 2015-03-15          |
|         120 | 2015-02-15            | 2015-04-01          |
|         221 | 2015-03-15            | 2015-10-01          |
|         120 | 2015-05-15            | 2015-07-01          |
+-------------+-----------------------+---------------------+ 
fact_daily_customer_shipments 
+-------------+------------+-----------------------+----------+ 
| shipment_id | ship_date  |           customer_id | quantity |
+-------------+------------+-----------------------+----------+
|           1 | 2015-02-13 |                   114 |        2 |
|           2 | 2015-03-01 |                   116 |        1 |
|           3 | 2015-03-01 |                   120 |        6 |
|           4 | 2015-03-01 |                   321 |       10 |
|           5 | 2015-06-01 |                   116 |        1 |
|           6 | 2015-10-01 |                   120 |        3 |
+-------------+------------+-----------------------+----------+

Join them to get a table of the following schema:

fact_shipments_by_membership_status 
+-----------+-------------------+----------+
| ship_date | membership_status | quantity |
+-----------+-------------------+----------+ 

Example results:

+------------+-----------+-----+
| ship_date  | is_member | sum |
+------------+-----------+-----+
| 2015-02-13 | Y         |   2 |
| 2015-03-01 | N         |  10 |
| 2015-03-01 | Y         |   7 |
| 2015-06-01 | N         |   1 |
| 2015-10-01 | N         |   3 |
+------------+-----------+-----+

SQL I came up with,

select dc.ship_date, 
       case when dc.ship_date between dc.membership_start_date
                              and dc.membership_end_date then 'Y'
            else 'N'
       end as is_member, 
       sum(fc.quantity)
from dim_customers dc
    inner join fact_daily_customer_shipments fc on dc.customer_id = fc.customer_id

This SQL doesn’t make sense because I see duplicates in both the tables. Joining the table on key attributes customer_id is yielding duplicates.

Any thoughts what would the correct SQL approach would be?

Answer

The reason you are having issues with duplication is that you have two entries in the dim_customers table with the same customer_id value (but different membership dates). What this means is that you need to change the JOIN condition to include the membership_dates. By then changing to a LEFT JOIN, we can determine whether a customer was a member at the time by whether the customer_id value from the JOIN is NULL. So the query you should use is:

select fc.ship_date, 
       case when dc.customer_id is null then 'Y' else 'N' end as is_member, 
       sum(fc.quantity)
from fact_daily_customer_shipments fc
left join dim_customers dc on dc.customer_id = fc.customer_id and fc.ship_date between dc.membership_start_date and dc.membership_end_date
group by fc.ship_date, is_member

Output:

ship_date   is_member   sum(fc.quantity)
2015-02-13  N           2
2015-03-01  N           7
2015-03-01  Y           10
2015-06-01  Y           1
2015-10-01  Y           3

SQLFiddle Demo