Skip to content
Advertisement

avoid duplication when joining tables without unique id using foreign keys

I’m facing this issue where I don’t really know how to handle duplicate rows when joining two tables.

I have two tables I’d like to join

Value_x table :

ID          Campaign      Value_x
foo         fb            1
foo         twitter       3
bar         fb            2
foobar      ads           2
foo         organic       3

Value_y table:

ID          Campaign      Value_y
foo         organic       42
foobar      organic       17
foo         ads           12
foo         twitter       23
bar         twitter       35

Because of the Campaign column, ID are not unique in each table and for example, left join table output looks like this:

ID          t1.Campaign  t2.Campaign      Value_x  Value_y
foo         fb           organic          1        42
foo         fb           ads              1        12

Ideally I’d like to have the following output

ID          t1.Campaign  t2.Campaign      Value_x  Value_y
foo         fb           null             1        null
foo         twitter      twitter          3        23
foo         organic      organic          3        42
foo         null         ads              null     12

How would you handle a situation like this? I read something about foreign keys… I’m not familiar with this concept how can it help in this context? Or is there a way to perform a join at the campaign level?

Advertisement

Answer

Below is for BigQuery Standard SQL

#standardSQL
SELECT ID,
  x.Campaign AS Campaign_x,
  y.Campaign AS Campaign_y,
  Value_x,
  Value_y
FROM `project.dataset.tableX` x
FULL OUTER JOIN `project.dataset.tableY` y
USING(ID, Campaign)

You can test, play with above using sample data from your questions as in below example

#standardSQL
WITH `project.dataset.tableX` AS (
  SELECT 'foo' ID, 'fb' Campaign, 1 Value_x UNION ALL
  SELECT 'foo', 'twitter', 3 UNION ALL
  SELECT 'bar', 'fb', 2 UNION ALL
  SELECT 'foobar', 'ads', 2 UNION ALL
  SELECT 'foo', 'organic', 3 
), `project.dataset.tableY` AS (
  SELECT 'foo' ID, 'organic' Campaign, 42 Value_y UNION ALL
  SELECT 'foobar', 'organic', 17 UNION ALL
  SELECT 'foo', 'ads', 12 UNION ALL
  SELECT 'foo', 'twitter', 23 UNION ALL
  SELECT 'bar', 'twitter', 35 
)
SELECT ID,
  x.Campaign AS Campaign_x,
  y.Campaign AS Campaign_y,
  Value_x,
  Value_y
FROM `project.dataset.tableX` x
FULL OUTER JOIN `project.dataset.tableY` y
USING(ID, Campaign)
-- ORDER BY ID    

with result

Row ID      Campaign_x  Campaign_y  Value_x Value_y  
1   bar     fb          null        2       null     
2   bar     null        twitter     null    35   
3   foo     fb          null        1       null     
4   foo     twitter     twitter     3       23   
5   foo     organic     organic     3       42   
6   foo     null        ads         null    12   
7   foobar  ads         null        2       null     
8   foobar  null        organic     null    17     
User contributions licensed under: CC BY-SA
1 People found this is helpful
Advertisement