Skip to content
Advertisement

Big Query data totals are wrong, suspect duplicate rows after LEFT JOIN

I’m trying to join two tables in Google Big Query and I think it’s duplicating a few rows (I’m guessing this because the totals are wrong when I then use the data in Data Studio).

To give some context, I’ve done something that probably isn’t the ideal setup but which seemed easiest for my level of knowledge. I first combined data from a bunch of different sources in Google Sheets before uploading to Big query, all of it is segmented by date but some sources are also segmented by device (they’ll be getting filtered into different reports so this isn’t a problem for me, although it might be relevant to the issue). All sources have the same columns though, just that some columns will be empty for certain sources (i.e. some not having device).

I then separately upload a set of data that I need to join with the data from a Google sheet based on the campaign name. While this works perfectly for the source that is segmented by device, the totals are all wrong (always too high) for the two sources that aren’t segmented by device.

SELECT master.*, additional.conversions, additional.campaign_c, additional.Date
FROM `project.dataset.Master` AS master
LEFT JOIN `project.dataset.Additional` AS additional
ON master.Campaign = additional.campaign_c AND master.Date = additional.Date AND master.Device = "Mobile" 
OR master.Campaign = additional.campaign_c AND master.Date = additional.Date AND master.Device IS NULL

I’m pretty sure the mistake has something to do with the last two lines, but nothing I try seems to fix it.

Edit: sample tables for clarity:

Master

  +------------+------------+------------+---------+----------+----------+
1 | Date       | Platform   | Campaign   | Device  | Metric 1 | Metric 2 |
  +------------+------------+------------+---------+----------+----------+
2 | 2019-10-19 | Platform 1 | Campaign 1 | Mobile  | 1        | 2        |
  +------------+------------+------------+---------+----------+----------+
3 | 2019-10-19 | Platform 1 | Campaign 1 | Desktop | 3        | 4        |
  +------------+------------+------------+---------+----------+----------+
4 | 2019-10-19 | Platform 1 | Campaign 2 | Tablet  | 56       | 7        |
  +------------+------------+------------+---------+----------+----------+
5 | 2019-10-19 | Platform 1 | Campaign 3 | Mobile  | 7        | 9        |
  +------------+------------+------------+---------+----------+----------+
6 | 2019-10-19 | Platform 2 | Campaign 4 | null    | 9        | 4        |
  +------------+------------+------------+---------+----------+----------+
7 | 2019-10-19 | Platform 2 | Campaign 5 | null    | 3        | 8        |
  +------------+------------+------------+---------+----------+----------+

Additional

+------------+------------+----------+
| Date       | Campaign_c | Metric 3 |
+------------+------------+----------+
| 2019-10-19 | Campaign 1 | 2        |
+------------+------------+----------+
| 2019-10-19 | Campaign 1 | 4        |
+------------+------------+----------+
| 2019-10-19 | Campaign 2 | 7        |
+------------+------------+----------+
| 2019-10-19 | Campaign 3 | 9        |
+------------+------------+----------+
| 2019-10-19 | Campaign 4 | 4        |
+------------+------------+----------+
| 2019-10-19 | Campaign 5 | 8        |
+------------+------------+----------+

In this scenario I’d ideally want to add Metric 3 to rows 2, 5, 6, and 7 but leave the other rows alone.

Advertisement

Answer

Below is for BigQuery Standard SQL

#standardSQL
SELECT r.*, Metric_3 FROM (
  SELECT ANY_VALUE(m) r, ARRAY_AGG(Metric_3) AS Metric_3
  FROM `project.dataset.master` m
  LEFT JOIN `project.dataset.additional` a
  USING (Campaign, `Date`)
  WHERE IFNULL(Device, 'Mobile') = 'Mobile' 
  GROUP BY FORMAT('%t', m)
)

if to apply to sample data in your question – result is

Row date        Platform    Campaign    Device  Metric_1    Metric_2    Metric_3     
1   2019-10-19  Platform 1  Campaign 1  Mobile  1           2           2    
                                                                        4    
2   2019-10-19  Platform 1  Campaign 3  Mobile  7           9           9    
3   2019-10-19  Platform 2  Campaign 4  null    9           4           4    
4   2019-10-19  Platform 2  Campaign 5  null    3           8           8    

I’d still like the other rows to be kept in the data (rows 3 and 4), i jsut don’t want them to get any of the additional data added to them

Use below

#standardSQL
SELECT r.*, IF(IFNULL(r.Device, 'Mobile') = 'Mobile', Metric_3, []) AS Metric_3 
FROM (
  SELECT ANY_VALUE(m) r, ARRAY_AGG(Metric_3) AS Metric_3
  FROM `project.dataset.master` m
  LEFT JOIN `project.dataset.additional` a
  USING (Campaign, `Date`)
  GROUP BY FORMAT('%t', m)
)

this produces

Row date        Platform    Campaign    Device  Metric_1    Metric_2    Metric_3     
1   2019-10-19  Platform 1  Campaign 1  Mobile  1           2           2    
                                                                        4    
2   2019-10-19  Platform 1  Campaign 1  Desktop 3           4        
3   2019-10-19  Platform 1  Campaign 2  Tablet  56          7        
4   2019-10-19  Platform 1  Campaign 3  Mobile  7           9           9    
5   2019-10-19  Platform 2  Campaign 4  null    9           4           4    
6   2019-10-19  Platform 2  Campaign 5  null    3           8           8    

getting an error because ‘Campaign in USING clause not found on right side of table’ – pretty sure this is because the right table has ‘campaign_c’ instead of ‘campaign’

If join columns names are different – use ON instead of USING as in example below

#standardSQL
SELECT r.*, IF(IFNULL(r.Device, 'Mobile') = 'Mobile', Metric_3, []) AS Metric_3 
FROM (
  SELECT ANY_VALUE(m) r, ARRAY_AGG(Metric_3) AS Metric_3
  FROM `project.dataset.master` m
  LEFT JOIN `project.dataset.additional` a
  ON m.Campaign = a.campaign_c AND m.Date = a.Date
  GROUP BY FORMAT('%t', m)
)
User contributions licensed under: CC BY-SA
9 People found this is helpful
Advertisement