Skip to content
Advertisement

Merge left on load data from BigQuery

I have an input table: input and one or more maptables, where input contains data for multiple identifiers and dates stacked under each other. The schemas are as follows:

#input
Id: string (might contain empty values)
Id2: string (might contain empty values)
Id3: string (might contain empty values)
Date: datetime
Value: number

#maptable_1
Id: string
Id2: string
Target_1: string

#maptable_2
Id3: string
Target_2: string

What I do now is that I run a pipeline that for each date/(id, id2, id3) combination loads the data from input and applies a left merge in python against one or more maptables (both as a DataFrame). I then stream the results to a third table named output with the schema:

#output
Id: string
Id2: string
Id3: string 
Date: datetime
Value: number
Target_1: string (from maptable_1)
Target_2: string (from maptable_2)
Target_x: ...

Now I was thinking that this is not really efficient. If I change one value from a maptable, I have to redo all the pipelines for each date/(id, id2, id3) combination.

  • Therefore I was wondering if its possible to apply directly a left merge when loading the data? How would such a Query look like?

  • In the case of multiple maptables and target columns, would it also be beneficial to do the same? Would the query not become too complex or unreadable, in particular as the id columns are not the same?

Advertisement

Answer

How would such a Query look like?

Below is for BigQuery Standard SQL

INSERT `project.dataset.output`  
SELECT *
FROM `project.dataset.input` i
LEFT JOIN `project.dataset.maptable_1` m1 USING(id, id2)
LEFT JOIN `project.dataset.maptable_2` m2 USING(id3)  

In the case of multiple maptables and target columns …

If all your map tables are same/similar to two maps in your question – in this case it is just extra LEFT JOIN for each extra map

User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement