I have an input table: input
and one or more maptables
, where input contains data for multiple identifiers and dates stacked under each other. The schemas are as follows:
#input Id: string (might contain empty values) Id2: string (might contain empty values) Id3: string (might contain empty values) Date: datetime Value: number #maptable_1 Id: string Id2: string Target_1: string #maptable_2 Id3: string Target_2: string
What I do now is that I run a pipeline that for each date/(id, id2, id3)
combination loads the data from input
and applies a left merge in python against one or more maptables
(both as a DataFrame). I then stream the results to a third table named output
with the schema:
#output Id: string Id2: string Id3: string Date: datetime Value: number Target_1: string (from maptable_1) Target_2: string (from maptable_2) Target_x: ...
Now I was thinking that this is not really efficient. If I change one value from a maptable, I have to redo all the pipelines for each date/(id, id2, id3)
combination.
Therefore I was wondering if its possible to apply directly a left merge when loading the data? How would such a Query look like?
In the case of multiple maptables and target columns, would it also be beneficial to do the same? Would the query not become too complex or unreadable, in particular as the id columns are not the same?
Advertisement
Answer
How would such a Query look like?
Below is for BigQuery Standard SQL
INSERT `project.dataset.output` SELECT * FROM `project.dataset.input` i LEFT JOIN `project.dataset.maptable_1` m1 USING(id, id2) LEFT JOIN `project.dataset.maptable_2` m2 USING(id3)
In the case of multiple maptables and target columns …
If all your map tables are same/similar to two maps in your question – in this case it is just extra LEFT JOIN for each extra map