Skip to content
Advertisement

How to join all keys correctly / How to identify join keys

so I’m having problems with creating tables, join, or at least I thought, when joining multiple tables resulting in duplicate result.

I have one big table which contains all the data. Also, I have Multiple sub-table(partitioned table) that was created from the big table and contains some information of the big table. I want the query which extract from the sub-tables to have the same result.

This is my query of how I created the 3 sub-tables. first table – traffic

 create table `traffic` as

select 
fullVisitorId,
visitId,
userId,
max(case when c.index = 10 then c.value else null end) as universal_id,
max(case when c.index = 8 then c.value else null end) as ps_project_id,
max(case when c.index = 39 then c.value else null end) as ps_campaign_id,
visitStartTime,
date,
visitNumber,
adContent,
campaign,
campaignCode,
keyword,
medium,
referralPath,
source,
channelGrouping,
campaignId

from(

select 
fullVisitorId, 
visitId, 
userId, 
visitStartTime, 
date, 
visitNumber, 
trafficSource.adContent,
trafficSource.campaign,
trafficSource.campaignCode,
trafficSource.keyword,
trafficSource.medium,
trafficSource.referralPath,
trafficSource.source,
channelGrouping,
trafficSource.adwordsClickInfo.campaignId,
h.customdimensions

from `ga_sessions_*`

left join unnest (hits) as h

WHERE _TABLE_SUFFIX BETWEEN '20211117' and '20211117' 
)
left join unnest (customdimensions) as c

group by 
fullVisitorId, 
visitId, 
userId, 
visitStartTime, 
date, 
visitNumber, 
adContent,
campaign,
campaignCode,
keyword,
medium,
referralPath,
source,
channelGrouping,
campaignId 

Second table – hits page

    create table `hits_page` as


select 
fullVisitorId,
visitId,
userId,
max(case when c.index = 10 then c.value else null end) as universal_id,
max(case when c.index = 8 then c.value else null end) as ps_project_id,
max(case when c.index = 39 then c.value else null end) as ps_campaign_id,
visitStartTime,
date,
visitNumber,
pagePath,
pagePathLevel1,
pagePathLevel2,
pagePathLevel3,
pagePathLevel4,
hostname,
pageTitle,
searchKeyword,
searchCategory

from(

select 
fullVisitorId, 
visitId, 
userId, 
visitStartTime, 
date, 
visitNumber, 
h.page.pagePath,
h.page.pagePathLevel1,
h.page.pagePathLevel2,
h.page.pagePathLevel3,
h.page.pagePathLevel4,
h.page.hostname,
h.page.pageTitle,
h.page.searchKeyword,
h.page.searchCategory,
h.customdimensions

from `ga_sessions_*`

left join unnest (hits) as h

WHERE _TABLE_SUFFIX BETWEEN '20211117' and '20211117' 
)
left join unnest (customdimensions) as c

group by 
fullVisitorId, 
visitId, 
userId, 
visitStartTime, 
date, 
visitNumber, 
pagePath,
pagePathLevel1,
pagePathLevel2,
pagePathLevel3,
pagePathLevel4,
hostname,
pageTitle,
searchKeyword,
searchCategory

The third table – hits

create table `hits` as

select 
fullVisitorId,
visitId,
userId,
max(case when c.index = 10 then c.value else null end) as universal_id,
max(case when c.index = 8 then c.value else null end) as ps_project_id,
max(case when c.index = 39 then c.value else null end) as ps_campaign_id,
visitStartTime,
date,
visitNumber,
type,
hitNumber,
hour,
minute,
isEntrance,
isExit,
isInteraction,
time,
referer

from(

select 
fullVisitorId,
visitId,
userId,
visitStartTime,
date,
visitNumber,
h.type,
h.hitNumber,
h.hour,
h.minute,
h.isEntrance,
h.isExit,
h.isInteraction,
h.time,
h.referer,
h.customdimensions

from `ga_sessions_*`

left join unnest (hits) as h

WHERE _TABLE_SUFFIX BETWEEN '20211117' and '20211117' 
)
left join unnest (customdimensions) as c

group by 
fullVisitorId, 
visitId, 
userId, 
visitStartTime, 
date, 
visitNumber, 
type,
hitNumber,
hour,
minute,
isEntrance,
isExit,
isInteraction,
time,
referer

Now, the query which I used is the following:

select 
    c.pagepath,
    a.medium,
    a.source,
    count(*) as count_pageviews
 
from `traffic` as a

 join `hits_page` as c
 on a.fullVisitorId = c.fullVisitorId 
 and a.visitId = c.visitId 
 and a.visitStartTime = c.visitStartTime 
 and a.date = c.date
 and a.visitNumber = c.visitNumber

 join `hits` as d
 on a.fullVisitorId = d.fullVisitorId 
 and a.visitId = d.visitId 
 and a.visitStartTime = d.visitStartTime 
 and a.date = d.date
 and a.visitNumber = d.visitNumber
 
where pagepath = "/sellland"
and type = "PAGE"
 
group by 1,2,3
 
order by count_pageviews desc

As you can see, I joined all columns which appear on 3 of these sub-table fullVisitorId,visitId,visitStartTime,date,visitNumber except userId, pscampaignid, universal_Id, ps_project_Id which has no data and show no result when connected. This return the following result:

Row pagepath        medium      source  count_pageviews 
1   /sellland        none       direct        835
2   /sellland    facebook_ad  facebook        541
3   /sellland        cpc        google        390
4   /sellland      referral  lm.facebook.com  225

I have made updates to how I created those 3 tables, the results look closer now. I have a feeling that it was due to how I created the table, I will focus on that now.:

Row pagepath        medium      source     count_pageviews  
1   /sellland      facebook_ad  facebook         388
2   /sellland         cpc        google          252
3   /sellland         none       direct          182
4   /sellland      referral  lm.facebook.com     83

But, it should return something like this

Row pagepath       medium      source      count_pageviews  
1   /sellland   facebook_ad   facebook         357
2   /sellland       cpc        google          199
3   /sellland     (none)      (direct)         110
4   /sellland     referral   lm.facebook.com    48

I’m having trouble finding out the wrong in this problem. I am not sure if it is because the way I unnest or the join all the columns which appear on all of these 3 tables itself or others. Thank you in advance for all your inputs.

Advertisement

Answer

Alright guys, looks like I found an answer. I’m not sure if this is an absolute answer but it works for me so far, feel free to correct me if im wrong. So, I believe the problem here is not joining all the join keys causing the data to duplicate. I was trying to query the hits level data but did not join hit level key. Therefore, I did connect one more key which is hit.hitNumber to connect all hit level row to the surface. To anybody unfamiliar with this, I think in hit level there are arrays(data within data). It looks something like this. Each row has more than 1 data, so we need to unwrap the data to be usable. Hence, by unnesting and connecting said data with hit.hitNumber, I able to get the correct number compare to google analytics.

array image

By unnesting, it should look like this.

enter image description here

Here is my code that got it working:

select
   c.pagepath,
   a.medium,
   a.source,
   count(*) as count_pageviews
 
from `traffic` as a
 
join `hits_page as c
on a.fullVisitorId = c.fullVisitorId
and a.visitId = c.visitId
and a.visitStartTime = c.visitStartTime
and a.date = c.date
and a.visitNumber = c.visitNumber
and a.hitNumber = c.hitNumber
 
join `hits` as d
on a.fullVisitorId = d.fullVisitorId
and a.visitId = d.visitId
and a.visitStartTime = d.visitStartTime
and a.date = d.date
and a.visitNumber = d.visitNumber
and a.hitNumber = d.hitNumber
 
where  c.pagepath = "/sellland"
and d.type = "PAGE"
 
group by 1,2,3
 
order by count_pageviews desc

Reference: https://towardsdatascience.com/explore-arrays-and-structs-for-better-performance-in-google-bigquery-8978fb00a5bc

6 People found this is helpful
Advertisement