so I’m having problems with creating tables, join, or at least I thought, when joining multiple tables resulting in duplicate result.
I have one big table which contains all the data. Also, I have Multiple sub-table(partitioned table) that was created from the big table and contains some information of the big table. I want the query which extract from the sub-tables to have the same result.
This is my query of how I created the 3 sub-tables. first table – traffic
create table `traffic` as select fullVisitorId, visitId, userId, max(case when c.index = 10 then c.value else null end) as universal_id, max(case when c.index = 8 then c.value else null end) as ps_project_id, max(case when c.index = 39 then c.value else null end) as ps_campaign_id, visitStartTime, date, visitNumber, adContent, campaign, campaignCode, keyword, medium, referralPath, source, channelGrouping, campaignId from( select fullVisitorId, visitId, userId, visitStartTime, date, visitNumber, trafficSource.adContent, trafficSource.campaign, trafficSource.campaignCode, trafficSource.keyword, trafficSource.medium, trafficSource.referralPath, trafficSource.source, channelGrouping, trafficSource.adwordsClickInfo.campaignId, h.customdimensions from `ga_sessions_*` left join unnest (hits) as h WHERE _TABLE_SUFFIX BETWEEN '20211117' and '20211117' ) left join unnest (customdimensions) as c group by fullVisitorId, visitId, userId, visitStartTime, date, visitNumber, adContent, campaign, campaignCode, keyword, medium, referralPath, source, channelGrouping, campaignId
Second table – hits page
create table `hits_page` as select fullVisitorId, visitId, userId, max(case when c.index = 10 then c.value else null end) as universal_id, max(case when c.index = 8 then c.value else null end) as ps_project_id, max(case when c.index = 39 then c.value else null end) as ps_campaign_id, visitStartTime, date, visitNumber, pagePath, pagePathLevel1, pagePathLevel2, pagePathLevel3, pagePathLevel4, hostname, pageTitle, searchKeyword, searchCategory from( select fullVisitorId, visitId, userId, visitStartTime, date, visitNumber, h.page.pagePath, h.page.pagePathLevel1, h.page.pagePathLevel2, h.page.pagePathLevel3, h.page.pagePathLevel4, h.page.hostname, h.page.pageTitle, h.page.searchKeyword, h.page.searchCategory, h.customdimensions from `ga_sessions_*` left join unnest (hits) as h WHERE _TABLE_SUFFIX BETWEEN '20211117' and '20211117' ) left join unnest (customdimensions) as c group by fullVisitorId, visitId, userId, visitStartTime, date, visitNumber, pagePath, pagePathLevel1, pagePathLevel2, pagePathLevel3, pagePathLevel4, hostname, pageTitle, searchKeyword, searchCategory
The third table – hits
create table `hits` as select fullVisitorId, visitId, userId, max(case when c.index = 10 then c.value else null end) as universal_id, max(case when c.index = 8 then c.value else null end) as ps_project_id, max(case when c.index = 39 then c.value else null end) as ps_campaign_id, visitStartTime, date, visitNumber, type, hitNumber, hour, minute, isEntrance, isExit, isInteraction, time, referer from( select fullVisitorId, visitId, userId, visitStartTime, date, visitNumber, h.type, h.hitNumber, h.hour, h.minute, h.isEntrance, h.isExit, h.isInteraction, h.time, h.referer, h.customdimensions from `ga_sessions_*` left join unnest (hits) as h WHERE _TABLE_SUFFIX BETWEEN '20211117' and '20211117' ) left join unnest (customdimensions) as c group by fullVisitorId, visitId, userId, visitStartTime, date, visitNumber, type, hitNumber, hour, minute, isEntrance, isExit, isInteraction, time, referer
Now, the query which I used is the following:
select c.pagepath, a.medium, a.source, count(*) as count_pageviews from `traffic` as a join `hits_page` as c on a.fullVisitorId = c.fullVisitorId and a.visitId = c.visitId and a.visitStartTime = c.visitStartTime and a.date = c.date and a.visitNumber = c.visitNumber join `hits` as d on a.fullVisitorId = d.fullVisitorId and a.visitId = d.visitId and a.visitStartTime = d.visitStartTime and a.date = d.date and a.visitNumber = d.visitNumber where pagepath = "/sellland" and type = "PAGE" group by 1,2,3 order by count_pageviews desc
As you can see, I joined all columns which appear on 3 of these sub-table fullVisitorId
,visitId
,visitStartTime
,date
,visitNumber
except userId
, pscampaignid
, universal_Id
, ps_project_Id
which has no data and show no result when connected. This return the following result:
Row pagepath medium source count_pageviews 1 /sellland none direct 835 2 /sellland facebook_ad facebook 541 3 /sellland cpc google 390 4 /sellland referral lm.facebook.com 225
I have made updates to how I created those 3 tables, the results look closer now. I have a feeling that it was due to how I created the table, I will focus on that now.:
Row pagepath medium source count_pageviews 1 /sellland facebook_ad facebook 388 2 /sellland cpc google 252 3 /sellland none direct 182 4 /sellland referral lm.facebook.com 83
But, it should return something like this
Row pagepath medium source count_pageviews 1 /sellland facebook_ad facebook 357 2 /sellland cpc google 199 3 /sellland (none) (direct) 110 4 /sellland referral lm.facebook.com 48
I’m having trouble finding out the wrong in this problem. I am not sure if it is because the way I unnest
or the join
all the columns which appear on all of these 3 tables itself or others. Thank you in advance for all your inputs.
Advertisement
Answer
Alright guys, looks like I found an answer. I’m not sure if this is an absolute answer but it works for me so far, feel free to correct me if im wrong. So, I believe the problem here is not joining all the join keys causing the data to duplicate. I was trying to query the hits level data but did not join hit level key. Therefore, I did connect one more key which is hit.hitNumber to connect all hit level row to the surface. To anybody unfamiliar with this, I think in hit level there are arrays(data within data). It looks something like this. Each row has more than 1 data, so we need to unwrap the data to be usable. Hence, by unnesting and connecting said data with hit.hitNumber, I able to get the correct number compare to google analytics.
By unnesting, it should look like this.
Here is my code that got it working:
select c.pagepath, a.medium, a.source, count(*) as count_pageviews from `traffic` as a join `hits_page as c on a.fullVisitorId = c.fullVisitorId and a.visitId = c.visitId and a.visitStartTime = c.visitStartTime and a.date = c.date and a.visitNumber = c.visitNumber and a.hitNumber = c.hitNumber join `hits` as d on a.fullVisitorId = d.fullVisitorId and a.visitId = d.visitId and a.visitStartTime = d.visitStartTime and a.date = d.date and a.visitNumber = d.visitNumber and a.hitNumber = d.hitNumber where c.pagepath = "/sellland" and d.type = "PAGE" group by 1,2,3 order by count_pageviews desc