so I’m having problems with creating tables, join, or at least I thought, when joining multiple tables resulting in duplicate result.
I have one big table which contains all the data. Also, I have Multiple sub-table(partitioned table) that was created from the big table and contains some information of the big table. I want the query which extract from the sub-tables to have the same result.
This is my query of how I created the 3 sub-tables. first table – traffic
create table `traffic` as
select
fullVisitorId,
visitId,
userId,
max(case when c.index = 10 then c.value else null end) as universal_id,
max(case when c.index = 8 then c.value else null end) as ps_project_id,
max(case when c.index = 39 then c.value else null end) as ps_campaign_id,
visitStartTime,
date,
visitNumber,
adContent,
campaign,
campaignCode,
keyword,
medium,
referralPath,
source,
channelGrouping,
campaignId
from(
select
fullVisitorId,
visitId,
userId,
visitStartTime,
date,
visitNumber,
trafficSource.adContent,
trafficSource.campaign,
trafficSource.campaignCode,
trafficSource.keyword,
trafficSource.medium,
trafficSource.referralPath,
trafficSource.source,
channelGrouping,
trafficSource.adwordsClickInfo.campaignId,
h.customdimensions
from `ga_sessions_*`
left join unnest (hits) as h
WHERE _TABLE_SUFFIX BETWEEN '20211117' and '20211117'
)
left join unnest (customdimensions) as c
group by
fullVisitorId,
visitId,
userId,
visitStartTime,
date,
visitNumber,
adContent,
campaign,
campaignCode,
keyword,
medium,
referralPath,
source,
channelGrouping,
campaignId
Second table – hits page
create table `hits_page` as
select
fullVisitorId,
visitId,
userId,
max(case when c.index = 10 then c.value else null end) as universal_id,
max(case when c.index = 8 then c.value else null end) as ps_project_id,
max(case when c.index = 39 then c.value else null end) as ps_campaign_id,
visitStartTime,
date,
visitNumber,
pagePath,
pagePathLevel1,
pagePathLevel2,
pagePathLevel3,
pagePathLevel4,
hostname,
pageTitle,
searchKeyword,
searchCategory
from(
select
fullVisitorId,
visitId,
userId,
visitStartTime,
date,
visitNumber,
h.page.pagePath,
h.page.pagePathLevel1,
h.page.pagePathLevel2,
h.page.pagePathLevel3,
h.page.pagePathLevel4,
h.page.hostname,
h.page.pageTitle,
h.page.searchKeyword,
h.page.searchCategory,
h.customdimensions
from `ga_sessions_*`
left join unnest (hits) as h
WHERE _TABLE_SUFFIX BETWEEN '20211117' and '20211117'
)
left join unnest (customdimensions) as c
group by
fullVisitorId,
visitId,
userId,
visitStartTime,
date,
visitNumber,
pagePath,
pagePathLevel1,
pagePathLevel2,
pagePathLevel3,
pagePathLevel4,
hostname,
pageTitle,
searchKeyword,
searchCategory
The third table – hits
create table `hits` as
select
fullVisitorId,
visitId,
userId,
max(case when c.index = 10 then c.value else null end) as universal_id,
max(case when c.index = 8 then c.value else null end) as ps_project_id,
max(case when c.index = 39 then c.value else null end) as ps_campaign_id,
visitStartTime,
date,
visitNumber,
type,
hitNumber,
hour,
minute,
isEntrance,
isExit,
isInteraction,
time,
referer
from(
select
fullVisitorId,
visitId,
userId,
visitStartTime,
date,
visitNumber,
h.type,
h.hitNumber,
h.hour,
h.minute,
h.isEntrance,
h.isExit,
h.isInteraction,
h.time,
h.referer,
h.customdimensions
from `ga_sessions_*`
left join unnest (hits) as h
WHERE _TABLE_SUFFIX BETWEEN '20211117' and '20211117'
)
left join unnest (customdimensions) as c
group by
fullVisitorId,
visitId,
userId,
visitStartTime,
date,
visitNumber,
type,
hitNumber,
hour,
minute,
isEntrance,
isExit,
isInteraction,
time,
referer
Now, the query which I used is the following:
select
c.pagepath,
a.medium,
a.source,
count(*) as count_pageviews
from `traffic` as a
join `hits_page` as c
on a.fullVisitorId = c.fullVisitorId
and a.visitId = c.visitId
and a.visitStartTime = c.visitStartTime
and a.date = c.date
and a.visitNumber = c.visitNumber
join `hits` as d
on a.fullVisitorId = d.fullVisitorId
and a.visitId = d.visitId
and a.visitStartTime = d.visitStartTime
and a.date = d.date
and a.visitNumber = d.visitNumber
where pagepath = "/sellland"
and type = "PAGE"
group by 1,2,3
order by count_pageviews desc
As you can see, I joined all columns which appear on 3 of these sub-table fullVisitorId
,visitId
,visitStartTime
,date
,visitNumber
except userId
, pscampaignid
, universal_Id
, ps_project_Id
which has no data and show no result when connected. This return the following result:
Row pagepath medium source count_pageviews
1 /sellland none direct 835
2 /sellland facebook_ad facebook 541
3 /sellland cpc google 390
4 /sellland referral lm.facebook.com 225
I have made updates to how I created those 3 tables, the results look closer now. I have a feeling that it was due to how I created the table, I will focus on that now.:
Row pagepath medium source count_pageviews
1 /sellland facebook_ad facebook 388
2 /sellland cpc google 252
3 /sellland none direct 182
4 /sellland referral lm.facebook.com 83
But, it should return something like this
Row pagepath medium source count_pageviews
1 /sellland facebook_ad facebook 357
2 /sellland cpc google 199
3 /sellland (none) (direct) 110
4 /sellland referral lm.facebook.com 48
I’m having trouble finding out the wrong in this problem. I am not sure if it is because the way I unnest
or the join
all the columns which appear on all of these 3 tables itself or others. Thank you in advance for all your inputs.
Advertisement
Answer
Alright guys, looks like I found an answer. I’m not sure if this is an absolute answer but it works for me so far, feel free to correct me if im wrong. So, I believe the problem here is not joining all the join keys causing the data to duplicate. I was trying to query the hits level data but did not join hit level key. Therefore, I did connect one more key which is hit.hitNumber to connect all hit level row to the surface. To anybody unfamiliar with this, I think in hit level there are arrays(data within data). It looks something like this. Each row has more than 1 data, so we need to unwrap the data to be usable. Hence, by unnesting and connecting said data with hit.hitNumber, I able to get the correct number compare to google analytics.
By unnesting, it should look like this.
Here is my code that got it working:
select
c.pagepath,
a.medium,
a.source,
count(*) as count_pageviews
from `traffic` as a
join `hits_page as c
on a.fullVisitorId = c.fullVisitorId
and a.visitId = c.visitId
and a.visitStartTime = c.visitStartTime
and a.date = c.date
and a.visitNumber = c.visitNumber
and a.hitNumber = c.hitNumber
join `hits` as d
on a.fullVisitorId = d.fullVisitorId
and a.visitId = d.visitId
and a.visitStartTime = d.visitStartTime
and a.date = d.date
and a.visitNumber = d.visitNumber
and a.hitNumber = d.hitNumber
where c.pagepath = "/sellland"
and d.type = "PAGE"
group by 1,2,3
order by count_pageviews desc