EDIT: Previous sample data included the duplicate visits column I need that calculated in the solution.
I am trying to determine the
total_visits = total visits per website per sub_group
duplicate_visits = visits-1
duplication_rate = duplicate_visits/ total_visits
distinct_users_subgroup = distinct users per website per sub_group
distinct_users_total = distinct users per website
for the sample data below which I hope to be a suitable reprex. The table is grouped at the website
and sub_group
levels and gives the number of visits
for each user. How can a query be written to get the results below?
Data:
+---------+---------+-----------+--------+ | website | user_id | sub_group | visits | +---------+---------+-----------+--------+ | web1 | 1 | A | 1 | | web1 | 1 | B | 3 | | web1 | 1 | C | 4 | | web1 | 1 | D | 1 | | web1 | 2 | A | 3 | | web1 | 2 | B | 4 | | web1 | 3 | B | 2 | | web1 | 3 | C | 1 | | web1 | 3 | D | 1 | | web2 | 1 | A | 3 | | web2 | 1 | B | 1 | | web2 | 1 | C | 2 | | web2 | 4 | B | 1 | | web2 | 4 | D | 2 | | web2 | 5 | C | 3 | | web2 | 5 | D | 4 | +---------+---------+-----------+--------+
Results:
+---------+-----------+--------------+------------------+----------------+-------------------------+----------------------+ | website | sub_group | total_visits | duplicate_visits | duplicate_rate | distinct_users_subgroup | distinct_users_total | +---------+-----------+--------------+------------------+----------------+-------------------------+----------------------+ | web1 | A | 4 | 2 | 0.50 | 2 | 3 | | web1 | B | 9 | 6 | 0.67 | 3 | 3 | | web1 | C | 5 | 3 | 0.60 | 2 | 3 | | web1 | D | 2 | 0 | 0.00 | 2 | 3 | | web2 | A | 3 | 2 | 0.67 | 1 | 3 | | web2 | B | 2 | 0 | 0.00 | 2 | 3 | | web2 | C | 5 | 3 | 0.60 | 2 | 3 | | web2 | D | 6 | 4 | 0.67 | 2 | 3 | +---------+-----------+--------------+------------------+----------------+-------------------------+----------------------+
Advertisement
Answer
You can use size(collect_set(user_id)) to count distinct users and the same can be calculated as analytic function to count distinct users for website.
SELECT website, sub_group, Sum(visits) AS total_visits, sum(visits-1) AS duplicate_visits, sum(visits-1)/Sum(visits) AS duplicate_rate, size(collect_set(user_id)) AS distinct_users_subgroup, size(collect_set(user_id)) over (partition by website) AS distinct_users_total FROM table GROUP BY website, sub_group;
Or if your Hive version does not support aggregation and analytics functions together in the same subquery, calculate distinct_users_total in the subquery:
SELECT website, sub_group, Sum(visits) AS total_visits, sum(visits-1) AS duplicate_visits, sum(visits-1)/Sum(visits) AS duplicate_rate, size(collect_set(user_id)) AS distinct_users_subgroup, distinct_users_total FROM (select t.*, size(collect_set(user_id)) over (partition by website) AS distinct_users_total from table t) s GROUP BY website, sub_group, distinct_users_total;