EDIT: Previous sample data included the duplicate visits column I need that calculated in the solution.
I am trying to determine the
total_visits = total visits per website per sub_group
duplicate_visits = visits-1
duplication_rate = duplicate_visits/ total_visits
distinct_users_subgroup = distinct users per website per sub_group
distinct_users_total = distinct users per website
for the sample data below which I hope to be a suitable reprex. The table is grouped at the website
and sub_group
levels and gives the number of visits
for each user. How can a query be written to get the results below?
Data:
+---------+---------+-----------+--------+
| website | user_id | sub_group | visits |
+---------+---------+-----------+--------+
| web1 | 1 | A | 1 |
| web1 | 1 | B | 3 |
| web1 | 1 | C | 4 |
| web1 | 1 | D | 1 |
| web1 | 2 | A | 3 |
| web1 | 2 | B | 4 |
| web1 | 3 | B | 2 |
| web1 | 3 | C | 1 |
| web1 | 3 | D | 1 |
| web2 | 1 | A | 3 |
| web2 | 1 | B | 1 |
| web2 | 1 | C | 2 |
| web2 | 4 | B | 1 |
| web2 | 4 | D | 2 |
| web2 | 5 | C | 3 |
| web2 | 5 | D | 4 |
+---------+---------+-----------+--------+
Results:
+---------+-----------+--------------+------------------+----------------+-------------------------+----------------------+
| website | sub_group | total_visits | duplicate_visits | duplicate_rate | distinct_users_subgroup | distinct_users_total |
+---------+-----------+--------------+------------------+----------------+-------------------------+----------------------+
| web1 | A | 4 | 2 | 0.50 | 2 | 3 |
| web1 | B | 9 | 6 | 0.67 | 3 | 3 |
| web1 | C | 5 | 3 | 0.60 | 2 | 3 |
| web1 | D | 2 | 0 | 0.00 | 2 | 3 |
| web2 | A | 3 | 2 | 0.67 | 1 | 3 |
| web2 | B | 2 | 0 | 0.00 | 2 | 3 |
| web2 | C | 5 | 3 | 0.60 | 2 | 3 |
| web2 | D | 6 | 4 | 0.67 | 2 | 3 |
+---------+-----------+--------------+------------------+----------------+-------------------------+----------------------+
Advertisement
Answer
You can use size(collect_set(user_id)) to count distinct users and the same can be calculated as analytic function to count distinct users for website.
SELECT website,
sub_group,
Sum(visits) AS total_visits,
sum(visits-1) AS duplicate_visits,
sum(visits-1)/Sum(visits) AS duplicate_rate,
size(collect_set(user_id)) AS distinct_users_subgroup,
size(collect_set(user_id)) over (partition by website) AS distinct_users_total
FROM table
GROUP BY website, sub_group;
Or if your Hive version does not support aggregation and analytics functions together in the same subquery, calculate distinct_users_total in the subquery:
SELECT website,
sub_group,
Sum(visits) AS total_visits,
sum(visits-1) AS duplicate_visits,
sum(visits-1)/Sum(visits) AS duplicate_rate,
size(collect_set(user_id)) AS distinct_users_subgroup,
distinct_users_total
FROM (select t.*,
size(collect_set(user_id)) over (partition by website) AS distinct_users_total
from table t) s
GROUP BY website, sub_group, distinct_users_total;