Skip to content
Advertisement

determine duplication rate per group

EDIT: Previous sample data included the duplicate visits column I need that calculated in the solution.

I am trying to determine the

total_visits = total visits per website per sub_group

duplicate_visits = visits-1

duplication_rate = duplicate_visits/ total_visits

distinct_users_subgroup = distinct users per website per sub_group

distinct_users_total = distinct users per website

for the sample data below which I hope to be a suitable reprex. The table is grouped at the website and sub_group levels and gives the number of visits for each user. How can a query be written to get the results below?

Data:

+---------+---------+-----------+--------+
| website | user_id | sub_group | visits |
+---------+---------+-----------+--------+
| web1    |       1 | A         |      1 |
| web1    |       1 | B         |      3 |
| web1    |       1 | C         |      4 |
| web1    |       1 | D         |      1 |
| web1    |       2 | A         |      3 |
| web1    |       2 | B         |      4 |
| web1    |       3 | B         |      2 |
| web1    |       3 | C         |      1 |
| web1    |       3 | D         |      1 |
| web2    |       1 | A         |      3 |
| web2    |       1 | B         |      1 |
| web2    |       1 | C         |      2 |
| web2    |       4 | B         |      1 |
| web2    |       4 | D         |      2 |
| web2    |       5 | C         |      3 |
| web2    |       5 | D         |      4 |
+---------+---------+-----------+--------+

Results:

+---------+-----------+--------------+------------------+----------------+-------------------------+----------------------+
| website | sub_group | total_visits | duplicate_visits | duplicate_rate | distinct_users_subgroup | distinct_users_total |
+---------+-----------+--------------+------------------+----------------+-------------------------+----------------------+
| web1    | A         |            4 |                2 |           0.50 |                       2 |                    3 |
| web1    | B         |            9 |                6 |           0.67 |                       3 |                    3 |
| web1    | C         |            5 |                3 |           0.60 |                       2 |                    3 |
| web1    | D         |            2 |                0 |           0.00 |                       2 |                    3 |
| web2    | A         |            3 |                2 |           0.67 |                       1 |                    3 |
| web2    | B         |            2 |                0 |           0.00 |                       2 |                    3 |
| web2    | C         |            5 |                3 |           0.60 |                       2 |                    3 |
| web2    | D         |            6 |                4 |           0.67 |                       2 |                    3 |
+---------+-----------+--------------+------------------+----------------+-------------------------+----------------------+

Advertisement

Answer

You can use size(collect_set(user_id)) to count distinct users and the same can be calculated as analytic function to count distinct users for website.

SELECT website, 
       sub_group, 
       Sum(visits)                AS total_visits, 
       sum(visits-1)              AS duplicate_visits, 
       sum(visits-1)/Sum(visits)  AS duplicate_rate,
       size(collect_set(user_id)) AS distinct_users_subgroup,
       size(collect_set(user_id)) over (partition by website) AS distinct_users_total 
FROM table
GROUP BY website, sub_group;

Or if your Hive version does not support aggregation and analytics functions together in the same subquery, calculate distinct_users_total in the subquery:

SELECT website, 
       sub_group, 
       Sum(visits)                AS total_visits, 
       sum(visits-1)              AS duplicate_visits, 
       sum(visits-1)/Sum(visits)  AS duplicate_rate,
       size(collect_set(user_id)) AS distinct_users_subgroup,
       distinct_users_total 
FROM (select t.*,
             size(collect_set(user_id)) over (partition by website) AS distinct_users_total
        from table t) s
GROUP BY website, sub_group, distinct_users_total;
User contributions licensed under: CC BY-SA
8 People found this is helpful
Advertisement