Skip to content
Advertisement

determine duplication rate per group

EDIT: Previous sample data included the duplicate visits column I need that calculated in the solution.

I am trying to determine the

total_visits = total visits per website per sub_group

duplicate_visits = visits-1

duplication_rate = duplicate_visits/ total_visits

distinct_users_subgroup = distinct users per website per sub_group

distinct_users_total = distinct users per website

for the sample data below which I hope to be a suitable reprex. The table is grouped at the website and sub_group levels and gives the number of visits for each user. How can a query be written to get the results below?

Data:

Results:

Advertisement

Answer

You can use size(collect_set(user_id)) to count distinct users and the same can be calculated as analytic function to count distinct users for website.

Or if your Hive version does not support aggregation and analytics functions together in the same subquery, calculate distinct_users_total in the subquery:

User contributions licensed under: CC BY-SA
8 People found this is helpful
Advertisement