Skip to content
Advertisement

SQL – Guarantee at least n unique users with 2 appearances each in query

I’m working with AWS Personalize and one of the service Quotas is to have “At least 1000 records containing a min of 25 unique users with at least 2 records each”, I know my raw data has those numbers but I’m trying to find a way to guarantee that those numbers will always be met, even if the query is run by someone else in the future.

The easy way out would be to just use the full dataset, but right now we are working towards a POC, so that is not really my first option. I have covered the “two records each” section by just counting the appearances, but I don’t know how to guarantee the min of 25 users.

It is important to say that my data is not shuffled in any way at the time of saving.

My query

I use the LIMIT to just query a subset since I’m in AWS Athena.

Advertisement

Answer

The IN query is not very efficient since it needs to compare each row with all (worst case) the elements of the subquery to find a match.

It would be easier to start by storing all users with at least 2 records in a common table expression (CTE) and do a join to select them.

To ensure at least 25 distinct users you will need a window function to count the unique users since the first row and add a condition on that count. Since you can’t use a window function in the where clause, you will need a second CTE and a final query that queries it.

For example:

sorting over userid in cte will ensure that at least 2 records per userid will appear in the results.

User contributions licensed under: CC BY-SA
8 People found this is helpful
Advertisement