I’m working with AWS Personalize and one of the service Quotas is to have “At least 1000 records containing a min of 25 unique users with at least 2 records each”, I know my raw data has those numbers but I’m trying to find a way to guarantee that those numbers will always be met, even if the query is run by someone else in the future.
The easy way out would be to just use the full dataset, but right now we are working towards a POC, so that is not really my first option. I have covered the “two records each” section by just counting the appearances, but I don’t know how to guarantee the min of 25 users.
It is important to say that my data is not shuffled in any way at the time of saving.
My query
SELECT C.productid AS ITEM_ID, A.userid AS USER_ID, A.createdon AS "TIMESTAMP", B.fromaddress_countryname AS "LOCATION" FROM A AS orders JOIN B AS sub_orders ON orders.order_id = sub_orders.order_id JOIN C AS order_items ON orders.order_id = order_items.order_id WHERE orders.userid IN ( SELECT orders.userid FROM A AS ORDERS GROUP BY orders.userid HAVING count(*) > 2 ) LIMIT 10
I use the LIMIT to just query a subset since I’m in AWS Athena.
Advertisement
Answer
The IN
query is not very efficient since it needs to compare each row with all (worst case) the elements of the subquery to find a match.
It would be easier to start by storing all users with at least 2 records in a common table expression (CTE) and do a join to select them.
To ensure at least 25 distinct users you will need a window function to count the unique users since the first row and add a condition on that count. Since you can’t use a window function in the where
clause, you will need a second CTE and a final query that queries it.
For example:
with users as ( select userid as good_users from orders group by 1 having count(*) > 1 -- this condition ensures at least 2 records ), cte as ( SELECT C.productid AS ITEM_ID, A.userid AS USER_ID, A.createdon AS "TIMESTAMP", B.fromaddress_countryname AS "LOCATION", count(distinct A.userid) over (rows between unbounded preceding and current row) as n_distinct_users FROM A AS orders JOIN B AS sub_orders ON orders.order_id = sub_orders.order_id JOIN C AS order_items ON orders.order_id = order_items.order_id JOIN users on A.userid = users.userid --> ensure only users with 2 records order by A.userid -- needed for the window function ) select * from cte where n_distinct_users < 26
sorting over userid
in cte
will ensure that at least 2 records per userid will appear in the results.