Skip to content
Advertisement

SQL – Guarantee at least n unique users with 2 appearances each in query

I’m working with AWS Personalize and one of the service Quotas is to have “At least 1000 records containing a min of 25 unique users with at least 2 records each”, I know my raw data has those numbers but I’m trying to find a way to guarantee that those numbers will always be met, even if the query is run by someone else in the future.

The easy way out would be to just use the full dataset, but right now we are working towards a POC, so that is not really my first option. I have covered the “two records each” section by just counting the appearances, but I don’t know how to guarantee the min of 25 users.

It is important to say that my data is not shuffled in any way at the time of saving.

My query

SELECT C.productid AS ITEM_ID,
    A.userid AS USER_ID,
    A.createdon AS "TIMESTAMP",
    B.fromaddress_countryname AS "LOCATION"
FROM A AS orders
    JOIN B AS sub_orders ON orders.order_id = sub_orders.order_id
    JOIN C AS order_items ON orders.order_id = order_items.order_id
WHERE orders.userid IN (
        SELECT orders.userid
        FROM A AS ORDERS
        GROUP BY orders.userid
        HAVING count(*) > 2
    )
LIMIT 10

I use the LIMIT to just query a subset since I’m in AWS Athena.

Advertisement

Answer

The IN query is not very efficient since it needs to compare each row with all (worst case) the elements of the subquery to find a match.

It would be easier to start by storing all users with at least 2 records in a common table expression (CTE) and do a join to select them.

To ensure at least 25 distinct users you will need a window function to count the unique users since the first row and add a condition on that count. Since you can’t use a window function in the where clause, you will need a second CTE and a final query that queries it.

For example:

with users as (
    select userid as good_users 
    from orders 
    group by 1 
    having count(*) > 1 -- this condition ensures at least 2 records
),
cte as (
    SELECT C.productid AS ITEM_ID,
    A.userid AS USER_ID,
    A.createdon AS "TIMESTAMP",
    B.fromaddress_countryname AS "LOCATION",
    count(distinct A.userid) over (rows between unbounded preceding and current row) as n_distinct_users
FROM A AS orders
    JOIN B AS sub_orders ON orders.order_id = sub_orders.order_id
    JOIN C AS order_items ON orders.order_id = order_items.order_id
    JOIN users on A.userid = users.userid --> ensure only users with 2 records
order by A.userid -- needed for the window function 
)
select * from cte where n_distinct_users < 26

sorting over userid in cte will ensure that at least 2 records per userid will appear in the results.

User contributions licensed under: CC BY-SA
8 People found this is helpful
Advertisement