I am trying to count how many users are observed in each combination of 3 consecutive days. Each of the 3 intermediate tables (t0, t1, t2) has 2 columns: uid (a unique ID) and d0 (or d1 or d2, respectively), which is 1 and indicates that the user was observed on that day.
The following query:
select d0, d1, d2, count(*) as user_count
from (select uid, 1 as d0 from my_table where day=5 and uid is not Null group by uid) as t0
full outer join (select uid, 1 as d1 from my_table where day=6 and uid is not Null group by uid) as t1
    on t0.uid = t1.uid
full outer join (select uid, 1 as d2 from my_table where day=7 and uid is not Null group by uid) as t2
    on t0.uid = t2.uid and t1.uid = t2.uid
group by d0, d1, d2
order by d0, d1, d2

produces this output from spark.sql(q).toPandas().set_index(["d0","d1","d2"]):
          user_count
d0 d1 d2
0  0  1        73455
   1  0        53345
1  0  0        49254
   1  0         8234
      1        78455

Two rows are obviously missing: 0 1 1 and 1 0 1. Why?!
PS1. I understand why 0 0 0 is missing.
PS2. my_table looks approximately like this:
create table my_table (uid integer, day integer);
insert into my_table values
    (1, 5), (1, 6), (1, 7),
    (2, 5), (2, 6),
    (3, 5), (3, 7),
    (4, 6), (4, 7),
    (5, 5), (6, 6), (7, 7);

For this table I expect the query to return
          user_count
d0 d1 d2
0  0  1            1   --- uid = 7
   1  0            1   --- uid = 6
      1            1   --- uid = 4
1  0  0            1   --- uid = 5
      1            1   --- uid = 3
   1  0            1   --- uid = 2
      1            1   --- uid = 1
Answer
Regarding the original query: the last FULL JOIN must take into account that t0.uid can be NULL as a result of the first FULL JOIN, so its condition must use OR, not AND. With AND, a user present only on days 6 and 7 (so t0.uid is NULL) can never match t2, because NULL = t2.uid is never true; the full outer join then emits separate 0 1 0 and 0 0 1 rows instead of a single 0 1 1 row, and likewise a day-5-and-7 user yields 1 0 0 and 0 0 1 instead of 1 0 1.
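To see the three-valued logic at work, here is a minimal probe (my own illustration, not part of the original answer) showing why the AND chain rejects the row while the OR chain keeps it:

-- NULL never compares equal to anything, so an AND chain containing a
-- NULL comparison can never be true, while an OR chain can still succeed
-- through the non-NULL side.
select (cast(null as int) = 4)           as null_eq,   -- NULL (unknown)
       (cast(null as int) = 4 and 4 = 4) as and_cond,  -- NULL: join row rejected
       (cast(null as int) = 4 or  4 = 4) as or_cond;   -- true: join row kept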
select d0, d1, d2, count(*) as user_count
from (select uid, 1 as d0 from my_table where day=5 and uid is not Null group by uid) as t0
full outer join (select uid, 1 as d1 from my_table where day=6 and uid is not Null group by uid) as t1
    on t0.uid = t1.uid
full outer join (select uid, 1 as d2 from my_table where day=7 and uid is not Null group by uid) as t2
    on t0.uid = t2.uid or t1.uid = t2.uid
group by d0, d1, d2
order by d0, d1, d2;

Personally, I would stick with Gordon Linoff's solution.
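For reference, a conditional-aggregation rewrite sidesteps the NULL-join pitfall entirely; this is a sketch in the spirit of that approach, not necessarily the exact query from the answer being referred to:

-- One row per uid with a 0/1 flag per day, then count the flag combinations.
-- No joins are involved, so no NULL comparisons can go wrong.
select d0, d1, d2, count(*) as user_count
from (
    select uid,
           max(case when day = 5 then 1 else 0 end) as d0,
           max(case when day = 6 then 1 else 0 end) as d1,
           max(case when day = 7 then 1 else 0 end) as d2
    from my_table
    where uid is not Null and day in (5, 6, 7)
    group by uid
) per_user
group by d0, d1, d2
order by d0, d1, d2;

On the sample table above this returns exactly the seven expected rows, one user per combination, including 0 1 1 (uid = 4) and 1 0 1 (uid = 3).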