The source data is
x
user_id video_interest
1 [{"category":"a","score":1},{"category":"b","score":2},{"category":"c","score":3},{"category":"d","score":4}]
2 [{"category":"e","score":1},{"category":"f","score":2},{"category":"g","score":-3}]
The output is
user_id video_interest_top3
1 [{"category":"d","score":4},{"category":"c","score":3},{"category":"b","score":2}]
2 [{"category":"f","score":2},{"category":"e","score":1}]
I need to filter score>0, and then select each user_id’s top3 video_interest in descending order of score
Advertisement
Answer
Explode JSON array, extract score, calculate max score per user (if necessary to order final output by score desc) and row_number by score to filter top 3, collect array again and concatenate it to STRING if necessary. See comments in the code. I added sorting arrays and the whole output, because it was initially not clear what exactly should be sorted: arrays or final output, remove max_score sorting if you do not need it.
Demo:
with mytable as (
select stack(2,
1,'[{"category":"a","score":1},{"category":"b","score":2},{"category":"c","score":3},{"category":"d","score":4}]',
2,'[{"category":"e","score":1},{"category":"f","score":2},{"category":"g","score":-3}]'
) as (user_id,video_interest)
)
select --collect array and convert to JSON string
user_id, max_score, concat('[',concat_ws(',',collect_list(category_score)),']') as video_interest
from
(
select user_id, category_score, max_score, score
from
(
select --extract score, filter and sort
user_id, vi.category_score, get_json_object(vi.category_score, '$.score') as score,
row_number() over(partition by user_id order by get_json_object(vi.category_score, '$.score') desc) rn,
max(get_json_object(vi.category_score, '$.score')) over (partition by user_id) max_score
from
(--prepare for exploding array
select user_id, regexp_replace(regexp_replace(video_interest,'\[|\]',''), --remove []
'\},\{', '},,,{') as video_interest --replace , between array elements with ,,, to split
from mytable
)s
--split and explode
lateral view outer explode(split(video_interest,',,,')) vi as category_score
where get_json_object(vi.category_score, '$.score')>0
)s
where rn<=3 --filter top 3
distribute by user_id sort by score desc --Sort collection, remove if not necessary
)s
group by user_id, max_score
order by max_score desc --Sorting users by max_score desc, remove if not necessary
Result:
user_id max_score video_interest
1 4 [{"category":"d","score":4},{"category":"c","score":3},{"category":"b","score":2}]
2 2 [{"category":"f","score":2},{"category":"e","score":1}]