I have a table called “Sold_Items” like below. And I want to use Spark SQL to get the net sell volumes for each participant.
Item Buyer Seller Qty ---------------------------------- A JD Lidl 100 B SD JD 500 A Coop JD 125 C JD SD 300
Intermediate table
Item Participant Buy Sell -------------------------------------------- A JD 100 125 B JD 0 500 C JD 300 0 A Coop 125 0 A Lidl 0 100 B SD 500 0 C SD 0 300
Final result should look something like below.
Item Participant Net Sell ---------------------------------- A JD 25 B JD 500 C JD -300 A Coop -125 A Lidl 100 B SD -500 C SD 300
I have below two queries for buy and sell side of the first table.
Buy:
SELECT Item, Buyer, sum(qty) as buy_qty from sold_items group by Item, Buyer
Sells:
SELECT Item, Seller, sum(qty) as sell_qty from sold_items group by Item, Seller
I am trying to get the intermediate table so I can use that table to get the final result. But I cannot seem to join the two queries. Would appreciate any suggestions on combining the above two queries to get the intermediate table.
Advertisement
Answer
Unpivot and reaggregate. This is simplest with union all
:
select user, sum(buy_qty), sum(sell_qty) from ((select buyer as user, sum(qty) as buy_qty, 0 as sell_qty from sold_items group by buyer ) union all (select seller as user, 0, sum(qty) from sold_items group by seller ) ) bs group by user;
Note that the aggregation in the subqueries is not really needed, so this will also work:
select user, sum(buy_qty), sum(sell_qty) from ((select buyer as user, qty as buy_qty, 0 as sell_qty from sold_items ) union all (select seller as user, 0, qty from sold_items ) ) bs group by user;
I would expect the multiple aggregation version to have better performance on large data sets — although the improvement might not be that big.