Spark SQL to join two results from same table

I have a table called “Sold_Items” like below. And I want to use Spark SQL to get the net sell volumes for each participant.

Item     Buyer   Seller   Qty
----------------------------------
A        JD      Lidl     100
B        SD      JD       500
A        Coop    JD       125
C        JD      SD       300

​x
 
Item     Buyer   Seller   Qty----------------------------------A        JD      Lidl     100B        SD      JD       500A        Coop    JD       125C        JD      SD       300​

Intermediate table

Item     Participant      Buy         Sell
--------------------------------------------
A        JD                100        125 
B        JD                0          500
C        JD                300          0
A        Coop              125          0
A        Lidl                0        100
B        SD                500          0    
C        SD                  0        300

 
Item     Participant      Buy         Sell--------------------------------------------A        JD                100        125 B        JD                0          500C        JD                300          0A        Coop              125          0A        Lidl                0        100B        SD                500          0    C        SD                  0        300​

Final result should look something like below.

Item     Participant      Net Sell
----------------------------------
A        JD                 25
B        JD                500
C        JD               -300
A        Coop             -125
A        Lidl              100
B        SD               -500  
C        SD                300

 
Item     Participant      Net Sell----------------------------------A        JD                 25B        JD                500C        JD               -300A        Coop             -125A        Lidl              100B        SD               -500  C        SD                300​

I have below two queries for buy and sell side of the first table.

Buy:

SELECT Item, Buyer, sum(qty) as buy_qty from sold_items group by Item, Buyer

 
SELECT Item, Buyer, sum(qty) as buy_qty from sold_items group by Item, Buyer​

Sells:

SELECT Item, Seller, sum(qty) as sell_qty from sold_items group by Item, Seller

 
SELECT Item, Seller, sum(qty) as sell_qty from sold_items group by Item, Seller​

I am trying to get the intermediate table so I can use that table to get the final result. But I cannot seem to join the two queries. Would appreciate any suggestions on combining the above two queries to get the intermediate table.

Answer

Unpivot and reaggregate. This is simplest with union all:

select user, sum(buy_qty), sum(sell_qty)
from ((select buyer as user, sum(qty) as buy_qty, 0 as sell_qty
       from sold_items
       group by buyer
      ) union all
      (select seller as user, 0, sum(qty)
       from sold_items
       group by seller
      )
     ) bs
group by user;

 
select user, sum(buy_qty), sum(sell_qty)from ((select buyer as user, sum(qty) as buy_qty, 0 as sell_qty       from sold_items       group by buyer      ) union all      (select seller as user, 0, sum(qty)       from sold_items       group by seller      )     ) bsgroup by user;​

Note that the aggregation in the subqueries is not really needed, so this will also work:

select user, sum(buy_qty), sum(sell_qty)
from ((select buyer as user, qty as buy_qty, 0 as sell_qty
       from sold_items
      ) union all
      (select seller as user, 0, qty
       from sold_items
      )
     ) bs
group by user;

 
select user, sum(buy_qty), sum(sell_qty)from ((select buyer as user, qty as buy_qty, 0 as sell_qty       from sold_items      ) union all      (select seller as user, 0, qty       from sold_items      )     ) bsgroup by user;​

I would expect the multiple aggregation version to have better performance on large data sets — although the improvement might not be that big.

Advertisement

Answer