Skip to content
Advertisement

Spark SQL to join two results from same table

I have a table called “Sold_Items” like below. And I want to use Spark SQL to get the net sell volumes for each participant.

Item     Buyer   Seller   Qty
----------------------------------
A        JD      Lidl     100
B        SD      JD       500
A        Coop    JD       125
C        JD      SD       300

Intermediate table

Item     Participant      Buy         Sell
--------------------------------------------
A        JD                100        125 
B        JD                0          500
C        JD                300          0
A        Coop              125          0
A        Lidl                0        100
B        SD                500          0    
C        SD                  0        300

Final result should look something like below.

Item     Participant      Net Sell
----------------------------------
A        JD                 25
B        JD                500
C        JD               -300
A        Coop             -125
A        Lidl              100
B        SD               -500  
C        SD                300

I have below two queries for buy and sell side of the first table.

Buy:

SELECT Item, Buyer, sum(qty) as buy_qty from sold_items group by Item, Buyer

Sells:

SELECT Item, Seller, sum(qty) as sell_qty from sold_items group by Item, Seller

I am trying to get the intermediate table so I can use that table to get the final result. But I cannot seem to join the two queries. Would appreciate any suggestions on combining the above two queries to get the intermediate table.

Advertisement

Answer

Unpivot and reaggregate. This is simplest with union all:

select user, sum(buy_qty), sum(sell_qty)
from ((select buyer as user, sum(qty) as buy_qty, 0 as sell_qty
       from sold_items
       group by buyer
      ) union all
      (select seller as user, 0, sum(qty)
       from sold_items
       group by seller
      )
     ) bs
group by user;

Note that the aggregation in the subqueries is not really needed, so this will also work:

select user, sum(buy_qty), sum(sell_qty)
from ((select buyer as user, qty as buy_qty, 0 as sell_qty
       from sold_items
      ) union all
      (select seller as user, 0, qty
       from sold_items
      )
     ) bs
group by user;

I would expect the multiple aggregation version to have better performance on large data sets — although the improvement might not be that big.

User contributions licensed under: CC BY-SA
8 People found this is helpful
Advertisement