Skip to content
Advertisement

Conditional join, spark scala (between limits)

I have two dataframes.

df1:

Team, Sport, CostTicket
Stars, Fotball, 10
Circles, Fotball, 20
Stars, Basket, 12
Stars, Baseball, 14
Circles, Baseball, 25

and

df2:

Team, Sport, CostRange, LowerLimit, UpperLimit
Stars, Football, 0<3, 0, 3
Stars, Football, 4<10, 4, 10
Stars, Football, 11<22, 11, 22
Stars, Football, 24<25, 24, 25
Circles, Football, 0<4, 0, 4
Circles, Football, 5<10, 5, 10
Circles, Football, 11<20, 11, 20
Circles, Football, 21<30, 21, 30
Stars, Basket, 0<2, 0, 2
Stars, Basket, 3<7, 3, 7
Stars, Basket, 8<19, 8, 19
Stars, Basket, 20<30, 20, 30
Circles, Basket, 0<1, 0, 1
Circles, Basket, 2<4, 2, 4
Circles, Basket, 5<15, 5, 15
Circles, Basket, 16<30, 16, 30
Stars, Baseball, 0<10, 0, 10
Stars, Baseball, 11<20, 11, 20
Stars, Baseball, 21<30, 21, 30
Circles, Baseball, 0<4, 0, 4
Circles, Baseball, 5<10, 5, 10
Circles, Baseball, 11<20, 11, 20
Circles, Baseball, 21<30, 21, 30

I want to add a fourth column to df1 with the CostRange from df2.

The final result should be:

Team, Sport, CostTicket, Range
Stars, Fotball, 10, 5<10
Circles, Fotball, 20, 11<22
Stars, Basket, 12, 8<19
Stars, Baseball, 14, 11<20
Circles, Baseball, 25, 21<30

I have come this far, but it does not work. Is there someone that can help me with this?

val df1 = df2.withColumn("Range", df2("CostRange"))
  .where(df1("CostTicket") > df2("LowerLimit"))
  .where(df1("CostTicket") < df2("UpperLimit"))
  .where(df1("Team") === df2("Team"))
  .where(df1("Sport") === df2("Sport"))

Advertisement

Answer

You can’t select two columns from two different dataframes, You need to join two dataframes first

You can join with two columns first and use where as below

df1.join(df2, Seq("Team", "Sport"))
    .where($"CostTicket"  >= $"LowerLimit" && $"CostTicket" <= $"UpperLimit")

Or you could specify in join condition itself as below

df1.join(df2,
    df1("Team") === df2("Team") &&
    df1("Sport") === df2("Sport") &&
    df1("CostTicket") >= df2("LowerLimit") &&
    df1("CostTicket") <= df2("UpperLimit")
).drop(df2("Team"))
 .drop(df2("Sport"))

Output:

+-------+--------+----------+---------+----------+----------+
|Team   |Sport   |CostTicket|CostRange|LowerLimit|UpperLimit|
+-------+--------+----------+---------+----------+----------+
|Stars  |Football|10        |4<10     |4         |10        |
|Circles|Football|20        |11<20    |11        |20        |
|Stars  |Basket  |12        |8<19     |8         |19        |
|Stars  |Baseball|14        |11<20    |11        |20        |
|Circles|Baseball|25        |21<30    |21        |30        |
+-------+--------+----------+---------+----------+----------+
User contributions licensed under: CC BY-SA
1 People found this is helpful
Advertisement