Skip to content
Advertisement

Spark SQL : filtering a table on records which appear in another table (two columns)?

I have several tables, and I would like to filter the rows in one of them, based on whether two columns are present in another table. The data on each table is as follows

Table1 : one hash can be associated to several articles ; one article can be associated to several hashes

User Hash Article Name
Hash1 Article1
Hash1 Article2
Hash2 Article1
Hash3 Article3

Table2 : each User Hash is uniquely associated to a User ID

User Hash User ID
Hash1 ID1
Hash2 ID2
Hash3 ID3

Table3 : each Article Name is uniquely associated to an Article Number

Article Name Article number
Article1 Number1
Article2 Number2
Article3 Number3

Table4

User ID Article Number OtherField
ID1 Number1 Misc1
ID2 Number2 Misc2
ID3 Number3 Misc3

I would like to keep in Table4, only the lines for which the combination (User ID, Article Number) is present in Table1 (as User Hash and Article Name). So in this example, I would like to get the following result :

User ID Article Number OtherField
ID1 Number1 Misc1
ID3 Number3 Misc3

What is the best way to do this in Spark SQL ? I have experimented using JOIN, but I am struggling with the fact that there are two conditions, which I would like both to be valid in a single row.

In my example, ID2 & Number2 are both in Table1, but not on the same row, so I want to filter this row out of Table4.

I hope the question is clear enough. Thanks in advance !

Advertisement

Answer

You can do a semi join:

select * 
from table4 
left semi join (
    select * from table1 
    join table2 using (`User Hash`) 
    join table3 using (`Article Name`)
) using (`User ID`, `Article Number`)

+-------+--------------+----------+
|User ID|Article Number|OtherField|
+-------+--------------+----------+
|    ID1|       Number1|     Misc1|
|    ID3|       Number3|     Misc3|
+-------+--------------+----------+

Or equivalently, where exists:

select * 
from table4 
where exists (
    select * from table1 
    join table2 using (`User Hash`) 
    join table3 using (`Article Name`) 
    where `User ID` = table4.`User ID` 
    and `Article Number` = table4.`Article Number`
)
User contributions licensed under: CC BY-SA
6 People found this is helpful
Advertisement