I have several tables, and I would like to filter the rows in one of them, based on whether two columns are present in another table. The data on each table is as follows
Table1 : one hash can be associated to several articles ; one article can be associated to several hashes
User Hash | Article Name |
---|---|
Hash1 | Article1 |
Hash1 | Article2 |
Hash2 | Article1 |
Hash3 | Article3 |
Table2 : each User Hash is uniquely associated to a User ID
User Hash | User ID |
---|---|
Hash1 | ID1 |
Hash2 | ID2 |
Hash3 | ID3 |
Table3 : each Article Name is uniquely associated to an Article Number
Article Name | Article number |
---|---|
Article1 | Number1 |
Article2 | Number2 |
Article3 | Number3 |
Table4
User ID | Article Number | OtherField |
---|---|---|
ID1 | Number1 | Misc1 |
ID2 | Number2 | Misc2 |
ID3 | Number3 | Misc3 |
I would like to keep in Table4, only the lines for which the combination (User ID, Article Number) is present in Table1 (as User Hash and Article Name). So in this example, I would like to get the following result :
User ID | Article Number | OtherField |
---|---|---|
ID1 | Number1 | Misc1 |
ID3 | Number3 | Misc3 |
What is the best way to do this in Spark SQL ? I have experimented using JOIN, but I am struggling with the fact that there are two conditions, which I would like both to be valid in a single row.
In my example, ID2 & Number2 are both in Table1, but not on the same row, so I want to filter this row out of Table4.
I hope the question is clear enough. Thanks in advance !
Advertisement
Answer
You can do a semi join:
select * from table4 left semi join ( select * from table1 join table2 using (`User Hash`) join table3 using (`Article Name`) ) using (`User ID`, `Article Number`) +-------+--------------+----------+ |User ID|Article Number|OtherField| +-------+--------------+----------+ | ID1| Number1| Misc1| | ID3| Number3| Misc3| +-------+--------------+----------+
Or equivalently, where exists
:
select * from table4 where exists ( select * from table1 join table2 using (`User Hash`) join table3 using (`Article Name`) where `User ID` = table4.`User ID` and `Article Number` = table4.`Article Number` )