Skip to content
Advertisement

How can I compare rows of data in an array based on distinct attributes of a column?

I have a tricky student work in spark. I need to write an SQL query for this kind of array:

+--------+------------+-------+
| Gender | Department | Loans |
+--------+------------+-------+
| Male   | Maths      |  1200 |
| Female | Maths      |  1500 |
+--------+------------+-------+

There are more departments and accordingly loans for each department both for males and females. How can I compute a new array where Female’s loans are more than Male’s loans per department and print/show only the departments where female loans are greater than male loans?

Advertisement

Answer

You can group by department, pivot the gender column, and filter the results where female > male. Below is a PySpark solution, but the solution for Scala should be identical except the part for creating the dataframe.

df = spark.createDataFrame([
    ['Male', 'Maths', 1200],
    ['Female', 'Maths', 1500],
    ['Male', 'Physics', 1800],
    ['Female', 'Physics', 1500]
]).toDF('Gender', 'Department', 'Loans')

df.show()
+------+----------+-----+
|Gender|Department|Loans|
+------+----------+-----+
|  Male|     Maths| 1200|
|Female|     Maths| 1500|
|  Male|   Physics| 1800|
|Female|   Physics| 1500|
+------+----------+-----+

grouped = df.groupBy("Department").pivot("Gender").sum("Loans")
grouped.show()
+----------+------+----+
|Department|Female|Male|
+----------+------+----+
|   Physics|  1500|1800|
|     Maths|  1500|1200|
+----------+------+----+

more_female = grouped.filter("Female > Male")
more_female.show()
+----------+------+----+
|Department|Female|Male|
+----------+------+----+
|     Maths|  1500|1200|
+----------+------+----+
User contributions licensed under: CC BY-SA
6 People found this is helpful
Advertisement