I have a tricky student work in spark. I need to write an SQL query for this kind of array:
x
+--------+------------+-------+
| Gender | Department | Loans |
+--------+------------+-------+
| Male | Maths | 1200 |
| Female | Maths | 1500 |
+--------+------------+-------+
There are more departments and accordingly loans for each department both for males and females. How can I compute a new array where Female’s loans are more than Male’s loans per department and print/show only the departments where female loans are greater than male loans?
Advertisement
Answer
You can group by department, pivot the gender column, and filter the results where female > male
. Below is a PySpark solution, but the solution for Scala should be identical except the part for creating the dataframe.
df = spark.createDataFrame([
['Male', 'Maths', 1200],
['Female', 'Maths', 1500],
['Male', 'Physics', 1800],
['Female', 'Physics', 1500]
]).toDF('Gender', 'Department', 'Loans')
df.show()
+------+----------+-----+
|Gender|Department|Loans|
+------+----------+-----+
| Male| Maths| 1200|
|Female| Maths| 1500|
| Male| Physics| 1800|
|Female| Physics| 1500|
+------+----------+-----+
grouped = df.groupBy("Department").pivot("Gender").sum("Loans")
grouped.show()
+----------+------+----+
|Department|Female|Male|
+----------+------+----+
| Physics| 1500|1800|
| Maths| 1500|1200|
+----------+------+----+
more_female = grouped.filter("Female > Male")
more_female.show()
+----------+------+----+
|Department|Female|Male|
+----------+------+----+
| Maths| 1500|1200|
+----------+------+----+