I have a tricky student work in spark. I need to write an SQL query for this kind of array:
+--------+------------+-------+ | Gender | Department | Loans | +--------+------------+-------+ | Male | Maths | 1200 | | Female | Maths | 1500 | +--------+------------+-------+
There are more departments and accordingly loans for each department both for males and females. How can I compute a new array where Female’s loans are more than Male’s loans per department and print/show only the departments where female loans are greater than male loans?
Advertisement
Answer
You can group by department, pivot the gender column, and filter the results where female > male
. Below is a PySpark solution, but the solution for Scala should be identical except the part for creating the dataframe.
df = spark.createDataFrame([ ['Male', 'Maths', 1200], ['Female', 'Maths', 1500], ['Male', 'Physics', 1800], ['Female', 'Physics', 1500] ]).toDF('Gender', 'Department', 'Loans') df.show() +------+----------+-----+ |Gender|Department|Loans| +------+----------+-----+ | Male| Maths| 1200| |Female| Maths| 1500| | Male| Physics| 1800| |Female| Physics| 1500| +------+----------+-----+ grouped = df.groupBy("Department").pivot("Gender").sum("Loans") grouped.show() +----------+------+----+ |Department|Female|Male| +----------+------+----+ | Physics| 1500|1800| | Maths| 1500|1200| +----------+------+----+ more_female = grouped.filter("Female > Male") more_female.show() +----------+------+----+ |Department|Female|Male| +----------+------+----+ | Maths| 1500|1200| +----------+------+----+