How can I compare rows of data in an array based on distinct attributes of a column?

Question

I have a tricky student work in spark. I need to write an SQL query for this kind of array: There are more departments and accordingly loans for each department both for males and females. How can I compute a new array where Female’s loans are more than Male’s loans per department and print/show only the depa…

Accepted Answer

You can group by department, pivot the gender column, and filter the results where female > male. Below is a PySpark solution, but the solution for Scala should be identical except the part for creating the dataframe.df = spark.createDataFrame([    ['Male', 'Maths', 1200],    ['Female', 'Maths', 1500],    ['Male', 'Physics', 1800],    ['Female', 'Physics', 1500]]).toDF('Gender', 'Department', 'Loans')df.show()+------+----------+-----+|Gender|Department|Loans|+------+----------+-----+|  Male|     Maths| 1200||Female|     Maths| 1500||  Male|   Physics| 1800||Female|   Physics| 1500|+------+----------+-----+grouped = df.groupBy("Department").pivot("Gender").sum("Loans")grouped.show()+----------+------+----+|Department|Female|Male|+----------+------+----+|   Physics|  1500|1800||     Maths|  1500|1200|+----------+------+----+more_female = grouped.filter("Female > Male")more_female.show()+----------+------+----+|Department|Female|Male|+----------+------+----+|     Maths|  1500|1200|+----------+------+----+

Advertisement

Answer