I am trying to write a SQL query:
select indicator, count(distinct tid) as tidcount from coa group by indicator
Below is the normal output:
indicator  tidcount
M          6219
Z          411424
S          1
I          1
I need a row-wise percentage of the total for each tidcount.
The query I'm trying is below:

spark.sql("""
    select indicator,
           count(tid) as tidcount,
           round(round(count(indicator) / sum(count(indicator)) over (), 4) * 100, 4) as PERCENTAGE_TOTALS
    from coa
    group by indicator
""")
indicator  tidcount  Percentage_total
M          6219      0.72
Z          411424    98.78
S          1         .49
I          1         .02
The expected output is:

indicator  tidcount  Percentage_total
M          6219      1.4
Z          411424    98.5
S          1         .0002
I          1         .0002
Please suggest what I am missing; the solution can be in either Spark SQL or PySpark.
Answer
Solution with spark.sql
spark.sql( """select indicator, COUNT(DISTINCT tid) AS tidcount, COUNT(DISTINCT tid) / sum(COUNT(DISTINCT tid)) over () * 100 AS PCT from coa group by indicator""" )
Solution with pyspark
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# An empty partitionBy() makes a single window spanning all rows,
# so the sum is the grand total across every indicator.
w = Window.partitionBy()

(
    df
    .groupby('indicator')
    .agg(F.count_distinct('tid').alias('tidcount'))  # use F.countDistinct on PySpark < 3.2
    .withColumn('PCT', F.col('tidcount') / F.sum('tidcount').over(w) * 100)
)
Example
df.show()
+---------+---+
|indicator|tid|
+---------+---+
|        a| 10|
|        a| 25|
|        a|  7|
|        b| 10|
|        b| 10|
|        c| 25|
|        c|  7|
|        d|  1|
|        a|  2|
|        a|  3|
+---------+---+
Result
+---------+--------+-----------------+
|indicator|tidcount|              PCT|
+---------+--------+-----------------+
|        d|       1|11.11111111111111|
|        c|       2|22.22222222222222|
|        b|       1|11.11111111111111|
|        a|       5|55.55555555555556|
+---------+--------+-----------------+