count of distinct columns using group by and calculating percentage

Question

I'm trying to write a SQL query. Below is the normal output:

I need row-wise percentage output for the tid counts. The query I'm trying is below:

The expected output is:

Please suggest if I am missing anything. It should be in either spark-sql or pyspark.

Accepted Answer

Solution with `spark.sql`

    spark.sql(
        """select
               indicator,
               COUNT(DISTINCT tid) AS tidcount,
               COUNT(DISTINCT tid) / sum(COUNT(DISTINCT tid)) over () * 100 AS PCT
           from coa
           group by indicator"""
    )

Solution with `pyspark`

    from pyspark.sql import Window
    from pyspark.sql import functions as F

    # Empty partitionBy(): one window spanning every grouped row,
    # so the sum below is the total of all per-indicator counts.
    w = Window.partitionBy()
    (
        df
        .groupby('indicator')
        .agg(F.count_distinct('tid').alias('tidcount'))
        .withColumn('PCT', F.col('tidcount') / F.sum('tidcount').over(w) * 100)
    )

Example

    df.show()
    +---------+---+
    |indicator|tid|
    +---------+---+
    |        a| 10|
    |        a| 25|
    |        a|  7|
    |        b| 10|
    |        b| 10|
    |        c| 25|
    |        c|  7|
    |        d|  1|
    |        a|  2|
    |        a|  3|
    +---------+---+

Result

    +---------+--------+-----------------+
    |indicator|tidcount|              PCT|
    +---------+--------+-----------------+
    |        d|       1|11.11111111111111|
    |        c|       2|22.22222222222222|
    |        b|       1|11.11111111111111|
    |        a|       5|55.55555555555556|
    +---------+--------+-----------------+
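As a cross-check on the numbers in the Result, here is a minimal plain-Python sketch of the same arithmetic, without Spark. The helper name `distinct_count_pct` is made up for illustration; the rows are the ones from the Example. It suggests one thing worth noticing: the denominator is the sum of the per-indicator distinct counts (9 here), not the number of distinct `tid` values in the whole table (6 here), which is why the percentages add up to 100.

```python
from collections import defaultdict

# Same (indicator, tid) rows as in the Example.
rows = [
    ("a", 10), ("a", 25), ("a", 7), ("b", 10), ("b", 10),
    ("c", 25), ("c", 7), ("d", 1), ("a", 2), ("a", 3),
]


def distinct_count_pct(rows):
    """Hypothetical helper: map indicator -> (distinct tid count, percent of total)."""
    tids_by_indicator = defaultdict(set)
    for indicator, tid in rows:
        tids_by_indicator[indicator].add(tid)  # sets drop duplicate tids

    counts = {ind: len(tids) for ind, tids in tids_by_indicator.items()}
    total = sum(counts.values())  # sum of per-group distinct counts
    return {ind: (cnt, cnt / total * 100) for ind, cnt in counts.items()}


result = distinct_count_pct(rows)
for indicator in sorted(result):
    tidcount, pct = result[indicator]
    print(indicator, tidcount, round(pct, 2))
```

If you instead want each group's share of the globally distinct `tid` values, the denominator would have to be computed separately, and the percentages would no longer necessarily sum to 100.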
