I have a PySpark data frame with 7 columns. I need to add a new column named "sum" that holds, for each row, the number of columns that contain data (i.e. are not null).
Answer
The sum can be calculated by casting each column's non-null check to an integer (1 for non-null, 0 for null) and adding the results across columns:
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

df = spark.createDataFrame(
    [
        (1, "a", "xxx", None, "abc", "xyz", "fgh"),
        (2, "b", None, 3, "abc", "xyz", "fgh"),
        (3, "c", "a23", None, None, "xyz", "fgh"),
    ],
    ("ID", "flag", "col1", "col2", "col3", "col4", "col5"),
)

# Python's builtin sum() adds the per-column 0/1 Column expressions,
# producing a single Column with the non-null count for each row.
df2 = df.withColumn(
    "sum",
    sum([(~F.isnull(df[col])).cast(IntegerType()) for col in df.columns]),
)
df2.show()

+---+----+----+----+----+----+----+---+
| ID|flag|col1|col2|col3|col4|col5|sum|
+---+----+----+----+----+----+----+---+
|  1|   a| xxx|null| abc| xyz| fgh|  6|
|  2|   b|null|   3| abc| xyz| fgh|  6|
|  3|   c| a23|null|null| xyz| fgh|  5|
+---+----+----+----+----+----+----+---+
Hope this helps!