I have a PySpark data frame with 7 columns. I need to add a new column named "sum" that holds, for each row, the number of columns that contain data (i.e. are not null).
Answer
The sum can be calculated by casting each column's non-null check to an integer (1 for non-null, 0 for null) and adding the results across columns:
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

df = spark.createDataFrame(
    [
        (1, "a", "xxx", None, "abc", "xyz", "fgh"),
        (2, "b", None, 3, "abc", "xyz", "fgh"),
        (3, "c", "a23", None, None, "xyz", "fgh"),
    ],
    ("ID", "flag", "col1", "col2", "col3", "col4", "col5"),
)

# Python's builtin sum() adds the per-column 0/1 Column expressions,
# producing a single Column with the non-null count for each row.
df2 = df.withColumn(
    "sum",
    sum([(~F.isnull(df[col])).cast(IntegerType()) for col in df.columns]),
)
df2.show()

+---+----+----+----+----+----+----+---+
| ID|flag|col1|col2|col3|col4|col5|sum|
+---+----+----+----+----+----+----+---+
|  1|   a| xxx|null| abc| xyz| fgh|  6|
|  2|   b|null|   3| abc| xyz| fgh|  6|
|  3|   c| a23|null|null| xyz| fgh|  5|
+---+----+----+----+----+----+----+---+
Hope this helps!