Filtering rows in a PySpark DataFrame and creating a new column that contains the result

Question

So I am trying to identify the crimes that happen within the SF downtown boundary on Sunday. My idea was to first write a UDF to label whether each crime is in the area I identify as the downtown area, if …

Accepted Answer

Sample data would have helped. For now I assume that your data looks like this:

```
+----+---+---+
|val1|  x|  y|
+----+---+---+
|  10|  7| 14|
|   5|  1|  4|
|   9|  8| 10|
|   2|  6| 90|
|   7|  2| 30|
|   3|  5| 11|
+----+---+---+
```

Then you don't need a UDF, as you can do the evaluation using the when() function:

```python
import pyspark.sql.functions as F

# sqlContext is predefined in older PySpark shells; with a SparkSession, use spark.createDataFrame
tst = sqlContext.createDataFrame(
    [(10, 7, 14), (5, 1, 4), (9, 8, 10), (2, 6, 90), (7, 2, 30), (3, 5, 11)],
    schema=['val1', 'x', 'y'])

# Flag rows with x in [4, 10] and y in [11, 20]; between() is inclusive on both ends
tst_res = tst.withColumn(
    "isdt",
    F.when(tst.x.between(4, 10) & tst.y.between(11, 20), 1).otherwise(0))
```

This will give the result:

```
tst_res.show()
+----+---+---+----+
|val1|  x|  y|isdt|
+----+---+---+----+
|  10|  7| 14|   1|
|   5|  1|  4|   0|
|   9|  8| 10|   0|
|   2|  6| 90|   0|
|   7|  2| 30|   0|
|   3|  5| 11|   1|
+----+---+---+----+
```

If I have got the data wrong and you still need to pass multiple values to a UDF, you can pass them as an array or a struct. I prefer a struct:

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

@udf(IntegerType())
def check_data(row):
    # Same inclusive bounds as the when() version
    if 4 <= row.x <= 10 and 11 <= row.y <= 20:
        return 1
    else:
        return 0

tst_res1 = tst.withColumn("isdt", check_data(F.struct('x', 'y')))
```

The result will be the same. But it is generally better to avoid UDFs and use Spark's built-in functions, since Spark's Catalyst optimizer cannot understand the logic inside a UDF and cannot optimize it.
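As a quick sanity check of the boundary logic, the rule used on the sample data (x between 4 and 10, y between 11 and 20) can be written as a plain Python function and run against the six sample rows without a Spark session. This is only a sketch: the name `in_downtown` and its default bounds are illustrative assumptions, not part of the original answer. One detail worth watching is that `between()` is inclusive on both ends, while Python's `range(a, b)` excludes `b`, so chained comparisons are the closer match.

```python
# Hypothetical helper (name and default bounds are assumptions for illustration).
# Mirrors: F.when(x.between(4, 10) & y.between(11, 20), 1).otherwise(0)
def in_downtown(x, y, x_bounds=(4, 10), y_bounds=(11, 20)):
    """Return 1 if (x, y) lies inside the inclusive bounding box, else 0."""
    x_lo, x_hi = x_bounds
    y_lo, y_hi = y_bounds
    # Chained comparisons include both endpoints, like Column.between()
    return 1 if (x_lo <= x <= x_hi and y_lo <= y <= y_hi) else 0


# The sample rows from the answer, as (val1, x, y)
rows = [(10, 7, 14), (5, 1, 4), (9, 8, 10), (2, 6, 90), (7, 2, 30), (3, 5, 11)]
flags = [in_downtown(x, y) for _, x, y in rows]
print(flags)  # [1, 0, 0, 0, 0, 1]

# range() drops its upper endpoint, so it does not behave like between()
print(10 in range(4, 10))   # False
print(4 <= 10 <= 10)        # True
```

The printed flags match the isdt column shown in the expected output. If a UDF really is needed, a function like this could be wrapped with udf(in_downtown, IntegerType()) and called on the x and y columns, though the when() version is still the one Spark can optimize.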


Answer