
Filtering rows in a PySpark DataFrame and creating a new column that contains the result

So I am trying to identify crimes that happen within the SF downtown boundary on Sundays. My idea was to first write a UDF that labels each crime according to whether it falls inside the area I identify as downtown: it gets a label of “1” if it happened within that area and “0” if not. After that, I am trying to create a new column to store those results. I tried my best to write everything I can, but it just doesn’t work for some reason. Here is the code I wrote:

The error I am getting is: [picture of the error message]

My guess is that the UDF I have right now doesn’t support taking a whole column as input to be compared, but I don’t know how to fix it to make it work. Please help! Thank you!


Answer

Sample data would have helped. For now, I assume that your data looks like this:
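A minimal, hypothetical example of that layout, assuming the coordinates live in X/Y columns and the weekday in a DayOfWeek column (all three names are assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Purely illustrative rows; the column names and values are assumptions.
df = spark.createDataFrame(
    [
        (-122.4064, 37.7858, "Sunday"),   # inside the assumed downtown box
        (-122.4862, 37.7694, "Sunday"),   # outside it
        (-122.4064, 37.7858, "Monday"),   # inside, but not on a Sunday
    ],
    ["X", "Y", "DayOfWeek"],
)
df.show()
```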

Then you don’t need a UDF, as you can do the evaluation using the when() function:
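A minimal sketch of that approach, assuming the sample columns above and a hypothetical bounding box for downtown SF (replace it with the real boundary):

```python
from pyspark.sql import functions as F

# Hypothetical downtown bounding box; the real boundary values go here.
downtown = (
    (F.col("X") >= -122.4213) & (F.col("X") <= -122.3951)
    & (F.col("Y") >= 37.7540) & (F.col("Y") <= 37.7951)
)

# Label each crime 1/0 depending on whether it falls inside the box.
labelled = df.withColumn("in_downtown", F.when(downtown, 1).otherwise(0))

# The Sunday downtown crimes are then just a filter on top of that label.
labelled.filter(
    (F.col("in_downtown") == 1) & (F.col("DayOfWeek") == "Sunday")
).show()
```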

If I have got the data wrong and you still need to pass multiple values to a UDF, you have to pass them as an array or a struct. I prefer a struct:
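A sketch of the struct variant, under the same column-name and bounding-box assumptions; the struct keeps the original field names, so they can be read back by name inside the UDF:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

@F.udf(returnType=IntegerType())
def label_downtown(point):
    # point arrives as a Row with the fields packed by F.struct below.
    if -122.4213 <= point.X <= -122.3951 and 37.7540 <= point.Y <= 37.7951:
        return 1
    return 0

labelled_udf = df.withColumn(
    "in_downtown",
    label_downtown(F.struct(F.col("X"), F.col("Y"))),
)
labelled_udf.show()
```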

The result will be the same. But it is always better to avoid UDFs and go for Spark’s built-in functions, since the Spark Catalyst optimizer cannot understand the logic inside a UDF and therefore cannot optimize it.
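One way to see that difference is to compare the query plans of the two versions sketched above; the when() version should show up as an ordinary CASE WHEN projection, while the UDF version appears as an opaque Python UDF call:

```python
# Compare the plans Spark generates for the two approaches.
labelled.explain()
labelled_udf.explain()
```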

User contributions licensed under: CC BY-SA