Skip to content
Advertisement

How to execute custom logic at pyspark window partition

I have a dataframe in the format shown below, where we will have multiple entries of DEPNAME as shown below, my requirement is to set the result = Y at the DEPNAME level if either flag_1 or flag_2= Y, if both the flag i.e. flag_1 and flag_2 = N the result will be set as N as shown for DEPNAME=personnel

I am able to get the desired result using joins but I am curious if we can do it using window functions as the dataset is quite huge in size.

+---------+------+------+-+------+
|  depName|flag_1|flag_2| result |
+---------+------+------+-+------+
|    sales|    N|  Y    |  Y    |
|    sales|    N|  N    |  Y    |
|    sales|    N|  N    |  Y    |
|personnel|    N|  N    |  N    |
|personnel|    N|  N    |  N    |
|  develop|    Y|  N    |  Y    |
|  develop|    N|  N    |  Y    |
|  develop|    N|  N    |  Y    |
|  develop|    N|  N    |  Y    |
|  develop|    N|  N    |  Y    |
+---------+-----+------+ +------+

Answer

This answers the original version of the question.

This looks like a case expression:

select t.*,
       (case when flag_1 = 'Y' or flag_2 = 'Y'
             then 'Y' else 'N'
        end) as result

For the updated version:

select t.*,
       max(case when flag_1 = 'Y' or flag_2 = 'Y'
                then 'Y' else 'N'
           end) over (partition by depname) as result