Skip to content
Advertisement

Fill NA and update columns from another dataframe

I want to conditionally fill the missing and update the value from another dataframe.

I want to fill missing and update the data on column values in dataframe smalldf.

The condition is, if the value in B column (large df) is in the range of columns Range_FROM and Range_TO in (small df). Always choose the minimum records in (largedf) to fill or update the values in smalldf.

  • For example, in the second record of smalldf values (3), as there is a smaller value (1.3) in the largedf, it would end up being updated by a smaller value (1.3).
  • Another example, the third record of smalldf (Nan) is being filled by 1.6, as it is the smallest value within the range of Range_FROM (1.5) and Range_TO (1.6)

This is the first dataframe (smalldf):

RoadNo Range_FROM Range_TO values
A001 1.15 1.2 0.1
A001 1.35 1.4 3
A001 1.55 1.6 Nan
A001 1.75 1.8 0.1
A001 1.9 2 Nan

This is the second dataframe (largedf):

RoadNo B values
A001 1.1 0.2
A001 1.2 0.1
A001 1.3 1.9
A001 1.4 1.3
A001 1.5 1.6
A001 1.6 1.9
A001 1.7 0.2
A001 1.8 0.1
A001 1.9 1.9
A001 2 1.3

Below is the expected dataframe:

RoadNo Range_FROM Range_TO values
A001 1.15 1.2 0.1
A001 1.35 1.4 1.3
A001 1.55 1.6 1.6
A001 1.75 1.8 0.1
A001 1.9 2 1.3

Below is the code to create the two dataframes:

smalldf = pd.DataFrame(data={"RoadNo":["A001", "A001", "A001", "A001", "A001"],   
                             "Range_FROM": [1.15, 1.35, 1.55, 1.75, 1.9], 
                             "Range_TO":[1.2, 1.4, 1.6, 1.8, 2],
                             "values": [0.1, 0.25, "Nan", 0.1, "Nan" ]})
                             

largedf = pd.DataFrame(data={"RoadNo":["A001", "A001", "A001", "A001", "A001", "A001", "A001", "A001", "A001", "A001"],   
                             "B": [1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2],                             
                             "values": [0.2, 0.1, 1.9, 1.3, 1.6, 1.9, 0.2, 0.1, 1.9, 1.3]})

Please keep in mind that there are other RoadNo (e.g. A002, A003).

Both left join and fillna(df) in pandas did not produce my desired result. Is there a function in Pandas or SQL that can help me with this operation?

Thank you very much!

Advertisement

Answer

In SQL your problem would look like:

SELECT df1.RoadNo, df1.Range_FROM, df1.Range_TO, MIN(df2.values)
FROM df1
LEFT JOIN df2 
    ON df1.RoadNo = df2.RoadNo
    AND df2.B >= df1.Range_FROM
    AND df2.B <= df1.Range_TO
GROUP BY df1.RoadNo, df1.Range_FROM, df1.Range_TO

So we can do:

from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

q = '''SELECT df1.RoadNo, df1.Range_FROM, df1.Range_TO, MIN(df2."values") AS "values"
FROM df1
LEFT JOIN df2 
    ON df1.RoadNo = df2.RoadNo
    AND df2.B > df1.Range_LAG
    AND df2.B <= df1.Range_TO
GROUP BY df1.RoadNo, df1.Range_FROM, df1.Range_TO'''
pysqldf(q)
df = pysqldf(q)
print(df)

Output:

  RoadNo  Range_FROM  Range_TO  values
0   A001        1.15       1.2     0.1
1   A001        1.35       1.4     1.3
2   A001        1.55       1.6     1.9
3   A001        1.75       1.8     0.1
4   A001        1.90       2.0     1.3

But since you actually want something a little different, let’s try:

df1['Range_LAG'] = df1.Range_TO.shift(fill_value=0)

q = '''SELECT df1.RoadNo, df1.Range_FROM, df1.Range_TO, MIN(df2."values") AS "values"
FROM df1
LEFT JOIN df2 
    ON df1.RoadNo = df2.RoadNo
    AND df2.B > df1.Range_LAG
    AND df2.B <= df1.Range_TO
GROUP BY df1.RoadNo, df1.Range_FROM, df1.Range_TO'''

df = pysqldf(q)
print(df)

Output:

  RoadNo  Range_FROM  Range_TO  values
0   A001        1.15       1.2     0.1
1   A001        1.35       1.4     1.3
2   A001        1.55       1.6     1.6
3   A001        1.75       1.8     0.1
4   A001        1.90       2.0     1.3
User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement