I want to conditionally fill the missing and update the value from another dataframe.
I want to fill missing and update the data on column values in dataframe smalldf.
The condition is, if the value in B column (large df) is in the range of columns Range_FROM and Range_TO in (small df). Always choose the minimum records in (largedf) to fill or update the values in smalldf.
- For example, in the second record of smalldf values (3), as there is a smaller value (1.3) in the largedf, it would end up being updated by a smaller value (1.3).
- Another example, the third record of smalldf (Nan) is being filled by 1.6, as it is the smallest value within the range of Range_FROM (1.5) and Range_TO (1.6)
This is the first dataframe (smalldf):
RoadNo | Range_FROM | Range_TO | values |
---|---|---|---|
A001 | 1.15 | 1.2 | 0.1 |
A001 | 1.35 | 1.4 | 3 |
A001 | 1.55 | 1.6 | Nan |
A001 | 1.75 | 1.8 | 0.1 |
A001 | 1.9 | 2 | Nan |
This is the second dataframe (largedf):
RoadNo | B | values |
---|---|---|
A001 | 1.1 | 0.2 |
A001 | 1.2 | 0.1 |
A001 | 1.3 | 1.9 |
A001 | 1.4 | 1.3 |
A001 | 1.5 | 1.6 |
A001 | 1.6 | 1.9 |
A001 | 1.7 | 0.2 |
A001 | 1.8 | 0.1 |
A001 | 1.9 | 1.9 |
A001 | 2 | 1.3 |
Below is the expected dataframe:
RoadNo | Range_FROM | Range_TO | values |
---|---|---|---|
A001 | 1.15 | 1.2 | 0.1 |
A001 | 1.35 | 1.4 | 1.3 |
A001 | 1.55 | 1.6 | 1.6 |
A001 | 1.75 | 1.8 | 0.1 |
A001 | 1.9 | 2 | 1.3 |
Below is the code to create the two dataframes:
smalldf = pd.DataFrame(data={"RoadNo":["A001", "A001", "A001", "A001", "A001"], "Range_FROM": [1.15, 1.35, 1.55, 1.75, 1.9], "Range_TO":[1.2, 1.4, 1.6, 1.8, 2], "values": [0.1, 0.25, "Nan", 0.1, "Nan" ]}) largedf = pd.DataFrame(data={"RoadNo":["A001", "A001", "A001", "A001", "A001", "A001", "A001", "A001", "A001", "A001"], "B": [1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2], "values": [0.2, 0.1, 1.9, 1.3, 1.6, 1.9, 0.2, 0.1, 1.9, 1.3]})
Please keep in mind that there are other RoadNo (e.g. A002, A003).
Both left join and fillna(df) in pandas did not produce my desired result. Is there a function in Pandas or SQL that can help me with this operation?
Thank you very much!
Advertisement
Answer
In SQL your problem would look like:
SELECT df1.RoadNo, df1.Range_FROM, df1.Range_TO, MIN(df2.values) FROM df1 LEFT JOIN df2 ON df1.RoadNo = df2.RoadNo AND df2.B >= df1.Range_FROM AND df2.B <= df1.Range_TO GROUP BY df1.RoadNo, df1.Range_FROM, df1.Range_TO
So we can do:
from pandasql import sqldf pysqldf = lambda q: sqldf(q, globals()) q = '''SELECT df1.RoadNo, df1.Range_FROM, df1.Range_TO, MIN(df2."values") AS "values" FROM df1 LEFT JOIN df2 ON df1.RoadNo = df2.RoadNo AND df2.B > df1.Range_LAG AND df2.B <= df1.Range_TO GROUP BY df1.RoadNo, df1.Range_FROM, df1.Range_TO''' pysqldf(q) df = pysqldf(q) print(df)
Output:
RoadNo Range_FROM Range_TO values 0 A001 1.15 1.2 0.1 1 A001 1.35 1.4 1.3 2 A001 1.55 1.6 1.9 3 A001 1.75 1.8 0.1 4 A001 1.90 2.0 1.3
But since you actually want something a little different, let’s try:
df1['Range_LAG'] = df1.Range_TO.shift(fill_value=0) q = '''SELECT df1.RoadNo, df1.Range_FROM, df1.Range_TO, MIN(df2."values") AS "values" FROM df1 LEFT JOIN df2 ON df1.RoadNo = df2.RoadNo AND df2.B > df1.Range_LAG AND df2.B <= df1.Range_TO GROUP BY df1.RoadNo, df1.Range_FROM, df1.Range_TO''' df = pysqldf(q) print(df)
Output:
RoadNo Range_FROM Range_TO values 0 A001 1.15 1.2 0.1 1 A001 1.35 1.4 1.3 2 A001 1.55 1.6 1.6 3 A001 1.75 1.8 0.1 4 A001 1.90 2.0 1.3