Skip to content
Advertisement

Conditional Join Between SQL Table and Pandas Dataframe

I am working on an application and one step of the processing involves joining drug NDC’s from claim data with a table that contains the NDC along with a variety of other information including unit price, multisource code, and more.

The table of claims data looks something like this:

Member ID | Claim ID | Date Filled | NDC | Qty | Days Supply |

It was read into a Pandas dataframe as follows:

import pandas as pd
claims = pd.read_excel(filepath)

The SQL table of drug pricing information has the following fields

| id | date | ndc | gpi | name | price | multisource_code |

So I need to be able to join them on NDC. I have tried reading in the SQL table as a Pandas dataframe and using that with merge or join to combine them together, but I cannot figure out how to get the appropriate conditions set up so that it joins only the NDC from the SQL table that has a date that is immediately before the date filled in the dataframe.
I also tried merge_asof as it seemed to provide the right paramters and it was setup as follow:

claims['NDC'] = claims['NDC'].astype('int64')
drugs['NDC'] = drugs['ndc'].astype('int64')
drugs.drop(columns='ndc', inplace=True)
sorted_frame = claims.sort_values('NDC')
sorted_drugs = drugs.sort_values('NDC')
merged_table = pd.merge_asof(sorted_frame, sorted_drugs, on='NDC', left_by='Date Filled', right_by='DATE')

However, it appears to only be returning nan’s for the right frame.

Any advice would be greatly appreciated.

Advertisement

Answer

Hi the answer was to use merge_asof. It was not working previously because I had not correctly casted the datetime. Relying on the infer_datetime_format paramter was causing the problem.

The final merge code was: merged_table = pd.merge_asof(claims, drugs, on=’DATE’, by=’NDC’)

User contributions licensed under: CC BY-SA
4 People found this is helpful
Advertisement