I am working on an application and one step of the processing involves joining drug NDC’s from claim data with a table that contains the NDC along with a variety of other information including unit price, multisource code, and more.
The table of claims data looks something like this:
Member ID | Claim ID | Date Filled | NDC | Qty | Days Supply |
It was read into a Pandas dataframe as follows:
import pandas as pd claims = pd.read_excel(filepath)
The SQL table of drug pricing information has the following fields
| id | date | ndc | gpi | name | price | multisource_code |
So I need to be able to join them on NDC. I have tried reading in the SQL table as a Pandas dataframe and using that with merge or join to combine them together, but I cannot figure out how to get the appropriate conditions set up so that it joins only the NDC from the SQL table that has a date that is immediately before the date filled in the dataframe.
I also tried merge_asof as it seemed to provide the right paramters and it was setup as follow:
claims['NDC'] = claims['NDC'].astype('int64') drugs['NDC'] = drugs['ndc'].astype('int64') drugs.drop(columns='ndc', inplace=True) sorted_frame = claims.sort_values('NDC') sorted_drugs = drugs.sort_values('NDC') merged_table = pd.merge_asof(sorted_frame, sorted_drugs, on='NDC', left_by='Date Filled', right_by='DATE')
However, it appears to only be returning nan’s for the right frame.
Any advice would be greatly appreciated.
Advertisement
Answer
Hi the answer was to use merge_asof. It was not working previously because I had not correctly casted the datetime. Relying on the infer_datetime_format paramter was causing the problem.
The final merge code was: merged_table = pd.merge_asof(claims, drugs, on=’DATE’, by=’NDC’)