Not really sure how to title this question, but here’s the situation. I have one data frame (dfOrders) that has an order_id and basic information like so:
|order_id|full_name|order_date|billing|shipping| ------------------------------------------------ |1234567 |John Doe |1/1/2019 |Address|Address1| |1234567 |John Doe |1/1/2019 |Address|Address2|
Then there’s a second data frame (dfStandardized) that contains standardized address information:
|order_id|latitude |longitude |shippingZip|... ------------------------------------------- |1234567 |97.12345 |101.1245 |12345 |... |1234567 |98.98765 |102.9876 |12389 |...
And essentially the issue is that one customer placed one order but shipped it to two separate addresses. Because of this, there is only one order_id but two rows, one for each of the shipping addresses. What I would like is a data frame like so:
|order_id|full_name|order_date|billing|shipping|latitude |longitude |shippingZip|... --------------------------------------------------------------------------------- |1234567 |John Doe |1/1/2019 |Address|Address1|97.12345 |101.1245 |12345 |... |1234567 |John Doe |1/1/2019 |Address|Address2|98.98765 |102.9876 |12389 |...
Where the extra shipping info is only added to the row with the corresponding correct address, but what I’m getting is:
|order_id|full_name|order_date|billing|shipping|latitude |longitude |shippingZip|... --------------------------------------------------------------------------------- |1234567 |John Doe |1/1/2019 |Address|Address1|97.12345 |101.1245 |12345 |... |1234567 |John Doe |1/1/2019 |Address|Address2|98.98765 |102.9876 |12389 |... |1234567 |John Doe |1/1/2019 |Address|Address1|98.98765 |102.9876 |12389 |... |1234567 |John Doe |1/1/2019 |Address|Address2|97.12345 |101.1245 |12345 |...
Where both addresses are getting receiving the supplemental attribution of both addresses. This is obviously because it’s a many-to-many join due to only having one order_id. is there some way to get the result I want? Here’s the code I’m using:
import pandas as pd df = dfOrders.merge(dfStandardized, on = 'order_id', how = 'inner')
Advertisement
Answer
Usually we do cumcount
dfOrders['New']=dfOrders.groupby('order_id').cumcount() dfStandardized['New']=dfStandardized.groupby('order_id').cumcount() out=dfOrders.merge(dfStandardized, on = ['order_id','new'], how = 'inner').drop('New',1)