Skip to content
Advertisement

Many to many join behaviour

Not really sure how to title this question, but here’s the situation. I have one data frame (dfOrders) that has an order_id and basic information like so:

|order_id|full_name|order_date|billing|shipping|
------------------------------------------------
|1234567 |John Doe |1/1/2019  |Address|Address1|
|1234567 |John Doe |1/1/2019  |Address|Address2|

Then there’s a second data frame (dfStandardized) that contains standardized address information:

|order_id|latitude |longitude |shippingZip|...
-------------------------------------------
|1234567 |97.12345 |101.1245  |12345      |...
|1234567 |98.98765 |102.9876  |12389      |...

And essentially the issue is that one customer placed one order but shipped it to two separate addresses. Because of this, there is only one order_id but two rows, one for each of the shipping addresses. What I would like is a data frame like so:

|order_id|full_name|order_date|billing|shipping|latitude |longitude |shippingZip|...
---------------------------------------------------------------------------------
|1234567 |John Doe |1/1/2019  |Address|Address1|97.12345 |101.1245  |12345      |...
|1234567 |John Doe |1/1/2019  |Address|Address2|98.98765 |102.9876  |12389      |...

Where the extra shipping info is only added to the row with the corresponding correct address, but what I’m getting is:

|order_id|full_name|order_date|billing|shipping|latitude |longitude |shippingZip|...
---------------------------------------------------------------------------------
|1234567 |John Doe |1/1/2019  |Address|Address1|97.12345 |101.1245  |12345      |...
|1234567 |John Doe |1/1/2019  |Address|Address2|98.98765 |102.9876  |12389      |...
|1234567 |John Doe |1/1/2019  |Address|Address1|98.98765 |102.9876  |12389      |...
|1234567 |John Doe |1/1/2019  |Address|Address2|97.12345 |101.1245  |12345      |...

Where both addresses are getting receiving the supplemental attribution of both addresses. This is obviously because it’s a many-to-many join due to only having one order_id. is there some way to get the result I want? Here’s the code I’m using:

import pandas as pd

df = dfOrders.merge(dfStandardized, on = 'order_id', how = 'inner')

Advertisement

Answer

Usually we do cumcount

dfOrders['New']=dfOrders.groupby('order_id').cumcount()
dfStandardized['New']=dfStandardized.groupby('order_id').cumcount()
out=dfOrders.merge(dfStandardized, on = ['order_id','new'], how = 'inner').drop('New',1)
User contributions licensed under: CC BY-SA
7 People found this is helpful
Advertisement