Have table
import pandas as pd import numpy as np list_1=[['Steven',np.nan,'C1'], ['Michael',np.nan,'C2'], ['Robert',np.nan,'C3'], ['Buchanan',np.nan,'C1'], ['Suyama',np.nan,'C2'], ['King',np.nan,'C3']] labels=['first_name','last_name','class'] df=pd.DataFrame(list_1,columns=labels) df
OUT
first_name last_name class 0 Steven NaN C1 1 Michael NaN C2 2 Robert NaN C3 3 Buchanan NaN C1 4 Suyama NaN C2 5 King NaN C3
Need:
first_name last_name Steven Buchanan Michael Suyama Robert King
so i need make non equal join equivalent SQL query :
;with cte as ( SELECT first_name, class, ROW_NUMBER() OVER (partition by class ORDER BY first_name) as rn FROM students ) select c_fn.first_name, c_ln.first_name from cte c_fn join cte c_ln on c_fn.class=c_ln.class and c_ln.rn< c_fn.rn
or as SQL query:
;with cte as ( SELECT first_name, last_name, ROW_NUMBER() OVER ( ORDER BY (select null)) as rn FROM students ) select fn.first_name, ln.first_name as last_name from cte fn join cte ln on ln.rn=fn.rn+3
The problem in PANDAS is that NON EQUAL SELF JOIN cannot be done with MERGE. And I can’t find another way…..
Advertisement
Answer
We can solve this in pandas in a smarter way by using groupby
with agg
and joining the strings. Then we split
them to columns:
dfn = df.groupby('class').agg(' '.join)['first_name'].str.split(' ', expand=True) dfn.columns = [df.columns[:2]] dfn = dfn.reset_index(drop=True) first_name last_name 0 Steven Buchanan 1 Michael Suyama 2 Robert King