I have a dataframe pulled from a relational database. A one-to-many join has resulted in many similar rows with one column different. I would like to combine the similar rows but have the differing column data contained within a list, for each unique row. I am also able to change the SQL but I think this may be easier to do downstream within Pandas.
Here is a reproducable example:
x
# example of SQL to fetch data from upstream source
df = pd.read_sql('''
select
emp_no,
concat(first_name, ' ', last_name) as "name",
title
from employees
join titles using (emp_no)
limit 10;
''', engine)
# or create the DF from a dict() - to enable recreation of exact data stucture
df = pd.DataFrame({'emp_no': {0: 10001,
1: 10002,
2: 10003,
3: 10004,
4: 10004,
5: 10005,
6: 10005,
7: 10006,
8: 10007,
9: 10007},
'name': {0: 'Georgi Facello',
1: 'Bezalel Simmel',
2: 'Parto Bamford',
3: 'Chirstian Koblick',
4: 'Chirstian Koblick',
5: 'Kyoichi Maliniak',
6: 'Kyoichi Maliniak',
7: 'Anneke Preusig',
8: 'Tzvetan Zielinski',
9: 'Tzvetan Zielinski'},
'title': {0: 'Senior Engineer',
1: 'Staff',
2: 'Senior Engineer',
3: 'Engineer',
4: 'Senior Engineer',
5: 'Senior Staff',
6: 'Staff',
7: 'Senior Engineer',
8: 'Senior Staff',
9: 'Staff'}})
This outputs:
name | title | |
---|---|---|
10001 | Georgi Facello | Senior Engineer |
10002 | Bezalel Simmel | Staff |
10003 | Parto Bamford | Senior Engineer |
10004 | Chirstian Koblick | Senior Engineer |
10004 | Chirstian Koblick | Engineer |
10005 | Kyoichi Maliniak | Staff |
10005 | Kyoichi Maliniak | Senior Staff |
10006 | Anneke Preusig | Senior Engineer |
10007 | Tzvetan Zielinski | Staff |
10007 | Tzvetan Zielinski | Senior Staff |
I would like to be able to almagamate the similar rows, creating a list with all the different values for a particular record:
name | titles | |
---|---|---|
10001 | Georgi Facello | [Senior Engineer] |
10002 | Bezalel Simmel | [Staff] |
10003 | Parto Bamford | [Senior Engineer] |
10004 | Chirstian Koblick | [Engineer, Senior Engineer] |
10005 | Kyoichi Maliniak | [Senior Staff, Staff] |
10006 | Anneke Preusig | [Senior Engineer] |
10007 | Tzvetan Zielinski | [Senior Staff, Staff] |
What is the best (most readable, easiest to understand and maintain) way to achieve this, either in SQL or in Pandas? (I suspect this will be a lot easier in pandas but I can adjust the SQL to help)
Advertisement
Answer
dfnew = df.groupby(['emp_no', 'name'])['title'].apply(list).reset_index()
dfnew
Output
emp_no name title
0 10001 Georgi Facello [Senior Engineer]
1 10002 Bezalel Simmel [Staff]
2 10003 Parto Bamford [Senior Engineer]
3 10004 Chirstian Koblick [Engineer, Senior Engineer]
4 10005 Kyoichi Maliniak [Senior Staff, Staff]
5 10006 Anneke Preusig [Senior Engineer]
6 10007 Tzvetan Zielinski [Senior Staff, Staff]