I have a dataframe pulled from a relational database. A one-to-many join has resulted in many similar rows with one column different. I would like to combine the similar rows but have the differing column data contained within a list, for each unique row. I am also able to change the SQL but I think this may be easier to do downstream within Pandas.
Here is a reproducable example:
# example of SQL to fetch data from upstream source
df = pd.read_sql('''
select
emp_no,
concat(first_name, ' ', last_name) as "name",
title
from employees
join titles using (emp_no)
limit 10;
''', engine)
# or create the DF from a dict() - to enable recreation of exact data stucture
df = pd.DataFrame({'emp_no': {0: 10001,
1: 10002,
2: 10003,
3: 10004,
4: 10004,
5: 10005,
6: 10005,
7: 10006,
8: 10007,
9: 10007},
'name': {0: 'Georgi Facello',
1: 'Bezalel Simmel',
2: 'Parto Bamford',
3: 'Chirstian Koblick',
4: 'Chirstian Koblick',
5: 'Kyoichi Maliniak',
6: 'Kyoichi Maliniak',
7: 'Anneke Preusig',
8: 'Tzvetan Zielinski',
9: 'Tzvetan Zielinski'},
'title': {0: 'Senior Engineer',
1: 'Staff',
2: 'Senior Engineer',
3: 'Engineer',
4: 'Senior Engineer',
5: 'Senior Staff',
6: 'Staff',
7: 'Senior Engineer',
8: 'Senior Staff',
9: 'Staff'}})
This outputs:
| name | title | |
|---|---|---|
| 10001 | Georgi Facello | Senior Engineer |
| 10002 | Bezalel Simmel | Staff |
| 10003 | Parto Bamford | Senior Engineer |
| 10004 | Chirstian Koblick | Senior Engineer |
| 10004 | Chirstian Koblick | Engineer |
| 10005 | Kyoichi Maliniak | Staff |
| 10005 | Kyoichi Maliniak | Senior Staff |
| 10006 | Anneke Preusig | Senior Engineer |
| 10007 | Tzvetan Zielinski | Staff |
| 10007 | Tzvetan Zielinski | Senior Staff |
I would like to be able to almagamate the similar rows, creating a list with all the different values for a particular record:
| name | titles | |
|---|---|---|
| 10001 | Georgi Facello | [Senior Engineer] |
| 10002 | Bezalel Simmel | [Staff] |
| 10003 | Parto Bamford | [Senior Engineer] |
| 10004 | Chirstian Koblick | [Engineer, Senior Engineer] |
| 10005 | Kyoichi Maliniak | [Senior Staff, Staff] |
| 10006 | Anneke Preusig | [Senior Engineer] |
| 10007 | Tzvetan Zielinski | [Senior Staff, Staff] |
What is the best (most readable, easiest to understand and maintain) way to achieve this, either in SQL or in Pandas? (I suspect this will be a lot easier in pandas but I can adjust the SQL to help)
Advertisement
Answer
dfnew = df.groupby(['emp_no', 'name'])['title'].apply(list).reset_index() dfnew
Output
emp_no name title 0 10001 Georgi Facello [Senior Engineer] 1 10002 Bezalel Simmel [Staff] 2 10003 Parto Bamford [Senior Engineer] 3 10004 Chirstian Koblick [Engineer, Senior Engineer] 4 10005 Kyoichi Maliniak [Senior Staff, Staff] 5 10006 Anneke Preusig [Senior Engineer] 6 10007 Tzvetan Zielinski [Senior Staff, Staff]