I have a dataframe pulled from a relational database. A one-to-many join has resulted in many similar rows with one column different. I would like to combine the similar rows but have the differing column data contained within a list, for each unique row. I am also able to change the SQL but I think this may be easier to do downstream within Pandas.
Here is a reproducable example:
# example of SQL to fetch data from upstream source df = pd.read_sql(''' select emp_no, concat(first_name, ' ', last_name) as "name", title from employees join titles using (emp_no) limit 10; ''', engine) # or create the DF from a dict() - to enable recreation of exact data stucture df = pd.DataFrame({'emp_no': {0: 10001, 1: 10002, 2: 10003, 3: 10004, 4: 10004, 5: 10005, 6: 10005, 7: 10006, 8: 10007, 9: 10007}, 'name': {0: 'Georgi Facello', 1: 'Bezalel Simmel', 2: 'Parto Bamford', 3: 'Chirstian Koblick', 4: 'Chirstian Koblick', 5: 'Kyoichi Maliniak', 6: 'Kyoichi Maliniak', 7: 'Anneke Preusig', 8: 'Tzvetan Zielinski', 9: 'Tzvetan Zielinski'}, 'title': {0: 'Senior Engineer', 1: 'Staff', 2: 'Senior Engineer', 3: 'Engineer', 4: 'Senior Engineer', 5: 'Senior Staff', 6: 'Staff', 7: 'Senior Engineer', 8: 'Senior Staff', 9: 'Staff'}})
This outputs:
name | title | |
---|---|---|
10001 | Georgi Facello | Senior Engineer |
10002 | Bezalel Simmel | Staff |
10003 | Parto Bamford | Senior Engineer |
10004 | Chirstian Koblick | Senior Engineer |
10004 | Chirstian Koblick | Engineer |
10005 | Kyoichi Maliniak | Staff |
10005 | Kyoichi Maliniak | Senior Staff |
10006 | Anneke Preusig | Senior Engineer |
10007 | Tzvetan Zielinski | Staff |
10007 | Tzvetan Zielinski | Senior Staff |
I would like to be able to almagamate the similar rows, creating a list with all the different values for a particular record:
name | titles | |
---|---|---|
10001 | Georgi Facello | [Senior Engineer] |
10002 | Bezalel Simmel | [Staff] |
10003 | Parto Bamford | [Senior Engineer] |
10004 | Chirstian Koblick | [Engineer, Senior Engineer] |
10005 | Kyoichi Maliniak | [Senior Staff, Staff] |
10006 | Anneke Preusig | [Senior Engineer] |
10007 | Tzvetan Zielinski | [Senior Staff, Staff] |
What is the best (most readable, easiest to understand and maintain) way to achieve this, either in SQL or in Pandas? (I suspect this will be a lot easier in pandas but I can adjust the SQL to help)
Advertisement
Answer
dfnew = df.groupby(['emp_no', 'name'])['title'].apply(list).reset_index() dfnew
Output
emp_no name title 0 10001 Georgi Facello [Senior Engineer] 1 10002 Bezalel Simmel [Staff] 2 10003 Parto Bamford [Senior Engineer] 3 10004 Chirstian Koblick [Engineer, Senior Engineer] 4 10005 Kyoichi Maliniak [Senior Staff, Staff] 5 10006 Anneke Preusig [Senior Engineer] 6 10007 Tzvetan Zielinski [Senior Staff, Staff]