Processing mulitple similar rows in Pandas

I have a dataframe pulled from a relational database. A one-to-many join has resulted in many similar rows with one column different. I would like to combine the similar rows but have the differing column data contained within a list, for each unique row. I am also able to change the SQL but I think this may be easier to do downstream within Pandas.

Here is a reproducable example:

    # example of SQL to fetch data from upstream source
    df = pd.read_sql('''
        select
            emp_no,
            concat(first_name, ' ', last_name) as "name",
            title        
        from employees
        join titles using (emp_no)
        limit 10;
    ''', engine)
    
    # or create the DF from a dict() - to enable recreation of exact data stucture
    df = pd.DataFrame({'emp_no': {0: 10001,
  1: 10002,
  2: 10003,
  3: 10004,
  4: 10004,
  5: 10005,
  6: 10005,
  7: 10006,
  8: 10007,
  9: 10007},
 'name': {0: 'Georgi Facello',
  1: 'Bezalel Simmel',
  2: 'Parto Bamford',
  3: 'Chirstian Koblick',
  4: 'Chirstian Koblick',
  5: 'Kyoichi Maliniak',
  6: 'Kyoichi Maliniak',
  7: 'Anneke Preusig',
  8: 'Tzvetan Zielinski',
  9: 'Tzvetan Zielinski'},
 'title': {0: 'Senior Engineer',
  1: 'Staff',
  2: 'Senior Engineer',
  3: 'Engineer',
  4: 'Senior Engineer',
  5: 'Senior Staff',
  6: 'Staff',
  7: 'Senior Engineer',
  8: 'Senior Staff',
  9: 'Staff'}})

​x
 
    # example of SQL to fetch data from upstream source    df = pd.read_sql('''        select            emp_no,            concat(first_name, ' ', last_name) as "name",            title                from employees        join titles using (emp_no)        limit 10;    ''', engine)        # or create the DF from a dict() - to enable recreation of exact data stucture    df = pd.DataFrame({'emp_no': {0: 10001,  1: 10002,  2: 10003,  3: 10004,  4: 10004,  5: 10005,  6: 10005,  7: 10006,  8: 10007,  9: 10007}, 'name': {0: 'Georgi Facello',  1: 'Bezalel Simmel',  2: 'Parto Bamford',  3: 'Chirstian Koblick',  4: 'Chirstian Koblick',  5: 'Kyoichi Maliniak',  6: 'Kyoichi Maliniak',  7: 'Anneke Preusig',  8: 'Tzvetan Zielinski',  9: 'Tzvetan Zielinski'}, 'title': {0: 'Senior Engineer',  1: 'Staff',  2: 'Senior Engineer',  3: 'Engineer',  4: 'Senior Engineer',  5: 'Senior Staff',  6: 'Staff',  7: 'Senior Engineer',  8: 'Senior Staff',  9: 'Staff'}})​

This outputs:

	name	title
10001	Georgi Facello	Senior Engineer
10002	Bezalel Simmel	Staff
10003	Parto Bamford	Senior Engineer
10004	Chirstian Koblick	Senior Engineer
10004	Chirstian Koblick	Engineer
10005	Kyoichi Maliniak	Staff
10005	Kyoichi Maliniak	Senior Staff
10006	Anneke Preusig	Senior Engineer
10007	Tzvetan Zielinski	Staff
10007	Tzvetan Zielinski	Senior Staff

I would like to be able to almagamate the similar rows, creating a list with all the different values for a particular record:

	name	titles
10001	Georgi Facello	`[Senior Engineer]`
10002	Bezalel Simmel	`[Staff]`
10003	Parto Bamford	`[Senior Engineer]`
10004	Chirstian Koblick	`[Engineer, Senior Engineer]`
10005	Kyoichi Maliniak	`[Senior Staff, Staff]`
10006	Anneke Preusig	`[Senior Engineer]`
10007	Tzvetan Zielinski	`[Senior Staff, Staff]`

What is the best (most readable, easiest to understand and maintain) way to achieve this, either in SQL or in Pandas? (I suspect this will be a lot easier in pandas but I can adjust the SQL to help)

Answer

dfnew = df.groupby(['emp_no', 'name'])['title'].apply(list).reset_index()
dfnew

 
dfnew = df.groupby(['emp_no', 'name'])['title'].apply(list).reset_index()dfnew​

Output

    emp_no  name                title
0   10001   Georgi Facello      [Senior Engineer]
1   10002   Bezalel Simmel      [Staff]
2   10003   Parto Bamford       [Senior Engineer]
3   10004   Chirstian Koblick   [Engineer, Senior Engineer]
4   10005   Kyoichi Maliniak    [Senior Staff, Staff]
5   10006   Anneke Preusig      [Senior Engineer]
6   10007   Tzvetan Zielinski   [Senior Staff, Staff]

 
    emp_no  name                title0   10001   Georgi Facello      [Senior Engineer]1   10002   Bezalel Simmel      [Staff]2   10003   Parto Bamford       [Senior Engineer]3   10004   Chirstian Koblick   [Engineer, Senior Engineer]4   10005   Kyoichi Maliniak    [Senior Staff, Staff]5   10006   Anneke Preusig      [Senior Engineer]6   10007   Tzvetan Zielinski   [Senior Staff, Staff]​

Advertisement

Answer