Skip to content
Advertisement

Processing mulitple similar rows in Pandas

I have a dataframe pulled from a relational database. A one-to-many join has resulted in many similar rows with one column different. I would like to combine the similar rows but have the differing column data contained within a list, for each unique row. I am also able to change the SQL but I think this may be easier to do downstream within Pandas.

Here is a reproducable example:

    # example of SQL to fetch data from upstream source
    df = pd.read_sql('''
        select
            emp_no,
            concat(first_name, ' ', last_name) as "name",
            title        
        from employees
        join titles using (emp_no)
        limit 10;
    ''', engine)
    
    # or create the DF from a dict() - to enable recreation of exact data stucture
    df = pd.DataFrame({'emp_no': {0: 10001,
  1: 10002,
  2: 10003,
  3: 10004,
  4: 10004,
  5: 10005,
  6: 10005,
  7: 10006,
  8: 10007,
  9: 10007},
 'name': {0: 'Georgi Facello',
  1: 'Bezalel Simmel',
  2: 'Parto Bamford',
  3: 'Chirstian Koblick',
  4: 'Chirstian Koblick',
  5: 'Kyoichi Maliniak',
  6: 'Kyoichi Maliniak',
  7: 'Anneke Preusig',
  8: 'Tzvetan Zielinski',
  9: 'Tzvetan Zielinski'},
 'title': {0: 'Senior Engineer',
  1: 'Staff',
  2: 'Senior Engineer',
  3: 'Engineer',
  4: 'Senior Engineer',
  5: 'Senior Staff',
  6: 'Staff',
  7: 'Senior Engineer',
  8: 'Senior Staff',
  9: 'Staff'}})

This outputs:

name title
10001 Georgi Facello Senior Engineer
10002 Bezalel Simmel Staff
10003 Parto Bamford Senior Engineer
10004 Chirstian Koblick Senior Engineer
10004 Chirstian Koblick Engineer
10005 Kyoichi Maliniak Staff
10005 Kyoichi Maliniak Senior Staff
10006 Anneke Preusig Senior Engineer
10007 Tzvetan Zielinski Staff
10007 Tzvetan Zielinski Senior Staff

I would like to be able to almagamate the similar rows, creating a list with all the different values for a particular record:

name titles
10001 Georgi Facello [Senior Engineer]
10002 Bezalel Simmel [Staff]
10003 Parto Bamford [Senior Engineer]
10004 Chirstian Koblick [Engineer, Senior Engineer]
10005 Kyoichi Maliniak [Senior Staff, Staff]
10006 Anneke Preusig [Senior Engineer]
10007 Tzvetan Zielinski [Senior Staff, Staff]

What is the best (most readable, easiest to understand and maintain) way to achieve this, either in SQL or in Pandas? (I suspect this will be a lot easier in pandas but I can adjust the SQL to help)

Advertisement

Answer

dfnew = df.groupby(['emp_no', 'name'])['title'].apply(list).reset_index()
dfnew

Output

    emp_no  name                title
0   10001   Georgi Facello      [Senior Engineer]
1   10002   Bezalel Simmel      [Staff]
2   10003   Parto Bamford       [Senior Engineer]
3   10004   Chirstian Koblick   [Engineer, Senior Engineer]
4   10005   Kyoichi Maliniak    [Senior Staff, Staff]
5   10006   Anneke Preusig      [Senior Engineer]
6   10007   Tzvetan Zielinski   [Senior Staff, Staff]
User contributions licensed under: CC BY-SA
1 People found this is helpful
Advertisement