Normalizing an adjacency list

I have data represented in this fashion (not sure I properly call it an adjacency list, but that’s the best description I can think of):

+--------+-------+
| Level  | ID    |
+========+=======+
| Parent | A-123 |
+--------+-------+
| Child  | B-123 |
+--------+-------+
| Child  | B-456 |
+--------+-------+
| Child  | B-789 |
+--------+-------+
| Parent | A-456 |
+--------+-------+
| Child  | C-123 |
+--------+-------+
| Child  | C-456 |
+--------+-------+
| Child  | C-789 |
+--------+-------+

​x
 
+--------+-------+| Level  | ID    |+========+=======+| Parent | A-123 |+--------+-------+| Child  | B-123 |+--------+-------+| Child  | B-456 |+--------+-------+| Child  | B-789 |+--------+-------+| Parent | A-456 |+--------+-------+| Child  | C-123 |+--------+-------+| Child  | C-456 |+--------+-------+| Child  | C-789 |+--------+-------+​​

I need to transform it to this format:

+--------+-------+
| Parent | Child |
+========+=======+
| A-123  | B-123 |
+--------+-------+
| A-123  | B-456 |
+--------+-------+
| A-123  | B-789 |
+--------+-------+
| A-456  | C-123 |
+--------+-------+
| A-456  | C-456 |
+--------+-------+
| A-456  | C-789 |
+--------+-------+

 
+--------+-------+| Parent | Child |+========+=======+| A-123  | B-123 |+--------+-------+| A-123  | B-456 |+--------+-------+| A-123  | B-789 |+--------+-------+| A-456  | C-123 |+--------+-------+| A-456  | C-456 |+--------+-------+| A-456  | C-789 |+--------+-------+​

I think this can be done with an SQL window function (I would use sqlite for simplicity) and I’m sure pandas has some tricks in its bag as well, but in any case, I’m not very familiar with these more advanced techniques.

Any idea?

Answer

Pandas solution – replace rows after Parent to missing values and forward filling ID, then remove rows with Parent and rename columns:

m = df['Level'].eq('Parent')

df['Level'] = df['ID'].where(m).ffill()
df = df[~m].rename(columns={'Level':'Parent','ID':'Child'})
print (df)
  Parent Child
1   A123  B123
2   A123  B456
3   A123  B789
5   A456  C123
6   A456  C456
7   A456  C789

 
m = df['Level'].eq('Parent')​df['Level'] = df['ID'].where(m).ffill()df = df[~m].rename(columns={'Level':'Parent','ID':'Child'})print (df)  Parent Child1   A123  B1232   A123  B4563   A123  B7895   A456  C1236   A456  C4567   A456  C789​

Advertisement

Answer