Skip to content
Advertisement

Normalizing an adjacency list

I have data represented in this fashion (not sure I properly call it an adjacency list, but that’s the best description I can think of):

+--------+-------+
| Level  | ID    |
+========+=======+
| Parent | A-123 |
+--------+-------+
| Child  | B-123 |
+--------+-------+
| Child  | B-456 |
+--------+-------+
| Child  | B-789 |
+--------+-------+
| Parent | A-456 |
+--------+-------+
| Child  | C-123 |
+--------+-------+
| Child  | C-456 |
+--------+-------+
| Child  | C-789 |
+--------+-------+

I need to transform it to this format:

+--------+-------+
| Parent | Child |
+========+=======+
| A-123  | B-123 |
+--------+-------+
| A-123  | B-456 |
+--------+-------+
| A-123  | B-789 |
+--------+-------+
| A-456  | C-123 |
+--------+-------+
| A-456  | C-456 |
+--------+-------+
| A-456  | C-789 |
+--------+-------+

I think this can be done with an SQL window function (I would use sqlite for simplicity) and I’m sure pandas has some tricks in its bag as well, but in any case, I’m not very familiar with these more advanced techniques.

Any idea?

Advertisement

Answer

Pandas solution – replace rows after Parent to missing values and forward filling ID, then remove rows with Parent and rename columns:

m = df['Level'].eq('Parent')

df['Level'] = df['ID'].where(m).ffill()
df = df[~m].rename(columns={'Level':'Parent','ID':'Child'})
print (df)
  Parent Child
1   A123  B123
2   A123  B456
3   A123  B789
5   A456  C123
6   A456  C456
7   A456  C789
User contributions licensed under: CC BY-SA
9 People found this is helpful
Advertisement