Skip to content
Advertisement

Split column in hive

I am new to Hive and Hadoop framework. I am trying to write a hive query to split the column delimited by a pipe ‘|’ character. Then I want to group up the 2 adjacent values and separate them into separate rows.

Example, I have a table

id mapper

1  a|0.1|b|0.2
2  c|0.2|d|0.3|e|0.6
3  f|0.6

I am able to split the column by using split(mapper, "\|") which gives me the array

id mapper

1  [a,0.1,b,0.2]
2  [c,0.2,d,0.3,e,0.6]
3  [f,0.6]

Now I tried to to use the lateral view to split the mapper array into separate rows, but it will separate all the values, where as I want to separate by group.

Expected:

id mapper

1  [a,0.1]
1  [b,0.2]
2  [c,0.2]
2  [d,0.3]
2  [e,0.6]
3  [f,0.6]

Actual

id mapper

1  a
1  0.1
1  b
1  0.2 
etc .......

How can I achieve this?

Advertisement

Answer

I would suggest you to split your pairs split(mapper, '(?<=\d)\|(?=\w)'), e.g.

split('c|0.2|d|0.3|e|0.6', '(?<=\d)\|(?=\w)')

results in

["c|0.2","d|0.3","e|0.6"]

then explode the resulting array and split by |.

Update:

If you have digits as well and your float numbers have only one digit after decimal marker then the regex should be extended to split(mapper, '(?<=\.\d)\|(?=\w|\d)').

Update 2:

OK, the best way is to split on the second | as follows

split(mapper, '(?<!\G[^\|]+)\|')

e.g.

split('6193439|0.0444035224643987|6186654|0.0444035224643987', '(?<!\G[^\|]+)\|')

results in

["6193439|0.0444035224643987","6186654|0.0444035224643987"]
User contributions licensed under: CC BY-SA
6 People found this is helpful
Advertisement