I am new to Hive and Hadoop framework. I am trying to write a hive query to split the column delimited by a pipe ‘|’ character. Then I want to group up the 2 adjacent values and separate them into separate rows.
Example, I have a table
id mapper 1 a|0.1|b|0.2 2 c|0.2|d|0.3|e|0.6 3 f|0.6
I am able to split the column by using split(mapper, "\|")
which gives me the array
id mapper 1 [a,0.1,b,0.2] 2 [c,0.2,d,0.3,e,0.6] 3 [f,0.6]
Now I tried to to use the lateral view to split the mapper array into separate rows, but it will separate all the values, where as I want to separate by group.
Expected:
id mapper 1 [a,0.1] 1 [b,0.2] 2 [c,0.2] 2 [d,0.3] 2 [e,0.6] 3 [f,0.6]
Actual
id mapper 1 a 1 0.1 1 b 1 0.2 etc .......
How can I achieve this?
Advertisement
Answer
I would suggest you to split your pairs split(mapper, '(?<=\d)\|(?=\w)')
, e.g.
split('c|0.2|d|0.3|e|0.6', '(?<=\d)\|(?=\w)')
results in
["c|0.2","d|0.3","e|0.6"]
then explode the resulting array and split by |
.
Update:
If you have digits as well and your float numbers have only one digit after decimal marker then the regex should be extended to split(mapper, '(?<=\.\d)\|(?=\w|\d)')
.
Update 2:
OK, the best way is to split on the second |
as follows
split(mapper, '(?<!\G[^\|]+)\|')
e.g.
split('6193439|0.0444035224643987|6186654|0.0444035224643987', '(?<!\G[^\|]+)\|')
results in
["6193439|0.0444035224643987","6186654|0.0444035224643987"]