I have a problem parsing csv-formatted data that is stored in a Hive table column that is loaded into PostgreSQL DB afterwards. What I need to do is to retrieve some fields from there, however, if a comma is enquoted, it should be treated as a part of data to retrieve; on top of that, quotes can be escaped themselves. Here’s what I tried, how it worked out and what is expected to be the output:
Consider this string: a,b,c,"d,e,1","dj+""17"""
The output is expected to be like this: a b c d,e,1 dj+"17"
What I tried is using regexp_extract
, like this:
regexp_extract(data,'(,?(".*?"|[^,]*)){1}',2)
.
It works almost the way I intended it to work, the only problem is it doesn’t work properly with escaped quotes: it splits "dj+""17"""
as dj+ 17 <empty string>
on top of what I want it to do.
I found that Hive allows you to solve such tasks using CSVSerde, however, I’ve only seen it being used when the data is stored in a textfile, which is not the case. Is there a workaround this problem?
P.S.: I do lack expertise in Hive and Hadoop in general, so I might not be aware of certain principles and available functionality
Advertisement
Answer
It is possible to split the string by comma followed by only an even number of quotes or zero number of quotes, not splitting if comma is inside quotes. This will only work if you have balanced quotes only (each quote has corresponding closing quote).
Code (Hive):
with mytable as( select 'a,b,c,"d,e,1","dj+""17"""' as original_string ) select --remove quotes at the beginning and at the end of the string regexp_replace(splitted[0],'^"(.*?)"$','$1') as col1, regexp_replace(splitted[1],'^"(.*?)"$','$1') as col2, regexp_replace(splitted[2],'^"(.*?)"$','$1') as col3, regexp_replace(splitted[3],'^"(.*?)"$','$1') as col4, regexp_replace(splitted[4],'^"(.*?)"$','$1') as col5 from ( select split(original_string, ',(?=(?:[^"]*"[^"]*")*[^"]*$)') as splitted from mytable )s;
Result:
col1 col2 col3 col4 col5 a b c d,e,1 dj+""17""
Regexp ',(?=(?:[^"]*"[^"]*")*[^"]*$)'
means:
,
– comma
(?=
– followed by group (zero length positive look-ahead) start
(?:[^"]*"[^"]*")
–non-capturing group consisting of 0+ non quote, quote, 0+ non-quote, quote
*
group repeated 0+
$
– end of the string
)
– end of the followed by group (positive look ahead)
To unquote elements like "d,e,1"
, expression regexp_replace(str,'^"(.*?)"$','$1')
is used. If string has start and end quotes, it removes them.
Also you may want additionally replace two double quotes with single one to convert values like this dj+""17""
to dj+"17"
.