Skip to content
Advertisement

Is there a way to parse csv string with escapings via HQL/SQL?

I have a problem parsing csv-formatted data that is stored in a Hive table column that is loaded into PostgreSQL DB afterwards. What I need to do is to retrieve some fields from there, however, if a comma is enquoted, it should be treated as a part of data to retrieve; on top of that, quotes can be escaped themselves. Here’s what I tried, how it worked out and what is expected to be the output:

Consider this string: a,b,c,"d,e,1","dj+""17"""

The output is expected to be like this: a b c d,e,1 dj+"17"

What I tried is using regexp_extract, like this: regexp_extract(data,'(,?(".*?"|[^,]*)){1}',2).

It works almost the way I intended it to work, the only problem is it doesn’t work properly with escaped quotes: it splits "dj+""17""" as dj+ 17 <empty string> on top of what I want it to do.

I found that Hive allows you to solve such tasks using CSVSerde, however, I’ve only seen it being used when the data is stored in a textfile, which is not the case. Is there a workaround this problem?

P.S.: I do lack expertise in Hive and Hadoop in general, so I might not be aware of certain principles and available functionality

Advertisement

Answer

It is possible to split the string by comma followed by only an even number of quotes or zero number of quotes, not splitting if comma is inside quotes. This will only work if you have balanced quotes only (each quote has corresponding closing quote).

Code (Hive):

with mytable as(
select 'a,b,c,"d,e,1","dj+""17"""' as original_string
) 
select --remove quotes at the beginning and at the end of the string
       regexp_replace(splitted[0],'^"(.*?)"$','$1') as col1,
       regexp_replace(splitted[1],'^"(.*?)"$','$1') as col2,
       regexp_replace(splitted[2],'^"(.*?)"$','$1') as col3,
       regexp_replace(splitted[3],'^"(.*?)"$','$1') as col4,
       regexp_replace(splitted[4],'^"(.*?)"$','$1') as col5
from
(
select split(original_string, ',(?=(?:[^"]*"[^"]*")*[^"]*$)') as splitted from mytable
)s;

Result:

col1    col2    col3    col4    col5
a       b       c       d,e,1   dj+""17""

Regexp ',(?=(?:[^"]*"[^"]*")*[^"]*$)' means:

, – comma

(?= – followed by group (zero length positive look-ahead) start

(?:[^"]*"[^"]*") –non-capturing group consisting of 0+ non quote, quote, 0+ non-quote, quote

* group repeated 0+

$ – end of the string

) – end of the followed by group (positive look ahead)

To unquote elements like "d,e,1", expression regexp_replace(str,'^"(.*?)"$','$1') is used. If string has start and end quotes, it removes them.

Also you may want additionally replace two double quotes with single one to convert values like this dj+""17"" to dj+"17".

User contributions licensed under: CC BY-SA
7 People found this is helpful
Advertisement