Is there a way to parse a CSV string with escaping via HQL/SQL?

Question

I have a problem parsing CSV-formatted data that is stored in a Hive table column and is loaded into a PostgreSQL DB afterwards. I need to retrieve some fields from it; however, if a comma is enclosed in quotes, it should be treated as part of the data to retrieve. On top of that, quotes can be escaped th…

Accepted Answer

It is possible to split the string on every comma that is followed by an even number of quotes (including zero), so that a comma inside quotes is not treated as a separator. This works only if the quotes are balanced (each opening quote has a corresponding closing quote).

Code (Hive):

    with mytable as (
      select 'a,b,c,"d,e,1","dj+""17"""' as original_string
    )
    select -- remove quotes at the beginning and at the end of the string
           regexp_replace(splitted[0],'^"(.*?)"$','$1') as col1,
           regexp_replace(splitted[1],'^"(.*?)"$','$1') as col2,
           regexp_replace(splitted[2],'^"(.*?)"$','$1') as col3,
           regexp_replace(splitted[3],'^"(.*?)"$','$1') as col4,
           regexp_replace(splitted[4],'^"(.*?)"$','$1') as col5
    from (
      select split(original_string, ',(?=(?:[^"]*"[^"]*")*[^"]*$)') as splitted
      from mytable
    ) s;

Result:

    col1    col2    col3    col4    col5
    a       b       c       d,e,1   dj+""17""

Regexp ',(?=(?:[^"]*"[^"]*")*[^"]*$)' means:

    ,                  - comma
    (?=                - start of the "followed by" group (zero-length positive look-ahead)
    (?:[^"]*"[^"]*")   - non-capturing group consisting of 0+ non-quote characters, a quote, 0+ non-quote characters, a quote
    *                  - the group is repeated 0+ times
    [^"]*              - 0+ non-quote characters
    $                  - end of the string
    )                  - end of the "followed by" group (positive look-ahead)

To unquote elements like "d,e,1", the expression regexp_replace(str,'^"(.*?)"$','$1') is used. If the string has both a starting and an ending quote, it removes them.

You may also want to replace each pair of double quotes with a single one, to convert values like dj+""17"" to dj+"17".
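If you want to check the patterns outside Hive before running the query, here is a minimal Python sketch of the same three steps: split on unquoted commas, strip the surrounding quotes, then collapse doubled quotes. It is only an illustration, not part of the Hive solution. The helper name parse_csv_line is hypothetical, and the sketch assumes balanced quotes, as above. Python's re module writes the replacement as \1 where Hive uses $1; the look-ahead and group syntax used here is the same in both.

```python
import re

# Same split pattern as in the Hive query: a comma that is followed
# by an even number of quotes (possibly zero) up to the end of the string.
UNQUOTED_COMMA = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')

# Same unquoting pattern as in the Hive query.
SURROUNDING_QUOTES = re.compile(r'^"(.*?)"$')


def parse_csv_line(line):
    """Hypothetical helper: split one CSV line and unquote its fields.

    Assumes the quotes in `line` are balanced.
    """
    fields = []
    for raw in UNQUOTED_COMMA.split(line):
        value = SURROUNDING_QUOTES.sub(r'\1', raw)
        if value != raw:
            # The field was quoted, so "" inside it is an escaped quote.
            value = value.replace('""', '"')
        fields.append(value)
    return fields


print(parse_csv_line('a,b,c,"d,e,1","dj+""17"""'))
# ['a', 'b', 'c', 'd,e,1', 'dj+"17"']
```

In Hive, the last step would presumably be one more regexp_replace wrapped around the unquoting call, replacing '""' with '"'.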

Answer