Array operation on hive collect_set

I am working in Hive on a large dataset. I have a table with an array column whose content looks like this:

["20190302Prod4"
"20190303Prod1"
"20190303Prod4"
"20190304Prod4"
"20190305Prod3"
"20190307Prod4"
"20190308Prod4"
"20190309Prod4"
"20190310Prod2"
"20190311Prod1"
"20190311Prod4"
"20190312Prod1"
"20190312Prod4"
"20190313Prod2"
"20190313Prod1"
"20190313Prod4"
"20190314Prod4"
"20190315Prod4"
"20190316Prod4"
"20190317Prod1"
"20190317Prod4"]

I need the set of distinct products in ascending date order, i.e. I need to strip the date from each array element and apply collect_set to get the result below.

["Prod4",
"Prod1",
"Prod3",
"Prod2"]

Answer

Explode the array, remove the date (the digits at the beginning of each string), and aggregate using collect_set:

with mydata as (--use your table instead of this
select array(
"20190302Prod4",
"20190303Prod1",
"20190303Prod4",
"20190304Prod4",
"20190305Prod3",
"20190307Prod4",
"20190308Prod4",
"20190309Prod4",
"20190310Prod2",
"20190311Prod1",
"20190311Prod4",
"20190312Prod1",
"20190312Prod4",
"20190313Prod2",
"20190313Prod1",
"20190313Prod4",
"20190314Prod4",
"20190315Prod4",
"20190316Prod4",
"20190317Prod1",
"20190317Prod4"
) myarray
)

select collect_set(regexp_extract(elem,'^\\d*(.*?)$',1)) col_name --backslash is doubled because Hive unescapes string literals
  from mydata a --Use your table instead
       lateral view outer explode(myarray) s as elem;

Result:

col_name    
["Prod4","Prod1","Prod3","Prod2"]   
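The result above happens to follow the order in which the products first appear, but collect_set is not documented to guarantee any element order. If the ascending-date order matters, one option is to rank each product by its earliest date explicitly. This is only a minimal sketch: it assumes every element starts with an 8-digit yyyyMMdd date (so string comparison matches date order), it uses a shortened sample array, and the names prod and first_date are illustrative.

```sql
with mydata as ( -- shortened sample; use your table instead
  select array(
    '20190302Prod4', '20190303Prod1', '20190303Prod4',
    '20190305Prod3', '20190310Prod2', '20190311Prod1'
  ) as myarray
)
select prod, first_date
  from (
        -- one row per product, with the earliest date it appears on
        select regexp_extract(elem, '^\\d*(.*?)$', 1) as prod, -- backslash doubled for the Hive string literal
               min(substr(elem, 1, 8))                 as first_date -- assumes an 8-digit date prefix
          from mydata a
               lateral view explode(myarray) s as elem
         group by regexp_extract(elem, '^\\d*(.*?)$', 1)
       ) t
 order by first_date, prod; -- prod breaks ties between products that start on the same date
```

For the sample array this returns one row per product, in the order Prod4, Prod1, Prod3, Prod2. Note that it yields rows rather than a single array.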

Another possible method is to concatenate the array into a string first, remove the dates from that string, and split it back into an array. Unfortunately, we still need to explode the result so that collect_set can remove the duplicates (this example uses the same WITH mydata CTE):

select collect_set(elem) col_name
  from mydata a --Use your table instead
       lateral view outer explode(split(regexp_replace(concat_ws(',',myarray),'(^|,)\\d{8}','$1'),',')) s as elem
;
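To see why the explode is still needed here, it can help to look at the intermediate values of the concatenate, replace and split steps. The sketch below runs them on a three-element sample array; the aliases arr, joined, stripped and parts are illustrative and not part of the original answer.

```sql
select concat_ws(',', arr)                                                       as joined,
       regexp_replace(concat_ws(',', arr), '(^|,)\\d{8}', '$1')                 as stripped,
       split(regexp_replace(concat_ws(',', arr), '(^|,)\\d{8}', '$1'), ',')     as parts
  from (
        -- small inline sample instead of a real table
        select array('20190302Prod4', '20190303Prod1', '20190303Prod4') as arr
       ) t;
-- joined   : 20190302Prod4,20190303Prod1,20190303Prod4
-- stripped : Prod4,Prod1,Prod4
-- parts    : ["Prod4","Prod1","Prod4"]
```

The split array still contains Prod4 twice, which is why it has to be exploded and passed through collect_set to get distinct values.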
User contributions licensed under: CC BY-SA