Extracting unique values with SQL [closed]

Question

I'm new to SQL and would greatly appreciate your help with extracting data from a hive table. The table contains two relevant columns: host and url. The url column has a lot of duplicates and ...

Accepted Answer

Solution for Hive. Use row_number() for removing duplicates. See how protocol_key and path_key are calculated, they are used in the row_number() partition by clause: with your_data as (--use your table instead of thisselect stack( 4,'https://www.cnn.com/2019/09/20/politics/','https://www.cnn.com/2019/09/20/politics','http://www.cnn.com/2019/09/20/politics/','http://www.cnn.com/2019/09/20/politics') as url) --your tableselect url from(select s.url, --s.protocol, s.protocol_key, s.host, s.path, s.path_key, --columns for debugging       row_number() over(partition by s.protocol_key, s.host, s.path_key order by s.protocol desc, s.path desc) rn --https and path with/ are preferredfrom       (--parse urlselect t.url, s.protocol, s.host, s.path, regexp_replace(s.path,'/$','') as path_key, regexp_replace(s.protocol,'s$','') as protocol_key  from your_data t       lateral view parse_url_tuple(url, 'PROTOCOL','HOST', 'PATH')s as protocol, host, path)s)s where rn = 1;Result:https://www.cnn.com/2019/09/20/politics/

Advertisement

Answer