Skip to content
Advertisement

Extracting unique values with SQL [closed]

I’m new to SQL and would greatly appreciate your help with extracting data from a hive table. The table contains two relevant columns: host and url. The url column has a lot of duplicates and similar urls redirecting to the same page, in the following format:

https://www.cnn.com/2019/09/20/politics/
https://www.cnn.com/2019/09/20/politics
http://www.cnn.com/2019/09/20/politics/
http://www.cnn.com/2019/09/20/politics

Example of host format:

https://www.cnn.com/
http://www.cnn.com/

I need a query to extract unique urls with preference of https version over http and url with trailing slash over url without trailing slash, when available. So for the example above the result should be:

https://www.cnn.com/2019/09/20/politics/

A simple bash or python script to do the same for local files would be extremely helpful, too.

Advertisement

Answer

Solution for Hive. Use row_number() for removing duplicates. See how protocol_key and path_key are calculated, they are used in the row_number() partition by clause:

with your_data as (--use your table instead of this
select stack( 4,
'https://www.cnn.com/2019/09/20/politics/',
'https://www.cnn.com/2019/09/20/politics',
'http://www.cnn.com/2019/09/20/politics/',
'http://www.cnn.com/2019/09/20/politics') as url
) --your table

select url from
(
select s.url, --s.protocol, s.protocol_key, s.host, s.path, s.path_key, --columns for debugging
       row_number() over(partition by s.protocol_key, s.host, s.path_key order by s.protocol desc, s.path desc) rn --https and path with/ are preferred
from       
(--parse url
select t.url, s.protocol, s.host, s.path, regexp_replace(s.path,'/$','') as path_key, regexp_replace(s.protocol,'s$','') as protocol_key
  from your_data t
       lateral view parse_url_tuple(url, 'PROTOCOL','HOST', 'PATH')s as protocol, host, path
)s)s where rn = 1
;

Result:

https://www.cnn.com/2019/09/20/politics/
User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement