I’m new to SQL and would greatly appreciate your help with extracting data from a hive table. The table contains two relevant columns: host
and url
. The url
column has a lot of duplicates and similar urls redirecting to the same page, in the following format:
https://www.cnn.com/2019/09/20/politics/
https://www.cnn.com/2019/09/20/politics
http://www.cnn.com/2019/09/20/politics/
http://www.cnn.com/2019/09/20/politics
Example of host format:
https://www.cnn.com/
http://www.cnn.com/
I need a query to extract unique urls with preference of https
version over http
and url with trailing slash over url without trailing slash, when available. So for the example above the result should be:
https://www.cnn.com/2019/09/20/politics/
A simple bash or python script to do the same for local files would be extremely helpful, too.
Advertisement
Answer
Solution for Hive. Use row_number()
for removing duplicates. See how protocol_key
and path_key
are calculated, they are used in the row_number() partition by
clause:
with your_data as (--use your table instead of this
select stack( 4,
'https://www.cnn.com/2019/09/20/politics/',
'https://www.cnn.com/2019/09/20/politics',
'http://www.cnn.com/2019/09/20/politics/',
'http://www.cnn.com/2019/09/20/politics') as url
) --your table
select url from
(
select s.url, --s.protocol, s.protocol_key, s.host, s.path, s.path_key, --columns for debugging
row_number() over(partition by s.protocol_key, s.host, s.path_key order by s.protocol desc, s.path desc) rn --https and path with/ are preferred
from
(--parse url
select t.url, s.protocol, s.host, s.path, regexp_replace(s.path,'/$','') as path_key, regexp_replace(s.protocol,'s$','') as protocol_key
from your_data t
lateral view parse_url_tuple(url, 'PROTOCOL','HOST', 'PATH')s as protocol, host, path
)s)s where rn = 1
;
Result:
https://www.cnn.com/2019/09/20/politics/