I’m new to SQL and would greatly appreciate your help with extracting data from a hive table. The table contains two relevant columns: host
and url
. The url
column has a lot of duplicates and similar urls redirecting to the same page, in the following format:
https://www.cnn.com/2019/09/20/politics/ https://www.cnn.com/2019/09/20/politics http://www.cnn.com/2019/09/20/politics/ http://www.cnn.com/2019/09/20/politics
Example of host format:
https://www.cnn.com/ http://www.cnn.com/
I need a query to extract unique urls with preference of https
version over http
and url with trailing slash over url without trailing slash, when available. So for the example above the result should be:
https://www.cnn.com/2019/09/20/politics/
A simple bash or python script to do the same for local files would be extremely helpful, too.
Advertisement
Answer
Solution for Hive. Use row_number()
for removing duplicates. See how protocol_key
and path_key
are calculated, they are used in the row_number() partition by
clause:
with your_data as (--use your table instead of this select stack( 4, 'https://www.cnn.com/2019/09/20/politics/', 'https://www.cnn.com/2019/09/20/politics', 'http://www.cnn.com/2019/09/20/politics/', 'http://www.cnn.com/2019/09/20/politics') as url ) --your table select url from ( select s.url, --s.protocol, s.protocol_key, s.host, s.path, s.path_key, --columns for debugging row_number() over(partition by s.protocol_key, s.host, s.path_key order by s.protocol desc, s.path desc) rn --https and path with/ are preferred from (--parse url select t.url, s.protocol, s.host, s.path, regexp_replace(s.path,'/$','') as path_key, regexp_replace(s.protocol,'s$','') as protocol_key from your_data t lateral view parse_url_tuple(url, 'PROTOCOL','HOST', 'PATH')s as protocol, host, path )s)s where rn = 1 ;
Result:
https://www.cnn.com/2019/09/20/politics/