Skip to content
Advertisement

Extracting unique values with SQL [closed]

I’m new to SQL and would greatly appreciate your help with extracting data from a hive table. The table contains two relevant columns: host and url. The url column has a lot of duplicates and similar urls redirecting to the same page, in the following format:

Example of host format:

I need a query to extract unique urls with preference of https version over http and url with trailing slash over url without trailing slash, when available. So for the example above the result should be:

https://www.cnn.com/2019/09/20/politics/

A simple bash or python script to do the same for local files would be extremely helpful, too.

Advertisement

Answer

Solution for Hive. Use row_number() for removing duplicates. See how protocol_key and path_key are calculated, they are used in the row_number() partition by clause:

Result:

User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement