Skip to content
Advertisement

get rows that have the same value column

I’m trying to select rows that have the same column values using BigQuery on Githubs public data. I’d approach it like so using SQL server but I am getting “Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.” error message.

 select t1.id as id, t1.path as path
    from `bigquery-public-data.github_repos.sample_files` t1
    where exists (select path
                  from `bigquery-public-data.github_repos.sample_files` t2
                  where t1.path = t2.path
                  group by path, id
                  having count(id) > 1)

I’ve also tried performing self-joins like so:

 SELECT t1.repo_name as repo_name, t1.path as path, t2.repo_name as reponame2, t2.path as path2
FROM `bigquery-public-data.github_repos.sample_files` as t1 
JOIN `bigquery-public-data.github_repos.sample_files` as t2 ON t1.path = t2.path GROUP BY repo_name, path, reponame2, path2

But I’m getting timeout errors for this. What’s the correct way to achieve this?

Advertisement

Answer

I think you are looking for below (at least this is a direct translation of your original query to one that actually works while preserving the logic)

select id, path
from `bigquery-public-data.github_repos.sample_files` 
qualify count(id) over(partition by path) > 1

User contributions licensed under: CC BY-SA
9 People found this is helpful
Advertisement