I’m trying to select rows that have the same column values using BigQuery on Githubs public data. I’d approach it like so using SQL server but I am getting “Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.” error message.
select t1.id as id, t1.path as path from `bigquery-public-data.github_repos.sample_files` t1 where exists (select path from `bigquery-public-data.github_repos.sample_files` t2 where t1.path = t2.path group by path, id having count(id) > 1)
I’ve also tried performing self-joins like so:
SELECT t1.repo_name as repo_name, t1.path as path, t2.repo_name as reponame2, t2.path as path2 FROM `bigquery-public-data.github_repos.sample_files` as t1 JOIN `bigquery-public-data.github_repos.sample_files` as t2 ON t1.path = t2.path GROUP BY repo_name, path, reponame2, path2
But I’m getting timeout errors for this. What’s the correct way to achieve this?
Advertisement
Answer
I think you are looking for below (at least this is a direct translation of your original query to one that actually works while preserving the logic)
select id, path from `bigquery-public-data.github_repos.sample_files` qualify count(id) over(partition by path) > 1