I’m trying to select rows that have the same column values using BigQuery on Githubs public data. I’d approach it like so using SQL server but I am getting “Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.” error message.
x
select t1.id as id, t1.path as path
from `bigquery-public-data.github_repos.sample_files` t1
where exists (select path
from `bigquery-public-data.github_repos.sample_files` t2
where t1.path = t2.path
group by path, id
having count(id) > 1)
I’ve also tried performing self-joins like so:
SELECT t1.repo_name as repo_name, t1.path as path, t2.repo_name as reponame2, t2.path as path2
FROM `bigquery-public-data.github_repos.sample_files` as t1
JOIN `bigquery-public-data.github_repos.sample_files` as t2 ON t1.path = t2.path GROUP BY repo_name, path, reponame2, path2
But I’m getting timeout errors for this. What’s the correct way to achieve this?
Advertisement
Answer
I think you are looking for below (at least this is a direct translation of your original query to one that actually works while preserving the logic)
select id, path
from `bigquery-public-data.github_repos.sample_files`
qualify count(id) over(partition by path) > 1