I’m trying to do some joins/aggregations between nested fields in single table and running into both SQL problems and the “Correlated sub queries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN” error.
I’d love some SQL help with the general problem, but I’m also curious how to deal with that error.
My problem maps to the BigQuery patents data. In that dataset, a patent has classification data (the cpc
record, where cpc.code
is one classification code on a record with associated data cpc.inventive
and cpc.first
). A patent also has patents that it cites (the citation
record, where citation.publication_number
is a cited patent with associated data citation.type
and citation.category
). There are more fields in these records but let’s say these are the important ones.
What I want to get is something like this json, with one row per CPC and a record with information that captures how patents with that CPC cite other patents based on the CPCs on the cited patents and facets about both CPCs and the citation. The json would look something like this:
[ { "citing_patent_cpc": "1234/123", "cited_patent_cpcs": [ { "cpc": "ABCD/345", "citing_cpc_inventive": true, "citing_cpc_first": false, "citation_type": "ABC", "citation_category": "A", "cited_cpc_inventive": false, "cited_cpc_first": true, "count": 45 }, { "cpc": "ABCD/345", "citing_cpc_inventive": true, "citing_cpc_first": false, "citation_type": "ABC", "citation_category": "A", "cited_cpc_inventive": false, "cited_cpc_first": false, "count": 12 }, { "cpc": "H211/123", "citing_cpc_inventive": true, "citing_cpc_first": false, "citation_type": "ABC", "citation_category": null, "cited_cpc_inventive": true, "cited_cpc_first": false, "count": 3 }, ... ] }, { "citing_patent_cpc": "1234/ABC", "cited_patent_cpcs": [ { "cpc": "ABCD/345", "citing_cpc_inventive": true, "citing_cpc_first": false, "citation_type": "ABC", "citation_category": "A", "cited_cpc_inventive": false, "cited_cpc_first": true, "count": 16 }, { "cpc": "ABCD/345", "citing_cpc_inventive": true, "citing_cpc_first": false, "citation_type": "ABC", "citation_category": "A", "cited_cpc_inventive": false, "cited_cpc_first": false, "count": 3 }, { "cpc": "H211/123", "citing_cpc_inventive": true, "citing_cpc_first": false, "citation_type": "ABC", "citation_category": null, "cited_cpc_inventive": true, "cited_cpc_first": false, "count": 9 }, ... ] }, ... ]
Where each unique cpc.code
gets a row and an array. The array is records with information on the number of patents cited by patents with the row CPC (“citing_patent_cpc”) having a specific CPC (“cpc”) with various facets of the two patents’ CPCs and the type of citation.
For example, the first record in the example above means that 45 times patents with CPC “1234/123” as a inventive CPC but not the first CPC cited another patent with cpc “ABCD/345” as a first CPC but not an inventive CPC, and this citation was of type “ABC” and category “A”. Theoretically, each row could have a record for every CPC in the corpus * the number of possible facets, though in practice that’s not the case.
As a partial step, I tried to join in the cpc records from cited patents to the record for the citing patent. I got this query to work on a very small table declared in the SQL directly, but it gives “Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.” when I try to run it on large data (like the actual patents table).
Here’s that query:
SELECT publication_number, cpc, citation, ( SELECT ARRAY_CONCAT_AGG(cpc) FROM `patents-public-data.patents.publications` AS JoinedPatents RIGHT JOIN ( SELECT publication_number FROM UNNEST(Patents.citation) ) AS unnestedcitation ON unnestedcitation.publication_number = JoinedPatents.publication_number) AS cited_cpc FROM `patents-public-data.patents.publications`AS Patents
I’d love to know:
- How to make this query work without that error.
- The overall solution to my problem, if anyone is feeling generous with SQL-fu.
Thank you to anyone who has read this far.
Advertisement
Answer
I think this should be a close approximation for the query that you want. It looks to be a big dataset, so I can’t comment on speed/efficiency. Hopefully the logic makes sense at least.
with data as ( -- unnest your data select p.publication_number, cp.code as cpc_code, cp.inventive as cpc_inventive, cp.first as cpc_first, ci.publication_number as citation_publication_number, ci.type as citation_type, ci.category as citation_category from `patents-public-data.patents.publications` p left join unnest(cpc) cp left join unnest(citation) ci ), joined as ( -- do a self-join to join citation publication_number to original publication_number, group to get counts select d1.cpc_code as citing_patent_cpc, d2.cpc_code as cpc, d1.cpc_inventive as citing_cpc_inventive, d1.cpc_first as citing_cpc_first, d1.citation_type, d1.citation_category, d2.cpc_inventive as cited_cpc_inventive, d2.cpc_first as cited_cpc_first, count(*) as count from data d1 left join data d2 on d1.citation_publication_number = d2.publication_number group by 1,2,3,4,5,6,7,8 ), agged as ( -- aggrecate to match requested output select citing_patent_cpc, array_agg(struct(cpc,citing_cpc_inventive,citing_cpc_first,citation_type,citation_category,cited_cpc_inventive,cited_cpc_first,count)) cited_patent_cpcs from joined group by 1 ) select * from agged