Skip to content
Advertisement

Self-Joining across nested Records in BigQuery

I’m trying to do some joins/aggregations between nested fields in single table and running into both SQL problems and the “Correlated sub queries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN” error.

I’d love some SQL help with the general problem, but I’m also curious how to deal with that error.

My problem maps to the BigQuery patents data. In that dataset, a patent has classification data (the cpc record, where cpc.code is one classification code on a record with associated data cpc.inventive and cpc.first). A patent also has patents that it cites (the citation record, where citation.publication_number is a cited patent with associated data citation.type and citation.category). There are more fields in these records but let’s say these are the important ones.

What I want to get is something like this json, with one row per CPC and a record with information that captures how patents with that CPC cite other patents based on the CPCs on the cited patents and facets about both CPCs and the citation. The json would look something like this:

Where each unique cpc.code gets a row and an array. The array is records with information on the number of patents cited by patents with the row CPC (“citing_patent_cpc”) having a specific CPC (“cpc”) with various facets of the two patents’ CPCs and the type of citation.

For example, the first record in the example above means that 45 times patents with CPC “1234/123” as a inventive CPC but not the first CPC cited another patent with cpc “ABCD/345” as a first CPC but not an inventive CPC, and this citation was of type “ABC” and category “A”. Theoretically, each row could have a record for every CPC in the corpus * the number of possible facets, though in practice that’s not the case.

As a partial step, I tried to join in the cpc records from cited patents to the record for the citing patent. I got this query to work on a very small table declared in the SQL directly, but it gives “Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.” when I try to run it on large data (like the actual patents table).

Here’s that query:

I’d love to know:

  1. How to make this query work without that error.
  2. The overall solution to my problem, if anyone is feeling generous with SQL-fu.

Thank you to anyone who has read this far.

Advertisement

Answer

I think this should be a close approximation for the query that you want. It looks to be a big dataset, so I can’t comment on speed/efficiency. Hopefully the logic makes sense at least.

User contributions licensed under: CC BY-SA
7 People found this is helpful
Advertisement