Self-Joining across nested Records in BigQuery

I’m trying to do some joins/aggregations between nested fields in single table and running into both SQL problems and the “Correlated sub queries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN” error.

I’d love some SQL help with the general problem, but I’m also curious how to deal with that error.

My problem maps to the BigQuery patents data. In that dataset, a patent has classification data (the cpc record, where cpc.code is one classification code on a record with associated data cpc.inventive and cpc.first). A patent also has patents that it cites (the citation record, where citation.publication_number is a cited patent with associated data citation.type and citation.category). There are more fields in these records but let’s say these are the important ones.

What I want to get is something like this json, with one row per CPC and a record with information that captures how patents with that CPC cite other patents based on the CPCs on the cited patents and facets about both CPCs and the citation. The json would look something like this:

[
  {
      "citing_patent_cpc": "1234/123",
      "cited_patent_cpcs":
      [
        {
          "cpc": "ABCD/345",
          "citing_cpc_inventive": true,
          "citing_cpc_first": false,
          "citation_type": "ABC",
          "citation_category": "A",
          "cited_cpc_inventive": false,
          "cited_cpc_first": true,
          "count": 45
        },
        {
          "cpc": "ABCD/345",
          "citing_cpc_inventive": true,
          "citing_cpc_first": false,
          "citation_type": "ABC",
          "citation_category": "A",
          "cited_cpc_inventive": false,
          "cited_cpc_first": false,
          "count": 12
        },
        {
          "cpc": "H211/123",
          "citing_cpc_inventive": true,
          "citing_cpc_first": false,
          "citation_type": "ABC",
          "citation_category": null,
          "cited_cpc_inventive": true,
          "cited_cpc_first": false,
          "count": 3
        },
        ...

      ]
  },
  {
      "citing_patent_cpc": "1234/ABC",
      "cited_patent_cpcs":
      [
        {
          "cpc": "ABCD/345",
          "citing_cpc_inventive": true,
          "citing_cpc_first": false,
          "citation_type": "ABC",
          "citation_category": "A",
          "cited_cpc_inventive": false,
          "cited_cpc_first": true,
          "count": 16
        },
        {
          "cpc": "ABCD/345",
          "citing_cpc_inventive": true,
          "citing_cpc_first": false,
          "citation_type": "ABC",
          "citation_category": "A",
          "cited_cpc_inventive": false,
          "cited_cpc_first": false,
          "count": 3
        },
        {
          "cpc": "H211/123",
          "citing_cpc_inventive": true,
          "citing_cpc_first": false,
          "citation_type": "ABC",
          "citation_category": null,
          "cited_cpc_inventive": true,
          "cited_cpc_first": false,
          "count": 9
        },
        ...
      ]
  },
  ...
]

Where each unique cpc.code gets a row and an array. The array is records with information on the number of patents cited by patents with the row CPC (“citing_patent_cpc”) having a specific CPC (“cpc”) with various facets of the two patents’ CPCs and the type of citation.

For example, the first record in the example above means that 45 times patents with CPC “1234/123” as a inventive CPC but not the first CPC cited another patent with cpc “ABCD/345” as a first CPC but not an inventive CPC, and this citation was of type “ABC” and category “A”. Theoretically, each row could have a record for every CPC in the corpus * the number of possible facets, though in practice that’s not the case.

As a partial step, I tried to join in the cpc records from cited patents to the record for the citing patent. I got this query to work on a very small table declared in the SQL directly, but it gives “Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.” when I try to run it on large data (like the actual patents table).

Here’s that query:

SELECT
  publication_number,
  cpc,
  citation,
  (
    SELECT ARRAY_CONCAT_AGG(cpc)
    FROM `patents-public-data.patents.publications` AS JoinedPatents
    RIGHT JOIN
      (
        SELECT publication_number
        FROM UNNEST(Patents.citation)
      ) AS unnestedcitation
    ON unnestedcitation.publication_number = JoinedPatents.publication_number) AS cited_cpc
FROM `patents-public-data.patents.publications`AS Patents

I’d love to know:

How to make this query work without that error.
The overall solution to my problem, if anyone is feeling generous with SQL-fu.

Thank you to anyone who has read this far.

Answer

I think this should be a close approximation for the query that you want. It looks to be a big dataset, so I can’t comment on speed/efficiency. Hopefully the logic makes sense at least.

with data as (
-- unnest your data
  select 
    p.publication_number,
    cp.code as cpc_code,
    cp.inventive as cpc_inventive,
    cp.first as cpc_first,
    ci.publication_number as citation_publication_number,
    ci.type as citation_type,
    ci.category as citation_category
  from `patents-public-data.patents.publications` p
  left join unnest(cpc) cp
  left join unnest(citation) ci
),
joined as (
-- do a self-join to join citation publication_number to original publication_number, group to get counts
  select 
    d1.cpc_code as citing_patent_cpc,
    d2.cpc_code as cpc,
    d1.cpc_inventive as citing_cpc_inventive,
    d1.cpc_first as citing_cpc_first,
    d1.citation_type,
    d1.citation_category,
    d2.cpc_inventive as cited_cpc_inventive,
    d2.cpc_first as cited_cpc_first,
    count(*) as count
  from data d1
  left join data d2 on d1.citation_publication_number = d2.publication_number
  group by 1,2,3,4,5,6,7,8
),
agged as (
-- aggrecate to match requested output
  select 
    citing_patent_cpc,
    array_agg(struct(cpc,citing_cpc_inventive,citing_cpc_first,citation_type,citation_category,cited_cpc_inventive,cited_cpc_first,count)) cited_patent_cpcs
  from joined
  group by 1
)
select * from agged

Advertisement

Answer