
Dynamically add column(s) and insert values into an SQLite table from JSON objects

I am using the Python tweepy library to scrape and collect tweet data for a research purpose. Specifically, I use the tweepy.StreamingClient class to get tweets from the stream. The returned data is a JSON object whose elements depend on the data fields available in the retrieved tweet post.

So, for example, at one time I received a JSON object with a few elements like this:

{
  "data": {
    "author_id": "1556031969062010881",
    "created_at": "2022-08-18T01:51:03.000Z",
    "geo": {
      "place_id": "006c6743642cb09c"
    },
    "id": "1560081812604469248",
    "text": "Some text ..."
  },
  "matching_rules": [
    {
      "id": "1560077018183630848",
      "tag": "some-key-words"
    }
  ]
}

And at another time, a returned object with expanded elements like:

{
  "data": {
    "author_id": "1118585113657389059",
    "created_at": "2022-08-18T01:55:05.000Z",
    "geo": {
      "place_id": "00611f1548c7bdef"
    },
    "id": "1560082826153828357",
    "referenced_tweets": [
      {
        "type": "replied_to",
        "id": "1559833610697539585"
      }
    ],
    "text": "hhh personality"
  },
  "includes": {
    "tweets": [
      {
        "author_id": "1462820862969716746",
        "created_at": "2022-08-17T09:24:47.000Z",
        "geo": {},
        "id": "1559833610697539585",
        "text": "What do men like in a woman?"
      }
    ]
  },
  "matching_rules": [
    {
      "id": "1560077018183630848",
      "tag": "some-key-words"
    }
  ]
}

And at times, an even more complex JSON object with nested elements like:

{
  "data": {
    "author_id": "1461871206425108480",
    "created_at": "2022-08-17T23:19:14.000Z",
    "geo": {
      "coordinates": {
        "type": "Point",
        "coordinates": [
          -0.1094,
          51.5141
        ]
      },
      "place_id": "3eb2c704fe8a50cb"
    },
    "id": "1560043605762392066",
    "text": "#bitemoji #me #thinkingaboutmydeceasedKoreanGrandmother #ALWAYS #DAILY #AMEN @ London, United Kingdom "
  },
  "matching_rules": [
    {
      "id": "1560042248007458817",
      "tag": "some-key-words"
    }
  ]
}

I am interested in the elements of the data and matching_rules groups. The latter contains only 2 elements, but the elements of the former (data) group can change or expand. Take, for example, the geo element, which is of particular interest in this work. It has:

"geo": {
      "place_id": "some-value"
    },

in the first and second returned objects, but in the 3rd object it is expanded like so:

"geo": {
      "coordinates": {
        "type": "Point",
        "coordinates": [
          -0.1094,
          51.5141
        ]
      },
      "place_id": "3eb2c704fe8a50cb"
    },

So we have to create 2 new columns to store the "type": "Point" and "coordinates": [-0.1094, 51.5141] values.

Yet we can see, in the second object, another geo element inside the "includes" group. This one is not required at all, but could potentially "confuse" our searches.

How can I dynamically add columns to an initially "narrow" starting table and insert values in this scenario?

I have used SQLite to insert values from sensors in an IoT research project, where I used this syntax:

import sqlite3

con = sqlite3.connect("db-name.db")

while True:
    # values received from the sensors
    query = "INSERT INTO table_name(columns_names) VALUES(?,?,?,?,?,?,?,?,?)"
    data = (v1, v2, v3, v4, v5, v6, v7, v8, v9)  # one value per placeholder
    cur = con.execute(query, data)
    con.commit()
con.close()

However, in this case, the dynamic growth in table columns and/or the changing values of some elements (e.g. geo) confuses me all the more!
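To make the question concrete, here is a sketch of the dynamic approach I have in mind (the tweets table name and the underscore-separated flattening scheme are just my own convention): flatten the nested data group into single-level keys, then run ALTER TABLE ... ADD COLUMN for any key the table has not seen before.

```python
import json
import sqlite3


def flatten(obj, prefix=""):
    """Flatten nested dicts into single-level keys, e.g. geo -> geo_place_id."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "_"))
        elif isinstance(value, list):
            flat[name] = json.dumps(value)  # store lists (e.g. coordinates) as JSON text
        else:
            flat[name] = value
    return flat


con = sqlite3.connect(":memory:")  # in-memory for illustration; use a file path in practice
con.execute("CREATE TABLE IF NOT EXISTS tweets (id TEXT PRIMARY KEY)")


def insert_tweet(payload):
    """Insert one streamed payload, adding columns for keys not seen before."""
    row = flatten(payload["data"])
    existing = {info[1] for info in con.execute("PRAGMA table_info(tweets)")}
    for col in row:
        if col not in existing:
            con.execute(f'ALTER TABLE tweets ADD COLUMN "{col}" TEXT')
    cols = ", ".join(f'"{c}"' for c in row)
    marks = ", ".join("?" for _ in row)
    con.execute(f"INSERT INTO tweets ({cols}) VALUES ({marks})", list(row.values()))
    con.commit()
```

This does work mechanically, but it is exactly the kind of schema churn I am unsure about.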


Answer

While you can alter tables dynamically, treating columns in the same way as rows is not a good idea. You can use one of the following options.

  1. Save the whole JSON in a single field and use SQLite’s JSON library.
  2. If you can define the fields you are mostly interested in, you can combine #1 with generated columns, defined at the start, that hold the extracted values of interest.
  3. If you can define a finite amount of columns for all potential fields of interests, you can use a fixed schema and store for each tweet just the available fields.
  4. Perhaps, you should not be using a relational db to start with. Based on your data definition, you may want to use a document store instead.
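A minimal sketch of options 1 and 2 together, assuming SQLite 3.31+ for generated columns (the tweets table and column names are illustrative): the whole payload goes into one raw column, and generated columns extract the fields of interest with the built-in json_extract() function.

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")  # use a file path in practice

# Option 1: keep the whole JSON payload in a single TEXT column.
# Option 2: generated columns (SQLite 3.31+) extract fields of interest.
con.execute("""
    CREATE TABLE tweets (
        raw         TEXT NOT NULL,
        tweet_id    TEXT GENERATED ALWAYS AS (json_extract(raw, '$.data.id')),
        place_id    TEXT GENERATED ALWAYS AS (json_extract(raw, '$.data.geo.place_id')),
        coordinates TEXT GENERATED ALWAYS AS (json_extract(raw, '$.data.geo.coordinates.coordinates'))
    )
""")

payload = {
    "data": {
        "id": "1560043605762392066",
        "geo": {
            "coordinates": {"type": "Point", "coordinates": [-0.1094, 51.5141]},
            "place_id": "3eb2c704fe8a50cb",
        },
        "text": "...",
    },
    "matching_rules": [{"id": "1560042248007458817", "tag": "some-key-words"}],
}
con.execute("INSERT INTO tweets (raw) VALUES (?)", (json.dumps(payload),))

# Paths missing from a payload simply extract to NULL, so the schema
# never has to change when the stream returns a narrower object.
tweet_id, place_id, coords = con.execute(
    "SELECT tweet_id, place_id, coordinates FROM tweets"
).fetchone()
```

Without generated columns, the same extraction can be done per query instead, e.g. SELECT json_extract(raw, '$.data.geo.place_id') FROM tweets.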
User contributions licensed under: CC BY-SA