I am using the Python tweepy
library to scrap and collect tweet data for a research purpose. Specifically is use tweepy.StreamingClient
class to get tweet the from stream. Returned data is json
object whose elements depends on the data fields available with retrieved tweet post.
So for example, at one time I received json
object with few elements like this:
{ "data": { "author_id": "1556031969062010881", "created_at": "2022-08-18T01:51:03.000Z", "geo": { "place_id": "006c6743642cb09c" }, "id": "1560081812604469248", "text": "Some text ..." }, "matching_rules": [ { "id": "1560077018183630848", "tag": "some-key-words" } ] }
And some time, returned object with expanded elements like:
{ "data": { "author_id": "1118585113657389059", "created_at": "2022-08-18T01:55:05.000Z", "geo": { "place_id": "00611f1548c7bdef" }, "id": "1560082826153828357", "referenced_tweets": [ { "type": "replied_to", "id": "1559833610697539585" } ], "text": "hhh personality" }, "includes": { "tweets": [ { "author_id": "1462820862969716746", "created_at": "2022-08-17T09:24:47.000Z", "geo": {}, "id": "1559833610697539585", "text": "What do men like in a woman?" } ] }, "matching_rules": [ { "id": "1560077018183630848", "tag": "some-key-words" } ] }
At even a more complex json
object with nested elements at times like:
{ "data": { "author_id": "1461871206425108480", "created_at": "2022-08-17T23:19:14.000Z", "geo": { "coordinates": { "type": "Point", "coordinates": [ -0.1094, 51.5141 ] }, "place_id": "3eb2c704fe8a50cb" }, "id": "1560043605762392066", "text": "#bitemoji #me #thinkingaboutmydeceasedKoreanGrandmother #ALWAYS #DAILY #AMEN @ London, United Kingdom " }, "matching_rules": [ { "id": "1560042248007458817", "tag": "some-key-words" } ] }
I am interested in elements of the data
, and matching_rules
groups, the latter contains 2-elements only, but former (data
) group element(s) can change or expand. For example the geo
element, which is of particular interest in this work, it has:
"geo": { "place_id": "some-value" },
in the first and second returned objects, but then in 3rd object, it is expanded like so:
"geo": { "coordinates": { "type": "Point", "coordinates": [ -0.1094, 51.5141 ] }, "place_id": "3eb2c704fe8a50cb" },
So we have to create 2-new columns to store the type": "Point",
and "coordinates": [-0.1094,51.5141]
values.
Yet, we can see in the second object, another geo
element in the "includes"
group. This completely not required, but could potentially “confuse” our searches.
How can I have dynamically store add column and an initially “narrow” starting table and insert values in this scenario?
I have used sqlite
to insert values from sensors an IoT research where I used the syntax:
con = sqlite3.connect("db-name.db") while True: # values received query = "INSERT INTO table_name(columns-names) VALUES(?,?,?,?,?,?,?,?,?)" data = (variable) cur = con.execute(query, data) con.commit() con.close()
However, in this case, the dynamic growth in table columns and/or changing values of some elements (e.g. geo
confuses me the more!
Advertisement
Answer
While you can alter tables dynamically, treating columns in the same way as rows is not a good idea. You can use one of the following options.
- Save the whole JSON in a single field and use SQLite’s JSON library.
- If you can define fields you are mostly interested in, you can use #1 with fixed generated columns defined at start and holding extracted values of interest.
- If you can define a finite amount of columns for all potential fields of interests, you can use a fixed schema and store for each tweet just the available fields.
- Perhaps, you should not be using a relational db to start with. Based on your data definition, you may want to use a document store instead.