Service that does advanced queries on a data set, and automatically returns relevant updated results every time new data is added to the set?

I’m looking for a cloud service that can run advanced statistical calculations on a large number of votes submitted by users, in “real time”.

In our app, users can submit different kinds of votes on various topics: picking a favorite, rating 1-5, answering yes/no, etc.

We also want to show “live” statistics to the user, such as the popularity of a person. These will be generated by rather complex SQL. For example, one query calculates the average number of times a person was picked as favorite, divided by the total number of votes and by the number of games in which the person has participated. In addition, the score for the latest X games should count more than the overall score for all games. This is just one example; there are several other SQL queries of similar complexity.
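To make the weighting idea concrete, here is a minimal Python sketch of one possible reading of that score. The names `GameTally`, `popularity`, `recent_count` and `recent_weight` are hypothetical, and the exact formula (a weighted average of per-game favorite shares) is an assumption, not the actual SQL.

```python
from dataclasses import dataclass


@dataclass
class GameTally:
    favorite_picks: int  # times the person was picked as favorite in this game
    total_votes: int     # all favorite votes cast in this game


def popularity(games, recent_count=3, recent_weight=2.0):
    """Weighted average of per-game favorite share.

    `games` is ordered oldest to newest. The last `recent_count` games
    get `recent_weight`; older games get weight 1.0 (assumed values).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    first_recent = len(games) - recent_count
    for index, game in enumerate(games):
        if game.total_votes == 0:
            continue  # a game without votes carries no information
        weight = recent_weight if index >= first_recent else 1.0
        weighted_sum += weight * game.favorite_picks / game.total_votes
        weight_total += weight
    return weighted_sum / weight_total if weight_total else 0.0
```

With three games and only the latest one counted as recent, the latest game's share contributes twice as much to the result as each older game.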

All our presentable data (including calculated statistics) is served from Firestore documents, and the votes will be saved as Firestore documents.

Ideally, the Firebase backend (Cloud Functions, Firestore, etc.) should not need to know about the query logic.

What I wish for is a pay-as-you-go cloud service that does the following:

  1. I define some schemas and set up the queries we need for our statistics (15-20 different SQL queries), much like setting up views in MySQL.
  2. On every vote, we push the vote data to this service, which will store it in a row.
  3. The service should then, based on its knowledge of the defined queries and the content of the pushed vote data, determine which statistics are affected by the newly added row, and recalculate them. A specific vote type can affect one or more statistics.
  4. Every time a statistic is recalculated, the result should be automatically pushed back to our Firebase backend (for instance by calling an HTTPS endpoint that hits a cloud function) – so we can update the relevant Firestore documents.
  5. The service should be able to throttle the calculations, for example regenerating a statistic at most once a minute even when several votes per second arrive on the same topic.
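Requirements 3 and 5 can be pictured with a small in-memory sketch in Python. Everything here is hypothetical: the `AFFECTED` mapping, the statistic names, and the `ThrottledRecalculator` class are illustrations of the desired behavior, not the API of any existing product.

```python
import time

# Hypothetical mapping from vote type to the statistics it affects.
AFFECTED = {
    "favorite": {"popularity", "favorite_share"},
    "rating": {"average_rating"},
    "yes_no": {"approval_rate"},
}


class ThrottledRecalculator:
    """Marks statistics dirty on each vote and recalculates each
    (statistic, topic) pair at most once per `min_interval_seconds`."""

    def __init__(self, recalculate, min_interval_seconds=60.0,
                 clock=time.monotonic):
        self._recalculate = recalculate  # callback(stat_name, topic_id)
        self._min_interval = min_interval_seconds
        self._clock = clock
        self._dirty = set()
        self._last_run = {}

    def on_vote(self, vote_type, topic_id):
        # Requirement 3: work out which statistics this vote touches.
        for stat in AFFECTED.get(vote_type, ()):
            self._dirty.add((stat, topic_id))

    def flush(self):
        # Requirement 5: skip pairs that were recalculated too recently;
        # they stay dirty and are picked up by a later flush.
        now = self._clock()
        ran = []
        for key in sorted(self._dirty):
            last = self._last_run.get(key)
            if last is None or now - last >= self._min_interval:
                self._recalculate(*key)
                self._last_run[key] = now
                ran.append(key)
        self._dirty.difference_update(ran)
        return ran
```

In this sketch `flush()` would be called periodically, and the `recalculate` callback is where the result would be pushed to an HTTPS endpoint as described in requirement 4.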

Is there any product like this on the market? Or can it be built by combining available cloud services? And what is the official term for such a product, in case I need to search for it myself?

I know that I can probably build a solution like this myself and run it on a cloud-hosted database server that can scale as our needs grow, but I believe that I’m not the first developer with a need for this, so I hope that someone has solved it before me 🙂

Answer

You can build this from existing services on Google Cloud Platform: Google BigQuery, Google Cloud Firestore, Google App Engine (cron jobs), and Google Cloud Tasks.

These services map onto the requirements above as follows:

1) Google BigQuery: Here you can define the schema for the data on which you’re going to run the SQL queries. BigQuery supports both standard SQL and legacy SQL queries.

2) Every vote can be pushed to the defined BigQuery tables using its streaming insert service.

3) Every pushed vote can trigger a recalculation service, which computes the statistics by executing the defined SQL queries. The query results can be stored as documents in collections in Google Cloud Firestore.

4) Google Cloud Firestore: Here you can store the live statistics shown to the user. Firestore supports real-time listeners, so clients can subscribe to the statistics documents and display changes as soon as the statistics are recalculated.

5) In the same service that inserts every vote, create a new record with a “syncId” in another table. The idea is to group the votes cast in a particular interval under a corresponding syncId. The syncId can be suffixed with a timestamp. Choose the interval according to your requirement, and use the cron jobs service to invoke the recalculation service once per interval. Once the recalculation for a particular syncId is completed, the record corresponding to that syncId should be marked as completed.
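The syncId bookkeeping in step 5 could look roughly like the following in-memory Python sketch. The `SyncLedger` and `SyncBatch` names and the `sync-<window start>` id format are assumptions for illustration; in the setup described here the ledger would be a BigQuery or Firestore table and `run_pending` would be invoked by the cron job.

```python
from dataclasses import dataclass, field


@dataclass
class SyncBatch:
    sync_id: str
    votes: list = field(default_factory=list)
    completed: bool = False


class SyncLedger:
    """Groups votes into fixed time windows, one syncId per window."""

    def __init__(self, interval_seconds=60):
        self.interval = interval_seconds
        self.batches = {}  # window start (seconds) -> SyncBatch

    def _window_start(self, timestamp):
        return int(timestamp // self.interval) * self.interval

    def record_vote(self, vote, timestamp):
        # The syncId is suffixed with the start time of its window.
        start = self._window_start(timestamp)
        batch = self.batches.setdefault(start, SyncBatch(f"sync-{start}"))
        batch.votes.append(vote)
        return batch.sync_id

    def run_pending(self, now, recalculate):
        """Called by the scheduler: process every closed, uncompleted window."""
        current = self._window_start(now)
        done = []
        for start in sorted(self.batches):
            batch = self.batches[start]
            if batch.completed or start >= current:
                continue  # already processed, or window still open
            recalculate(batch.votes)
            batch.completed = True  # mark the syncId as completed
            done.append(batch.sync_id)
        return done
```

Because a batch is only processed after its window has closed, many votes per second on the same topic still lead to a single recalculation per interval.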

We are using the above technologies to build a web application on Google Cloud Platform, where the inputs are recorded in Google Cloud Firestore and then stream-inserted into Google BigQuery. The data stored in BigQuery is queried with SQL 30 seconds after each update, and the query results are stored in Google Cloud Firestore to serve dashboards. The dashboards update automatically through listeners configured on the collection in which the dashboard information is stored.

User contributions licensed under: CC BY-SA