
Data model guidance, database choice for aggregations on changing filter criteria

Problem:

We are looking for some guidance on what database to use and how to model our data to efficiently query for aggregated statistics as well as statistics related to a specific entity.

We have different underlying data but this example should showcase the fundamental problem:

Let’s say you have data of Facebook friend requests and interactions over time. You now would like to answer questions like the following:

  • In 2018 which American had the most German friends that like ACDC?
  • Which are the friends that person X most interacted with on topic Y?

The general problem is that we have a lot of changing filter criteria (country, topic, interests, time) on both the entities that we want to calculate statistics for and the relevant related entities to calculate these statistics on.

Non-Functional Requirements:

  • It is an offline use-case, meaning there are no inserts, deletes or updates happening, instead every X weeks a new complete dump is imported to replace the old data.
  • We would like an upper bound of 10 seconds to answer our queries. The faster the better; a maximum of 2 seconds per query would be great.
  • The actual data has around 100-200 million entries, growth rate is linear.
  • The system has to serve a limited number of concurrent users, max 100.

Questions:

  • What would be the right database technology or mixture of technologies to solve our problem?
  • What would be an efficient data model for computing aggregations with changing filter criteria in several dimensions?
  • (Bonus) What would be the estimated hardware requirements given a specific technology?

What we tried so far:

  • Setting up a document store with denormalized entries. Problem: It doesn’t perform well on general queries because it has to scan too many entries for aggregations.
  • Setting up a graph database with normalized entries. Problem: performs even more poorly on aggregations.


Answer

You talk about which database to use, but it sounds like you need a data warehouse or business intelligence solution, not just a database.

The difference (in a nutshell) is that a data warehouse (DW) can support multiple reporting views, custom data models, and/or pre-aggregations, which allow you to do advanced analysis and detailed filtering. Data warehouses tend to hold a lot of data and are generally built to be very scalable and flexible (in terms of how the data will be used). There are plenty of articles online covering the difference between a DW and a database in more detail.

A business intelligence (BI) tool is a “lighter” version of a data warehouse, where the goal is to answer specific data questions extremely rapidly and without heavy technical end-user knowledge. BI tools provide a lot of visualization functionality (easy-to-configure graphs and filters). BI tools are often used together with a data warehouse: the data is modeled, cleaned, and stored inside the warehouse, and the BI tool pulls the prepared data into specific visualizations or reports. However, many companies (particularly smaller companies) do use BI tools without a data warehouse.

Now, there’s the question of which data warehouse and/or BI solution to use.

That’s a whole topic of its own & well beyond the scope of what I write here, but here are a few popular tool names to help you get started: Tableau, PowerBI, Domo, Snowflake, Redshift, etc.

Lastly, there’s the data modeling piece of it.

To summarize your requirements, you have “lots of changing filter criteria” and varied statistics that you’ll need, for a variety of entities.

The data model inside of a DW would often use a star, snowflake, or data vault schema. (There are plenty of articles online explaining those.) If you’re using a BI tool alone, you can de-normalize the data into a combined dataset, which would give you a variety of filtering & calculation options while still maintaining high performance and speed.
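To make the star-schema idea concrete, here is a minimal sketch using SQLite purely for illustration (a real DW would use a columnar engine; all table and column names here are hypothetical): one fact table of interactions surrounded by person and topic dimension tables, queried in the shape of question 2.

```python
import sqlite3

# Star-schema sketch: a fact table of interactions plus dimensions.
# Names and sample data are made up for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_person (person_id INTEGER PRIMARY KEY,
                         name TEXT, country TEXT, top_interest TEXT);
CREATE TABLE dim_topic  (topic_id INTEGER PRIMARY KEY, topic TEXT);
CREATE TABLE fact_interaction (
    initiator_id     INTEGER REFERENCES dim_person(person_id),
    responder_id     INTEGER REFERENCES dim_person(person_id),
    topic_id         INTEGER REFERENCES dim_topic(topic_id),
    interaction_date TEXT);
""")
con.executemany("INSERT INTO dim_person VALUES (?,?,?,?)",
    [(1, "Alice", "US", "ACDC"),
     (2, "Bernd", "DE", "ACDC"),
     (3, "Clara", "DE", "Jazz")])
con.execute("INSERT INTO dim_topic VALUES (1, 'music')")
con.executemany("INSERT INTO fact_interaction VALUES (?,?,?,?)",
    [(1, 2, 1, "2018-03-01"),
     (1, 2, 1, "2018-05-17"),
     (1, 3, 1, "2018-04-02")])

# "Which friends did person 1 most interact with on topic 'music'?"
# Every WHERE predicate is a swappable filter dimension.
rows = con.execute("""
    SELECT p.name, COUNT(*) AS n
    FROM fact_interaction f
    JOIN dim_person p ON p.person_id = f.responder_id
    JOIN dim_topic  t ON t.topic_id  = f.topic_id
    WHERE f.initiator_id = 1 AND t.topic = 'music'
    GROUP BY p.name
    ORDER BY n DESC
""").fetchall()
print(rows)  # [('Bernd', 2), ('Clara', 1)]
```

The point of the star layout is that new filter criteria become extra joins or WHERE clauses against dimension tables, rather than a new copy of the data.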

Let’s look at the example you gave:

Data of Facebook friend requests and interactions over time. You need to answer:

  1. In 2018 which American had the most German friends that like ACDC?

  2. Which are the friends that person X most interacted with on topic Y?

You want to filter/re-calculate the answers to those questions based on country, topic, interests, time.

One potential dataset can be structured like:

Date of Interaction | Initiating Person’s Country | Responding Person’s Country | Topic | Interaction Type | Initiating Person’s Top Interest | Responding Person’s Top Interest

This would allow you to easily count the number of interactions, grouped and/or filtered by any of those columns.
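As a rough sketch of how that denormalized dataset behaves, here is the same idea in plain Python (the rows and column order are hypothetical; in practice a batch job would produce this table from each new dump, and a BI tool or columnar engine would do the scanning):

```python
from collections import Counter

# Hypothetical denormalized rows matching the columns above:
# (date, initiator_country, responder_country, topic,
#  interaction_type, initiator_top_interest, responder_top_interest)
rows = [
    ("2018-01-05", "US", "DE", "music", "friend_request", "rock", "ACDC"),
    ("2018-02-11", "US", "DE", "music", "message",        "rock", "ACDC"),
    ("2018-03-20", "US", "FR", "music", "message",        "rock", "jazz"),
    ("2019-01-02", "US", "DE", "music", "message",        "rock", "ACDC"),
]

# Question-1-style filter: 2018 interactions by US initiators with
# German responders whose top interest is ACDC. Each predicate is one
# of the "changing filter criteria" and can be swapped independently.
n = sum(1 for r in rows
        if r[0].startswith("2018")
        and r[1] == "US" and r[2] == "DE" and r[6] == "ACDC")
print(n)  # 2

# Grouping is just as direct: 2018 interactions with German responders,
# broken down by interaction type.
by_type = Counter(r[4] for r in rows
                  if r[0].startswith("2018") and r[2] == "DE")
print(dict(by_type))
```

Because every filterable attribute lives on the same row, every query is a single scan with cheap predicates, which is exactly the access pattern columnar warehouses are optimized for.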

As you can tell, this is just scratching the surface of a massive topic, but what you’re asking is definitely doable & hopefully this post will help you get started. There are plenty of consulting companies who would be happy to help, as well. (Disclaimer: I work for one of those consulting companies 🙂)

User contributions licensed under: CC BY-SA