Skip to content
Advertisement

MariaDB Pivot Table Performace

I have a table containing data with dynamic categories:

+----------+--------------+---------------+---------+
| category | string_value | integer_value | user_id |
+----------+--------------+---------------+---------+
| cat_1    | NULL         | 1             |       1 |
| cat_1    | NULL         | 3             |       2 |
| cat_2    | foo          | NULL          |       1 |
| cat_2    | bar          | NULL          |       2 |
+----------+--------------+---------------+---------+

I need a pivoted version of this table which I use with statement:

select
  user_id,
  max(case when category == 'cat_1' then integer_value end) as 'cat_1',
  max(case when category == 'cat_2' then string_value end) as 'cat_2',
from my_table
group by user_id

This creates results in the following format:

+---------+-------+-------+
| user_id | cat_1 | cat_2 |
+---------+-------+-------+
|       1 |     1 | foo   |
|       2 |     3 | bar   |
+---------+-------+-------+

This query by itself is performing well also for many categories and table entries (e.g., for 8 categories and 240k entries, it takes around 20ms), but if i wrap this exact query in a select * from <query>, performance drops significantly (to 650ms).

Also, ordering by user_id does not influence the performance significantly, whereas ordering by any other field also causes a performance drop, even if an index of the respective field and user_id exists. I’m guessing that this approach in itself is not feasible for larger tables? However, I am curious what causes the extra execution time when adding the select * from <query part.

Background: I try to use this query to store dynamic user data, and I’d like to prevent changes to the table structure at runtime (i.e., adding a column). Any alternative would be welcome. I’m using MariaDB 10.5.5, and I need the solution to also work with MySQL 5.7 and SQL Server 2019.

Execution Plans:

Without surrounding select * from:

+----+-------------+-----------+-------+---------------+------------+---------+-----+--------+---------+----------+------------+-------+    
| id | select_type | table     | type  | possible_keys | key        | key_len | ref | rows   | r_rows  | filtered | r_filtered | Extra |
|----|-------------|-----------|-------|---------------|------------|---------|-----|--------|---------|----------|------------|-------|
|  1 | SIMPLE      | user_data | index |               | user_index |         |   9 | 226067 | 1619.00 |    100.0 |      99.88 |       |
+----+-------------+-----------+-------+---------------+------------+---------+-----+--------+---------+----------+------------+-------+

With surrounding select * from:

+----+-------------+------------+-------+---------------+------------+---------+-----+--------+-----------+----------+------------+-------+ 
| id | select_type | table      | type  | possible_keys | key        | key_len | ref | rows   | r_rows    | filtered | r_filtered | Extra |
|----|-------------|------------|-------|---------------|------------|---------|-----|--------|-----------|----------|------------|-------|
|  1 | PRIMARY     | <derived2> | ALL   |               |            |         |     | 226067 |    200.00 |    100.0 |      100.0 |       |
|  2 | DERIVED     | user_data  | index |               | user_index |       9 |     | 226067 | 242418.00 |    100.0 |      100.0 |       |
+----+-------------+------------+-------+---------------+------------+---------+-----+--------+-----------+----------+------------+-------+ 

Advertisement

Answer

Here is my speculation on what is happening.

You have an index on the underlying table which MariaDB uses for the aggregation. That means that no sorting is done . . . and just by reading the index, it can start returning rows.

This is a very nice feature. But when you just run the query, you are seeing the time to first row.

When you use a derived table, MariaDB must generate all the rows before returning any of them. So, the select * with a subquery is doing much more work.

That is why the second version is slower than the first. I would expect a query that returns many tens of thousands of rows to take more than 20 milliseconds on most machines.

User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement