
Optimization of GROUP BY CUBE in SQL Server

I would like to run a GROUP BY CUBE on a table with 9 columns and more than 107 million rows. Here is an example of my code:

 select     id
            ,case when grouping(cod_01) = 0 then cod_01 else 0   end cod_01
            ,case when grouping(cod_02) = 0 then cod_02 else 0   end cod_02
            ,case when grouping(cod_03) = 0 then cod_03 else 0   end cod_03
            ,case when grouping(cod_04) = 0 then cod_04 else 0   end cod_04
            ,case when grouping(cod_05) = 0 then cod_05 else 0   end cod_05
            ,case when grouping(input)  = 0 then input  else '0' end cod_input
            ,date
            ,historical
            ,COUNT(distinct pp) value
     from tmp.test
     where final_state in ('A','B')
     group by id
              ,cod_01
              ,cod_02
              ,cod_03
              ,cod_04
              ,cod_05
              ,input
              ,date 
              ,historical
               with cube 
     having GROUPING(id) = 0
            and GROUPING(cod_02) = 0
            and GROUPING(cod_03) = 0
            and GROUPING(date) = 0
            and GROUPING(historical) = 0

This is running in SQL Server.

For 10K rows it takes 7 seconds, but for the full 107 million rows it took more than 24 hours.

How can I improve my query? Is there a better way to do it?


Answer

Apart from something obvious, like adding an index on the final_state column (if it is selective enough) or even creating a covering index (heavy on storage, and it will degrade insert/update performance), you could use Grouping Sets instead of Cube + Having.
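For illustration, a covering index for this query might look like the sketch below. The index name is hypothetical, and the INCLUDE list is an assumption based on the columns the query above reads; as noted, this costs storage and slows down writes:

     -- Hypothetical covering index: seek on final_state, include every
     -- column the SELECT and GROUP BY touch so the scan never hits the table.
     create nonclustered index ix_test_final_state
         on tmp.test (final_state)
         include (id, cod_01, cod_02, cod_03, cod_04, cod_05,
                  input, date, historical, pp);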

It will aggregate the data by column combinations you actually need, instead of first calculating all possible combinations with Cube and then filtering them with Having. This might be faster, but if the result of this query also has tens of millions of rows, don’t expect any fireworks.
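In the query above, the HAVING clause forces id, cod_02, cod_03, date and historical to always be grouped, so only cod_01, cod_04, cod_05 and input actually vary. That is a partial cube, which SQL Server (2008+) lets you write directly: CUBE over just those four columns is shorthand for the 16 grouping sets you actually need, and the HAVING filter disappears. A sketch of the rewrite (untested against your data):

     select     id
                ,case when grouping(cod_01) = 0 then cod_01 else 0   end cod_01
                ,cod_02
                ,cod_03
                ,case when grouping(cod_04) = 0 then cod_04 else 0   end cod_04
                ,case when grouping(cod_05) = 0 then cod_05 else 0   end cod_05
                ,case when grouping(input)  = 0 then input  else '0' end cod_input
                ,date
                ,historical
                ,COUNT(distinct pp) value
      from tmp.test
      where final_state in ('A','B')
      group by id
               ,cod_02
               ,cod_03
               ,date
               ,historical
               ,cube (cod_01, cod_04, cod_05, input)

The five always-grouped columns no longer need their CASE/GROUPING wrappers, because they are never rolled up.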

I tested this on my server (MSSQL 2012), and it turned out that the query with Cube + Having performed 6 separate index scans and then concatenated the streams, while the query with Grouping Sets that yields the same result performed only one scan and was a few times faster.

User contributions licensed under: CC BY-SA