What is the difference between cube, rollup and groupBy operators?

I can’t find any detailed documentation regarding the differences.

I do notice a difference: when I interchange cube and groupBy function calls, I get different results. With cube, I got a lot of null values in the expressions that I had previously passed to groupBy.

Answer

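These are not intended to work in the same way. groupBy is simply the equivalent of the GROUP BY clause in standard SQL. In other words, a groupBy call on a list of columns, followed by an aggregation, is equivalent to a SELECT of those columns and aggregate expressions with a GROUP BY on the same columns.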

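cube is equivalent to the CUBE extension to GROUP BY. It takes a list of columns and applies aggregate expressions to all possible combinations of the grouping columns. Let's say you have a table with two columns, x and y, and you compute cube(x, y) with count as the aggregation. The result contains a count for every (x, y) pair, for every x alone, for every y alone, and for the whole table. A column that is left out of a combination shows up as null, which is where the extra null values come from.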
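To make the cube semantics concrete, here is a minimal sketch in plain Python rather than Spark. The sample rows and the `cube_count` helper are hypothetical and only emulate the behaviour described above; `None` plays the role of null.

```python
from collections import Counter
from itertools import product

# Hypothetical sample rows (x, y), chosen only for illustration.
rows = [("foo", 1), ("foo", 2), ("bar", 2), ("bar", 2)]

def cube_count(rows):
    """Count rows for every combination of kept and wildcarded columns."""
    counts = Counter()
    for row in rows:
        # Each mask says which columns keep their value (True)
        # and which are replaced by the None wildcard (False).
        for mask in product([True, False], repeat=len(row)):
            key = tuple(v if keep else None for v, keep in zip(row, mask))
            counts[key] += 1
    return counts

cube = cube_count(rows)
for key, n in cube.items():
    print(key, n)
```

Every input row lands in four groups here: its own (x, y) pair, its x alone, its y alone, and the grand total.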

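A similar function to cube is rollup, which computes hierarchical subtotals from left to right. For rollup(x, y) with count, that means a count for every (x, y) pair, a subtotal for every x, and a grand total, but no subtotals for y alone.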
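The left-to-right hierarchy can be sketched the same way. This is again plain Python with hypothetical sample rows and a hypothetical `rollup_count` helper, not Spark's implementation.

```python
from collections import Counter

# Hypothetical sample rows (x, y), chosen only for illustration.
rows = [("foo", 1), ("foo", 2), ("bar", 2), ("bar", 2)]

def rollup_count(rows):
    """Count rows per prefix of the grouping columns, wildcarding the rest."""
    counts = Counter()
    for row in rows:
        # Keep the first k columns and replace the remaining ones with None,
        # for k = number of columns down to 0 (the grand total).
        for k in range(len(row), -1, -1):
            key = row[:k] + (None,) * (len(row) - k)
            counts[key] += 1
    return counts

rollup = rollup_count(rows)
for key, n in rollup.items():
    print(key, n)
```

Compared with cube, the groups where only y is kept, such as (None, 2), never appear.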

Just for comparison, plain groupBy(x, y) with count returns exactly one row per distinct (x, y) pair, with no subtotals and no null placeholders.
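As a rough sketch with the same kind of hypothetical sample rows, plain grouping in Python is just a count per distinct tuple:

```python
from collections import Counter

# Hypothetical sample rows (x, y), chosen only for illustration.
rows = [("foo", 1), ("foo", 2), ("bar", 2), ("bar", 2)]

# Each row contributes to exactly one group: its own (x, y) pair.
group_by = Counter(rows)

for key, n in group_by.items():
    print(key, n)
```

The counts add up to the number of input rows, which is not the case for cube or rollup.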

To summarize:

  • When using plain GROUP BY every row is included only once in its corresponding summary.
  • With GROUP BY CUBE(...), every row is included in the summary of each combination of levels it represents, wildcards included. Logically, cube(x, y) with count is equivalent to a UNION ALL of four queries (assuming we could use NULL placeholders): one with no grouping, one grouped by x, one grouped by y, and one grouped by x and y, with NULL standing in for each column that is left out.
  • GROUP BY ROLLUP(...) is similar to CUBE but works hierarchically, filling columns from left to right.
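The UNION ALL reading of CUBE can be checked with a small sketch. It uses Python's built-in sqlite3 module only because it ships with Python; the table name `t`, the sample rows, and the variable names are hypothetical, and the direct computation is an emulation, not a database's CUBE implementation.

```python
import sqlite3
from collections import Counter
from itertools import product

# Hypothetical sample rows (x, y), chosen only for illustration.
rows = [("foo", 1), ("foo", 2), ("bar", 2), ("bar", 2)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x TEXT, y INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

# One GROUP BY per combination of grouping columns, NULL for the rest.
union_sql = """
SELECT NULL AS x, NULL AS y, COUNT(*) AS n FROM t
UNION ALL
SELECT x, NULL, COUNT(*) FROM t GROUP BY x
UNION ALL
SELECT NULL, y, COUNT(*) FROM t GROUP BY y
UNION ALL
SELECT x, y, COUNT(*) FROM t GROUP BY x, y
"""
union_result = {(x, y): n for x, y, n in conn.execute(union_sql)}

# Direct emulation of cube(x, y) with count, for comparison.
direct = Counter()
for row in rows:
    for mask in product([True, False], repeat=len(row)):
        direct[tuple(v if keep else None for v, keep in zip(row, mask))] += 1

print(union_result == dict(direct))
```

Dropping the query that groups by y alone from the UNION ALL gives the rollup(x, y) result instead.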

ROLLUP and CUBE come from data warehousing extensions, so if you want a better understanding of how they work, you can also check the documentation of your favorite RDBMS. For example, PostgreSQL introduced both in 9.5, and they are relatively well documented.

User contributions licensed under: CC BY-SA