
Distinct performance in Redshift

I am trying to populate multiple dimension tables from a single base table.

Sample Base Table:

| id | emp_name | emp_surname | country | dept | university |
|----|-----------|-------------|---------|------|------------|
| 1  | AAA       | ZZZ         | USA     | CE   | U_01       |
| 2  | BBB       | XXX         | IND     | CE   | U_01       |
| 3  | CCC       | XXX         | CAN     | IT   | U_02       |
| 4  | CCC       | ZZZ         | USA     | MECH | U_01       |

Required dimension tables:

  1. emp_name_dim with values – AAA,BBB,CCC
  2. emp_surname_dim with values – ZZZ,XXX
  3. country_dim with values – USA,IND,CAN
  4. dept_dim with values – CE,IT,MECH
  5. university_dim with values – U_01,U_02

To populate the above dimension tables from the base table, I am considering two approaches:

  1. Get the distinct values from the base table for the combination of all the above columns, create a single temp table from that, and use that temp table to build each individual dimension table. Here I read the base table only once, but over a wider column combination.

  2. Create a separate temp table of distinct values for each dimension. This requires reading the base table multiple times, but each temp table is smaller (fewer rows, and only a single column's distinct values).
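For reference, approach 1 might be sketched like this (the base table name `base_emp` is an illustrative assumption, using the sample columns above):

```sql
-- Approach 1 (sketch): one scan of the base table builds a single
-- staging table of distinct column combinations.
-- "base_emp" is an assumed table name for illustration.
CREATE TEMP TABLE dim_staging AS
SELECT DISTINCT emp_name, emp_surname, country, dept, university
FROM base_emp;

-- Each dimension table is then populated from the much smaller
-- staging table instead of re-scanning base_emp.
INSERT INTO country_dim (country)
SELECT DISTINCT country FROM dim_staging;

INSERT INTO dept_dim (dept)
SELECT DISTINCT dept FROM dim_staging;
```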

Which approach is better for performance?

Note :

  1. The base table is huge, containing millions of rows.
  2. The above columns are just a sample. The actual table has around 50 columns that I need to consider for the distinct combination.


Answer

Scanning the large table only once is the way to go.

There is also another way to get the distinct values that in some cases will be faster than DISTINCT: perform a GROUP BY on all of the columns. Run both as a bake-off to see which is faster. In general, if the result set is small enough to fit in memory, DISTINCT will be faster; if the result set is large, GROUP BY will be faster. There are many corner cases and factors (such as distribution style) that can affect this rule of thumb, so timing both will tell you which is faster in your case.
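The two forms being compared might look like this (sample columns from the question; `base_emp` is an assumed table name):

```sql
-- DISTINCT form: one row per unique combination of the listed columns.
SELECT DISTINCT emp_name, emp_surname, country, dept, university
FROM base_emp;

-- Equivalent GROUP BY form: produces the same rows, but the planner
-- may choose a different (e.g. hash-aggregate) execution strategy.
SELECT emp_name, emp_surname, country, dept, university
FROM base_emp
GROUP BY emp_name, emp_surname, country, dept, university;
```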

Given that you have 50 columns and you want all the unique combinations, I'd guess that the output set will be large and that GROUP BY will win, but this is just a guess.
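For the bake-off, Redshift's EXPLAIN can show whether the two queries get different plans before you time full runs (again using the assumed `base_emp` table name):

```sql
-- Inspect the plan for each variant without executing it.
EXPLAIN
SELECT DISTINCT emp_name, emp_surname, country, dept, university
FROM base_emp;

EXPLAIN
SELECT emp_name, emp_surname, country, dept, university
FROM base_emp
GROUP BY 1, 2, 3, 4, 5;  -- ordinal references to the select list
```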
