Skip to content
Advertisement

when using a case statement to aggregate fields in redshift, is it more performant to replace binary fields with 1s and 0s?

For example, which of the following calculations should perform faster?

sum ( 
    case 
        when fieldA is not null
        then 1
        else 0
    end ) total

or

sum ( 
    case 
        when fieldA is not null
        then fieldB -- binary field, 1 or 0. 
    end ) total

For the sake of this example, assume that when fieldA is not null, fieldB will always equal 1. fieldB can also equal 1 if fieldA is null, which is why I use the case statement.

Advertisement

Answer

The two queries do not do the same thing, unless fieldB is uniformly 1 (or uniformly 1 when fieldA is not NULL). In general, you should run the query that does what you really need.

Redshift is a columnar database. That means that every column used in a query adds overhead to the execution.

Hence, it is better to avoid reading a column if you can. Of course, if the column is referenced elsewhere in the query, then this does not apply.

In addition, SUM() operates on numbers. I’m not sure if “binary” means that the value is a number. If not, then it needs to be converted, which also adds overhead.

User contributions licensed under: CC BY-SA
4 People found this is helpful
Advertisement