Skip to content
Advertisement

Is SQL floating point sum affected by the order-by clause?

Unlike the pure mathematical real numbers, or even the rational numbers, the floating-point number types are not commutative-associative. This means, as is commonly known in numerical coding, the order of a list of floating-point numbers affects the value of the floating-point sum. It can be a good idea to sort the list to put the smaller numbers first before adding up.

An SQL table does have an order. And this can be specified in an order-by clause.

Does the SQL engine sum a field in the order given in the order-by clause?

Can I force the SQL engine to add up starting with the smallest first by sorting a list of positive floating point numbers in ascending order?

Can I force the engine to add up largest first? Or is the order of summation not dependent on the order of the tables? Or perhaps the summation order is related to the order-by order in a complicated and indeterminate manner?

I recognize that this might depend on the choice of SQL engine. My core interest is in SnowFlake.

Advertisement

Answer

Yes, floating point numbers are order specific, and yes, floating point numbers in Snowflake are impacted by this, they have an article somewhere talking about sort order instability and it’s impact on things like ORDER-BY’s, which is the opposite direction from your question.

And in theory yes, ordering a sub-select would allow for controlling for order problems. Unlike for example MS Sql Server which does not allow for ORDER-BY’s on sub-select, Snowflake does allow you to do them. The real issue is do they always respect it. And I would tend to believe they don’t, as most of the operations are parallelable, or which summing is one that is normally “safe”, thus I would suspect you will not be able to force the order to be respected.

One idea would I had that I though would work, was to write the values to a temporary table, with the order by on that, but then you get the problem of read from that will equally likely get parallelized. You could width_bucket to chunk the values into batches and then sum those, and then sum the sums. but again within the chunks you will still get instability.

At this point is numerical stability is your primary concern and the above hackery does not meet your needs I will flip to NUMBER and use the more decimal place form.

User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement