Skip to content
Advertisement

SQLAlchemy – How to count distinct on multiple columns

I have this query:

SELECT COUNT(DISTINCT Serial, DatumOrig, Glucose) FROM values;

I’ve tried to recreate it with SQLAlchemy this way:

session.query(Value.Serial, Value.DatumOrig, Value.Glucose).distinct().count()

But this translates to this:

SELECT count(*) AS count_1
    FROM (SELECT DISTINCT 
           values.`Serial` AS `values_Serial`, 
           values.`DatumOrig` AS `values_DatumOrig`,
           values.`Glucose` AS `values_Glucose`
          FROM values)
    AS anon_1

Which does not call the count function inline but wraps the select distinct into a subquery.

My question is: What are the different ways with SQLAlchemy to count a distinct select on multiple columns and what are they translating into?

Is there any solution which would translate into my original query? Is there any serious difference in performance or memory usage?

Advertisement

Answer

First off, I think that COUNT(DISTINCT) supporting more than 1 expression is a MySQL extension. You can kind of achieve the same in for example PostgreSQL with ROW values, but the behaviour is not the same regarding NULL. In MySQL if any of the value expressions evaluate to NULL, the row does not qualify. That also leads to what is different between the two queries in the question:

  1. If any of Serial, DatumOrig, or Glucose is NULL in the COUNT(DISTINCT) query, that row does not qualify or in other words does not count.
  2. COUNT(*) is the cardinality of the subquery anon_1, or in other words the count of rows. SELECT DISTINCT Serial, DatumOrig, Glucose will include (distinct) rows with NULL.

Looking at EXPLAIN output for the 2 queries it looks like the subquery causes MySQL to use a temporary table. That will likely cause a performance difference, especially if it is materialized on disk.

Producing the multi valued COUNT(DISTINCT) query in SQLAlchemy is a bit tricky, because count() is a generic function and implemented closer to the SQL standard. It only accepts a single expression as its (optional) positional argument and the same goes for distinct(). If all else fails, you can always revert to text() fragments, like in this case:

# NOTE: text() fragments are included in the query as is, so if the text originates
# from an untrusted source, the query cannot be trusted.
session.query(func.count(distinct(text("`Serial`, `DatumOrig`, `Glucose`")))).
    select_from(Value).
    scalar()

which is far from readable and maintainable code, but gets the job done now. Another option is to write a custom construct that implements the MySQL extension, or rewrite the query as you have attempted. One way to form a custom construct that produces the required SQL would be:

from itertools import count
from sqlalchemy import func, distinct as _distinct

def _comma_list(exprs):
    # NOTE: Magic number alert, the precedence value must be large enough to avoid
    # producing parentheses around the "comma list" when passed to distinct()
    ps = count(10 + len(exprs), -1)
    exprs = iter(exprs)
    cl = next(exprs)
    for p, e in zip(ps, exprs):
        cl = cl.op(',', precedence=p)(e)

    return cl

def distinct(*exprs):
    return _distinct(_comma_list(exprs))

session.query(func.count(distinct(
    Value.Serial, Value.DatumOrig, Value.Glucose))).scalar()

User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement