SQLAlchemy – How to count distinct on multiple columns

Question

I have this query: I&#8217;ve tried to recreate it with SQLAlchemy this way: But this translates to this: Which does not call the count function inline but wraps the select distinct into a subquery. My question is: What are the different ways with SQLAlchemy to count a distinct select on multiple columns and …

Accepted Answer

First off, I think that COUNT(DISTINCT) supporting more than 1 expression is a MySQL extension. You can kind of achieve the same in for example PostgreSQL with ROW values, but the behaviour is not the same regarding NULL. In MySQL if any of the value expressions evaluate to NULL, the row does not qualify. That also leads to what is different between the two queries in the question:If any of Serial, DatumOrig, or Glucose is NULL in the COUNT(DISTINCT) query, that row does not qualify or in other words does not count.COUNT(*) is the cardinality of the subquery anon_1, or in other words the count of rows. SELECT DISTINCT Serial, DatumOrig, Glucose will include (distinct) rows with NULL.Looking at EXPLAIN output for the 2 queries it looks like the subquery causes MySQL to use a temporary table. That will likely cause a performance difference, especially if it is materialized on disk.Producing the multi valued COUNT(DISTINCT) query in SQLAlchemy is a bit tricky, because count() is a generic function and implemented closer to the SQL standard. It only accepts a single expression as its (optional) positional argument and the same goes for distinct(). If all else fails, you can always revert to text() fragments, like in this case:# NOTE: text() fragments are included in the query as is, so if the text originates# from an untrusted source, the query cannot be trusted.session.query(func.count(distinct(text("`Serial`, `DatumOrig`, `Glucose`")))).    select_from(Value).    scalar()which is far from readable and maintainable code, but gets the job done now. Another option is to write a custom construct that implements the MySQL extension, or rewrite the query as you have attempted. One way to form a custom construct that produces the required SQL would be:from itertools import countfrom sqlalchemy import func, distinct as _distinctdef _comma_list(exprs):    # NOTE: Magic number alert, the precedence value must be large enough to avoid    # producing parentheses around the "comma list" when passed to distinct()    ps = count(10 + len(exprs), -1)    exprs = iter(exprs)    cl = next(exprs)    for p, e in zip(ps, exprs):        cl = cl.op(',', precedence=p)(e)    return cldef distinct(*exprs):    return _distinct(_comma_list(exprs))session.query(func.count(distinct(    Value.Serial, Value.DatumOrig, Value.Glucose))).scalar()

Advertisement

Answer