I have this query:
SELECT COUNT(DISTINCT Serial, DatumOrig, Glucose) FROM values;
I’ve tried to recreate it with SQLAlchemy
this way:
session.query(Value.Serial, Value.DatumOrig, Value.Glucose).distinct().count()
But this translates to this:
SELECT count(*) AS count_1 FROM (SELECT DISTINCT values.`Serial` AS `values_Serial`, values.`DatumOrig` AS `values_DatumOrig`, values.`Glucose` AS `values_Glucose` FROM values) AS anon_1
Which does not call the count function inline but wraps the select distinct into a subquery.
My question is: What are the different ways with SQLAlchemy to count a distinct select on multiple columns and what are they translating into?
Is there any solution which would translate into my original query? Is there any serious difference in performance or memory usage?
Advertisement
Answer
First off, I think that COUNT(DISTINCT)
supporting more than 1 expression is a MySQL extension. You can kind of achieve the same in for example PostgreSQL with ROW
values, but the behaviour is not the same regarding NULL. In MySQL if any of the value expressions evaluate to NULL, the row does not qualify. That also leads to what is different between the two queries in the question:
- If any of
Serial
,DatumOrig
, orGlucose
is NULL in theCOUNT(DISTINCT)
query, that row does not qualify or in other words does not count. COUNT(*)
is the cardinality of the subqueryanon_1
, or in other words the count of rows.SELECT DISTINCT Serial, DatumOrig, Glucose
will include (distinct) rows with NULL.
Looking at EXPLAIN
output for the 2 queries it looks like the subquery causes MySQL to use a temporary table. That will likely cause a performance difference, especially if it is materialized on disk.
Producing the multi valued COUNT(DISTINCT)
query in SQLAlchemy is a bit tricky, because count()
is a generic function and implemented closer to the SQL standard. It only accepts a single expression as its (optional) positional argument and the same goes for distinct()
. If all else fails, you can always revert to text()
fragments, like in this case:
# NOTE: text() fragments are included in the query as is, so if the text originates # from an untrusted source, the query cannot be trusted. session.query(func.count(distinct(text("`Serial`, `DatumOrig`, `Glucose`")))). select_from(Value). scalar()
which is far from readable and maintainable code, but gets the job done now. Another option is to write a custom construct that implements the MySQL extension, or rewrite the query as you have attempted. One way to form a custom construct that produces the required SQL would be:
from itertools import count from sqlalchemy import func, distinct as _distinct def _comma_list(exprs): # NOTE: Magic number alert, the precedence value must be large enough to avoid # producing parentheses around the "comma list" when passed to distinct() ps = count(10 + len(exprs), -1) exprs = iter(exprs) cl = next(exprs) for p, e in zip(ps, exprs): cl = cl.op(',', precedence=p)(e) return cl def distinct(*exprs): return _distinct(_comma_list(exprs)) session.query(func.count(distinct( Value.Serial, Value.DatumOrig, Value.Glucose))).scalar()