Excuse my ignorance about this… I’m taking a data analysis course and I stumbled upon this query in an exercise:
SELECT CASE WHEN MIN(REGEXP_CONTAINS(STRING(ActivityDate), DATE_REGEX)) = TRUE THEN "Valid" ELSE "Not Valid" END AS valid_test FROM `tracker_data_clean.daily-activity-clean`;
ActivityDate
is a field that contains date
type data and DATE_REGEX
is a regular expression variable for a date format string.
What I don’t know, is what does taking the MIN()
of this boolean expression REGEX_CONTAINS
do or mean.
I would appreciate if any of you could help me understand the concept of doing this.
Thanks !
Advertisement
Answer
The query selects rows from the table and applies the REGEXP_CONTAINS()
function to every (string-converted) value in the ActivityDate
column. REGEXP_CONTAINS()
will either return true
or false
based on whether the value matches the regex pattern in DATE_REGEX
.
How MIN()
behaves here can vary by implementation:
- Booleans might be coerced as integers, so
MIN()
is evaluating0
‘s and1
‘s. If all the values are1
(true
),MIN()
will be1
(true
), otherwise it will be0
(false
). - Other implementations might evaluate booleans directly, so
MIN()
returnstrue
if all of the values aretrue
, because the minimum value istrue
(true
being “greater” thanfalse
), otherwise it returnsfalse
.
The result, based on the implementation, is that MIN()
returns 0
/1
, or false
/true
. Either way, that result is compared to true
in the CASE
statement. If all values matched the regex, the comparison will be true
.
Basically, the query is “does every row have a valid date in the ActivityDate
column?” The result will be a table with a single column valid_test
and one row, containing “Valid” if they all match, “Not Valid” otherwise.
Another way to look at it that would be relatable to some programming languages is that MIN(bool_function())
is analogous to all()
, meaning return true
if all values are true
. Similarly, MAX(bool_function())
would be analogous to any()
, meaning return true
if any value is true
.