I would like to count the number of unique values based on the tx_id
, here is part of the raw data:
table : Treatment Record +------------------+-----------+----------------+------------------+ | SN | tx_id | pa3 | pa4 | +------------------+-----------+----------------+------------------+ | I2120210007014 | 149362 | V16F2021117016 | V15S2021145018 | | I2120210007014 | 149362 | V15S2021144019 | V15S2021145018 | | I2120210007014 | 149362 | V16F2021117017 | V15S2021145018 | | I2120210007014 | 149362 | V16F2021117017 | V15S2021145018 | | I2120210007014 | 149362 | V16F2021117017 | V15S2021145018 | | I2120210007014 | 148716 | V15C2021116010 | V15C20211091016| +------------------+-----------+----------------+------------------+
For example, the result should look like below:
+------------------+-----------+----------------+-------+ | SN | tx_id | V16F | V15S | V15C | +------------------+-----------+-------+--------+-------+ | I2120210007014 | 149362 | 2 | 2 | 0 | | I2120210007014 | 148716 | 0 | 0 | 2 | +------------------+-----------+----------------+-------+
From the raw data, you can see that there are two different tx_id
and I use this to identify each of the group. Therefore, for insance, all the tx_id = '149362
is in the same group.
And in the column of pa3
and pa4
, there are 2 different group which can be categorized by looking at the first 4 characters, like “V16F”, “V15S”. Moreover, I have to count the number of different wordings in the same group. For example, you can see that column pa3
contains V16F2021117016
, V15S2021144019
, V16F2021117017
while column pa4
contains only . Therefore, there are V15S2021145018
.
Therefore, we count 2 for the group “V16F” and 2 for group “V15S”. You may notice that the counting is not based on the columns pa3
and pa4
but based on the last 4 characters. For example V16F2021117016
andV16F2021117017
, they belong to the same group,”V16F”, but different words since the last 4 characters are ‘7016’ and ‘7017’ respectively.
However I could not find a way out at this moment and only typed some sql code in below. Hopes someone can help me.
SELECT tx_id, sum(case when val like 'V16F%' then 1 else 0 end), sum(case when val2 like 'V15S%' then 1 else 0 end) FROM ( select tx_id, pa3 as val, pa4 as val2 from Cool group by pa3, pa4) GROUP BY tx_id
Here is the wrong output:
+------------------+-----------+----------------+ | SN | tx_id | V16F | V15S | +------------------+-----------+-------+--------+ | I2120210007014 | 149362 | 3 | 3 | | I2120210007014 | 148716 | 0 | 0 | +------------------+-----------+----------------+
Advertisement
Answer
The simplest way to do it is to use UNION ALL
to get all pa3
s and pa4
s in 1 column and then aggregate:
SELECT SN, tx_id, COUNT(DISTINCT CASE WHEN pa LIKE 'V16F%' THEN pa END) V16F, COUNT(DISTINCT CASE WHEN pa LIKE 'V15S%' THEN pa END) V15S, COUNT(DISTINCT CASE WHEN pa LIKE 'V15C%' THEN pa END) V15C FROM ( SELECT SN, tx_id, pa3 pa FROM tablename UNION ALL SELECT SN, tx_id, pa4 pa FROM tablename ) t GROUP BY SN, tx_id
Or, with UNION
, which removes duplicate rows, so there is no need for DISTINCT
:
SELECT SN, tx_id, COUNT(CASE WHEN pa LIKE 'V16F%' THEN pa END) V16F, COUNT(CASE WHEN pa LIKE 'V15S%' THEN pa END) V15S, COUNT(CASE WHEN pa LIKE 'V15C%' THEN pa END) V15C FROM ( SELECT SN, tx_id, pa3 pa FROM tablename UNION SELECT SN, tx_id, pa4 pa FROM tablename ) t GROUP BY SN, tx_id
Which can be further simplified to:
SELECT SN, tx_id, SUM(pa LIKE 'V16F%') V16F, SUM(pa LIKE 'V15S%') V15S, SUM(pa LIKE 'V15C%') V15C FROM ( SELECT SN, tx_id, pa3 pa FROM tablename UNION SELECT SN, tx_id, pa4 pa FROM tablename ) t GROUP BY SN, tx_id
Another way, is to directly use conditional aggregation with more complicated logic that works for this sample data:
SELECT SN, tx_id, COUNT(DISTINCT CASE WHEN pa3 LIKE 'V16F%' THEN pa3 END) + COUNT(DISTINCT CASE WHEN pa4 LIKE 'V16F%' THEN pa4 END) - SUM(pa3 = pa4) V16F, COUNT(DISTINCT CASE WHEN pa3 LIKE 'V15S%' THEN pa3 END) + COUNT(DISTINCT CASE WHEN pa4 LIKE 'V15S%' THEN pa4 END) - SUM(pa3 = pa4) V15S, COUNT(DISTINCT CASE WHEN pa3 LIKE 'V15C%' THEN pa3 END) + COUNT(DISTINCT CASE WHEN pa4 LIKE 'V15C%' THEN pa4 END) - SUM(pa3 = pa4) V15C FROM tablename GROUP BY SN, tx_id
See the demo.