I have a table (postgres) with a varchar field that has content structured like:
".. John;Smith;uuid=7c32e9e1-e29e-4211-b11e-e20b2cb78da9 .."
The uuid can occur in more than one record. But it must not occur for more than one combination of [givenname];[surname], according to a business rule.
That is, if the John Smith example above is present in the table, then if uuid 7c32e9e1.. occurs in any other record, the field in that record most also contain “.. John;Smith; ..”
The problem is, this business rule has been violated due to some bug. And I would like to know how many rows in the table contains a uuid such that it occurs in more than one place with different combinations of [givenname];[surname].
I’d appreciate if someone could help me out with the SQL to accomplish this.
Advertisement
Answer
Use regular expressions to extract the UUID and the name from the string. Then aggregate per UUID and either count distinct names or compare minimum and maximum name:
select substring(col, 'uuid=([[:alnum:]]+)') as uuid, string_agg(distinct substring(col, '([[:alnum:]]+;[[:alnum:]]+);uuid'), ' | ') as names from mytable group by substring(col, 'uuid=([[:alnum:]]+)') having count(distinct substring(col, '([[:alnum:]]+;[[:alnum:]]+);uuid')) > 1;
Demo: https://dbfiddle.uk/?rdbms=postgres_12&fiddle=907a283a754eb7427d4ffbf50c6f0028
If you only want to count:
select count(*) as cnt_uuids, sum(num_names) as cnt_names, sum(num_rows) as cnt_rows from ( select count(*) as num_rows, count(distinct substring(col, '([[:alnum:]]+;[[:alnum:]]+);uuid')) as num_names from mytable group by substring(col, 'uuid=([[:alnum:]]+)') having count(distinct substring(col, '([[:alnum:]]+;[[:alnum:]]+);uuid')) > 1 ) flaws;
But as has been mentioned already: This is not how a database should be used.