Skip to content
Advertisement

tricky SQL with substrings

I have a table (postgres) with a varchar field that has content structured like:

".. John;Smith;uuid=7c32e9e1-e29e-4211-b11e-e20b2cb78da9 .."

The uuid can occur in more than one record. But it must not occur for more than one combination of [givenname];[surname], according to a business rule.

That is, if the John Smith example above is present in the table, then if uuid 7c32e9e1.. occurs in any other record, the field in that record most also contain “.. John;Smith; ..”

The problem is, this business rule has been violated due to some bug. And I would like to know how many rows in the table contains a uuid such that it occurs in more than one place with different combinations of [givenname];[surname].

I’d appreciate if someone could help me out with the SQL to accomplish this.

Advertisement

Answer

Use regular expressions to extract the UUID and the name from the string. Then aggregate per UUID and either count distinct names or compare minimum and maximum name:

select
  substring(col, 'uuid=([[:alnum:]]+)') as uuid,
  string_agg(distinct substring(col, '([[:alnum:]]+;[[:alnum:]]+);uuid'), ' | ') as names
from mytable
group by substring(col, 'uuid=([[:alnum:]]+)')
having count(distinct substring(col, '([[:alnum:]]+;[[:alnum:]]+);uuid')) > 1;

Demo: https://dbfiddle.uk/?rdbms=postgres_12&fiddle=907a283a754eb7427d4ffbf50c6f0028

If you only want to count:

select
 count(*) as cnt_uuids,
 sum(num_names) as cnt_names,
 sum(num_rows) as cnt_rows
from
(
 select
 count(*) as num_rows,
 count(distinct substring(col, '([[:alnum:]]+;[[:alnum:]]+);uuid')) as num_names
 from mytable
 group by substring(col, 'uuid=([[:alnum:]]+)')
 having count(distinct substring(col, '([[:alnum:]]+;[[:alnum:]]+);uuid')) > 1
) flaws;

But as has been mentioned already: This is not how a database should be used.

User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement