Skip to content
Advertisement

Avoiding duplicates using stuff function and for xml path

I’m having a problem with duplicates using mysql server with stuff function. The database contains few million records and using distinct is out of question (it seems that the query does not even process).

This is my database structure:

PersonID    Freetext    Importance
PersonID    Freetext    Importance
PersonID    Freetext    Importance
PersonID    Freetext    Importance
PersonID    Freetext    Importance

I have been using following structure for this query. The query works fine (regarding stuff function), but it returns duplicates. For example, if there are five ID’s which are the same, the query returns five rows:

SELECT PersonID, Importance, Freetext = STUFF(
             (SELECT '~' + Freetext
              FROM TABLE t1
              WHERE t1.PersonID = t2.PersonID
              FOR XML PATH (''))
             , 1, 1, '') from TABLE t2
group By PersonID, Importance
order by Importance

How to avoid this?

I would like to have all the freetexts that belong to the same PersonID stuffed together to one single row, ordering the stuffing so that the row that has smallest importance would have it’s freetext stuffed first.

Advertisement

Answer

Just guessing (can’t be sure this gets the desired result without knowing that or the source data), but if you want FOR XML PATH to list the values in order of importance (smallest to largest), you need to move the ORDER BY:

SELECT PersonID, [Freetext] = STUFF(
       (SELECT '~' + Freetext
          FROM dbo.[TABLE] AS t1
          WHERE t1.PersonID = t2.PersonID
          ORDER BY Importance
          FOR XML PATH(''), TYPE
       ).value(N'./text()[1]', N'nvarchar(max)'), 1, 1, '') 
FROM dbo.[Table] AS t2
GROUP BY PersonID;

If you want to include importance somehow, you’ll need to concatenate it with the freetext; it didn’t make sense in the outer query or the group by (the group by importance was actually what was causing your duplicates).

If you need to prevent duplicate values for freetext, you’ll need to add grouping within the subquery. But that gets complicated too; what if you have two identical values for freetext but they were entered with different importance? Do you discard the one with higher or lower importance?

Going back to the original problem, on SQL Server 2017 and above, you can use a much simpler (and more efficient) struct:

SELECT PersonID,  
       [Freetext] = STRING_AGG([Freetext],'~') 
                    WITHIN GROUP (ORDER BY Importance)
  FROM dbo.[Table]
  GROUP BY PersonID;
User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement