Skip to content
Advertisement

SQL query to replace multiple occurrences with the most frequent ones

I have a products table, which has the product description in two languages, one in English and one in an alternate language.
Let’s say:

Product_Desc Product_Desc_Alt
A A1
A A2
A A1
A A3
B B1
B B2
B B2
C C1

If I do a GROUP BY statement, there are multiple alternate language occurrences for the same product. So, let’s say:

SELECT Product_Desc, COUNT(DISTINCT Product_Desc_Alt) AS CNT
FROM products
GROUP BY Product_Desc
ORDER BY CNT DESC
Product_Desc CNT
A 3
B 2
C 1

I would like to replace the Product_Desc_Alt occurrences with the most frequent ones, so for example I would like the output to be:

Product_Desc Product_Desc_Alt
A A1
A A1
A A1
A A1
B B2
B B2
B B2
C C1

Obviously, if a product has only one alter lang description, just keep that one.

There may be lots of ways to do that, but I can’t think of one.

I am using Azure Databricks so this could also happen with PySpark, but I am interested in doing this the SQL way.
Thanks a lot!

Advertisement

Answer

Is this what you mean?

UPDATE products
SET Product_Desc_Alt = (
  SELECT TOP 1 Product_Desc_Alt
  FROM   products P2
  WHERE  P2.Product_Desc = products.Product_Desc
  GROUP BY Product_Desc_Alt
  ORDER BY COUNT(*) DESC
)
User contributions licensed under: CC BY-SA
7 People found this is helpful
Advertisement