Best way to improve (fuzzy) search results for sim…

This question may like a duplicate question but I’m giving it back because I couldn’t perfect a solution.

Sorry for that. Now I’m working on a drug search engine. The user cannot always enter the exact drug name. Depending on the characters entered by the user, I need to get a better result for them. But now it is not. How should I do to do result more accurately? For example, there is information in DB: Анальгин, Цитрамон-П, Лирика-75мг, etc. … (in Cyrillic). If the user enters in the Latin alphabet, the system converts it to the Cyrillic alphabet and searches through it. So, I need that: Even if the user enters words in Цитрамон, cитрамон, ииттрамон or cтрамон, the result should return Цитрамон. Or if user enter: Лирика, ЛЛЛирика, иииррика, Лика or Лирик, the result should return Лирика. Or Аналгин, Анагин ... should returns Анальгин. I should get similar results. I tried several search algorithms and packages (Laravel-Searchy, Laravel Scout, TNTSearch and etc) but to no avail. Now a search is working but the results are returning the same as in a SQL LIKE operator. Also, I tried with Levenstein distance as follow as: My search variable:

 $data = [];
 //dd($request->all());
 $search = _ltc($request->qry); // convert input text between latin and crylic 
  $data = Drug::selectRaw('*, levenshtein(?, `name`) as `diff`', [$search])
                ->havingBetween('diff', [0,4])
                ->limit(10)
                ->get();

And my SQL levenshtein function:

CREATE DEFINER = 'root'@'localhost'
  FUNCTION arzondori.levenshtein(s1 VARCHAR(255), s2 VARCHAR(255) )
  RETURNS INT(11)
 DETERMINISTIC
BEGIN 
DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT; 
DECLARE s1_char CHAR; 
-- max strlen=255 
DECLARE cv0, cv1 VARBINARY(256); 
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0; 
IF s1 = s2 THEN 
 RETURN 0; 
 ELSEIF s1_len = 0 THEN 
 RETURN s2_len; 
 ELSEIF s2_len = 0 THEN 
 RETURN s1_len; 
ELSE 
WHILE j <= s2_len DO 
 SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1; 
END WHILE; 
WHILE i <= s1_len DO 
 SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1; 
WHILE j <= s2_len DO 
  SET c = c + 1; 
  IF s1_char = SUBSTRING(s2, j, 1) THEN  
    SET cost = 0; ELSE SET cost = 1; 
  END IF; 
  SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost; 
  IF c > c_temp THEN SET c = c_temp; END IF; 
    SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1; 
    IF c > c_temp THEN  
      SET c = c_temp;  
    END IF; 
   SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1; 
  END WHILE; 
      SET cv1 = cv0, i = i + 1; 
     END WHILE; 
    END IF; 
   RETURN c; 
  END

I also considered the following options.:

$data = Drug::select("id", "name")
                ->where('name', 'LIKE', "%$search%")
                ->get();
 $data5 = Searchy::search('drugs')
                   ->fields('name')
                   ->query($search)
                   ->select('id', 'name')
                   ->get();

Can anyone help find a solution to this problem? Sorry in advance if the question is a little confusing! Thanks!

Answer

This will be very hard to solve in the way you are trying. You need a full text search engine like Sphinx or elastic search. That has support for languages and fuzzy search. More info about it here: https://en.wikipedia.org/wiki/Full-text_search

I recommend Sphinx: http://sphinxsearch.com/docs/sphinx3.html#features-overview

The documentation is however very heavy. Another option is elasticsearch that has very good documentation.

Try to invent this your self will be very tricky and you will take time to get good result.

Best way to improve (fuzzy) search results for similarities?

Advertisement

Answer