Skip to content
Advertisement

Finding Similar People’s Names from Database

I have a table in MySql with names in it. I am trying to, given an input name, find all similar names in the table. I’ve heard a lot about Levenshtien/Damerau–Levenshtein distance, but it doesn’t seem like it would work well for this, I’ll explain my reasoning later.

To elaborate:

  • User inputs a name that could have, say, five words in it. For the sake of this example, say the inputted name is “Juan Manuel Beldad.”
  • I attempt to find similar names in the Database. Say the database includes
    1. “Juan Beldad” (missing middle name)
    2. “Juan Belded” (Belded not Beldad)
    3. “Juan Manuel Sebastian Beldad” (extra middle name)
  • I return the them in the order of which ever one is closer to the input, in this case, that would be: “Juan Beldad” ,”Juan Belded”, “Juan Manuel Sebastian Beldad”

My reasoning for questioning the use of Levenshtien/Damerau–Levenshtein distance in this case is that it wouldn’t be able to detect extra names or missing names well. My understanding of Levenshtien distance is that it finds the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. So, the following would be considered to be the same distance from the original string.

Original string: "Juan Beldad"
Want to find: "Juan Manuel Beldad"
(7 character insertion)
Would also find: "Mike Bell"
(5 character substitution (M-i-k-e-l), 2 character deletion(a-d))

Since both have a distance of 7 edits, “Mike Bell” would be considered an equal distance from “Juan Beldad” as “Juan Manuel Beldad” is.

I was thinking about querying the database removing the middle name(s) on both input and table-side, and then doing Levenshtien/Damerau–Levenshtein distance? Am I overthinking this, and is there a better way to do this?

Advertisement

Answer

There are many possible problems you need to consider when matching names. Some of those are:

  • nicknames (Bob – Robert)
  • typos
  • name swap (last name switched with first name)
  • maiden name
  • initials
  • truncated names
  • phonetically similar name (Jennifer – Jenny)

Damerau–Levenshtein distance is one of the edit distance algorithms you can use. Each algorithm accounts for different operations (character insert, replace, delete, swap etc.) and neither is perfect but each provides a distance between two strings.

You need to decide on how much error is acceptable to you (i.e cutoff for positive matches). The example you gave includes minimum 7 operations. In that many operations, many names will return the same distance.

When comparing names, you should try to make both sides comparable by normalizing them: if one side has only the first letter of first name for example, you should do the same on the other side too so that the edit distance algorithm gives you better result.

Similarly, you can get rid of the middle name if the other side does not have the middle name (and you’re okay to ignore cases where a middle name is entered as first name). But a better alternative is to generate all possible first-last name pairs using all words available in a name and see if any of the pairs will produce a better edit distance. You can also compare each word on its own and find the best word combination with the best score (the trade-off is ignoring the typos at word boundaries).

You should also consider using a phonetic similarity algorithm like Double Metaphone in addition to Damerau–Levenshtein and generate a combined score. Phonetic algorithm are designed for specific language family and tries to determine if both names would sound similar in that language family. The result is not reliable on its own (at least my experience was like that) but this combined with an edit distance algorithm will improve your matching.

To reduce the error rate, additional data elements should be considered like ZIP, DOB etc.

In the end, it is all about trade-offs: your intended use case, your acceptable threshold for positive matches, the quality of your data, time/cost limits, etc. For example: you could simply require the first letter of the first name and the first letter of the last name to be the same in addition to Damerau–Levenshtein distance. That will reduce the pool of false-positives with a trade-off ignoring typos at first letters.

Like in many things nowadays, I think the best result in this area could be achieved through a well-trained machine-learning model. I haven’t worked in this area for a while so I’m not sure what’s out there but you could probably find a good cloud based solution for the best quality matches, for a fee of course, if that’s important to you.

You can see an overview of name matching techniques here as further reading.

User contributions licensed under: CC BY-SA
1 People found this is helpful
Advertisement