Soundex and Double Metaphone: Explanation of New Search Methods at WorldVitalRecords.com


The following is an explanation of the logic behind the search engines used by WorldVitalRecords.com’s Quick and Advanced Search. Because the definitions of these algorithms are technical, most of the article is taken directly from the original reference materials, with endnotes identifying the sources.

Soundex

“Soundex and Metaphone belong to a class of algorithms usually known as ‘phonetic encoding’ or ‘sound alike’ algorithms – a heuristic type of fuzzy matching. They input a word or name, and return an encoded key, which should be the same for any words that are pronounced similarly – allowing for a reasonable amount of fuzziness.”

“Soundex was developed by Robert Russell and Margaret Odell and patented in 1918 and 1922 (U.S. Patent 1,261,167 and U.S. Patent 1,435,663). A variation called American Soundex was used in the 1930s for a retrospective analysis of the US censuses from 1890 through 1920. The Soundex code for a name consists of a letter followed by three numbers: the letter is the first letter of the name, and the numbers encode the remaining consonants. Similar sounding consonants share the same number so, for example, the labial B, F, P and V are all encoded as 1. Vowels can affect the coding, but are never coded directly unless they appear at the start of the name.

The exact algorithm is as follows:
1. Retain the first letter of the string
2. Remove all occurrences of the following letters, unless it is the first letter: a, e, h, i, o, u, w, y
3. Assign numbers to the remaining letters (after the first) as follows:
   - b, f, p, v = 1
   - c, g, j, k, q, s, x, z = 2
   - d, t = 3
   - l = 4
   - m, n = 5
   - r = 6

4. If two or more letters with the same number were adjacent in the original name (before step 1), or adjacent except for any intervening h and w (American census only), then omit all but the first.

5. Return the first four characters, right-padding with zeroes if there are fewer than four.

The National Archives and Records Administration (NARA) maintains the rule set for the official implementation of Soundex used by the U.S. Government.

Using this algorithm, both “Robert” and “Rupert” return the same string “R163” while “Rubin” yields “R150”.”
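
The steps quoted above can be sketched in Python. This is only an illustrative sketch, not the code used by WorldVitalRecords.com and not the official NARA rule set: the function name soundex, the CODES table, and the american flag are assumptions made for this example, and the function makes a single pass over the name rather than following the numbered steps literally.

```python
# Illustrative sketch of the Soundex steps quoted above.
# The names CODES, soundex and american are assumptions for this example.

# Build the letter-to-digit table from the consonant groups in step 3.
CODES = {}
for letters, digit in (
    ("bfpv", "1"),
    ("cgjkqsxz", "2"),
    ("dt", "3"),
    ("l", "4"),
    ("mn", "5"),
    ("r", "6"),
):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str, american: bool = True) -> str:
    """Return a four-character Soundex code, or "" if name has no letters a-z."""
    letters = [c for c in name.lower() if "a" <= c <= "z"]
    if not letters:
        return ""

    first = letters[0]
    result = first.upper()          # step 1: keep the first letter
    prev = CODES.get(first, "")     # the first letter's code still counts for step 4

    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code:
            if code != prev:        # step 4: skip repeats of the same code
                result += code
            prev = code
        elif american and ch in "hw":
            # American census rule: h and w do not separate equal codes.
            continue
        else:
            # Vowels and y (and h, w in the other variant) are dropped,
            # but they do separate letters that share a code.
            prev = ""

    return (result + "000")[:4]     # step 5: pad with zeroes, keep four characters


print(soundex("Robert"))   # R163
print(soundex("Rupert"))   # R163
print(soundex("Rubin"))    # R150
```

With american=False, an intervening h or w separates letters just as a vowel does, so in this sketch “Ashcraft” encodes as A226 rather than A261.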

Double Metaphone

Metaphone was developed by Lawrence Philips as a response to deficiencies in the Soundex algorithm. It is more accurate than Soundex because it “understands” the basic rules of English pronunciation. … The original author later produced a new version of the algorithm, which he named Double Metaphone, that produces more accurate results than the original algorithm. Its implementation was described in the June 2000 issue of C/C++ Users Journal.

The algorithm produces keys as its output. Similar-sounding words share the same keys, which are of variable length.

It is called “Double” because it can return both a primary and a secondary code for a string; this accounts for some ambiguous cases as well as for multiple variants of surnames with common ancestry. For example, encoding the name “Smith” yields a primary code of SM0 and a secondary code of XMT, while the name “Schmidt” yields a primary code of XMT and a secondary code of SMT; both have XMT in common.
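
How the two keys are compared can be illustrated with a short Python sketch. It does not implement Double Metaphone itself: the key pairs are simply copied from the examples in this article, and the names KEYS and shared_keys are hypothetical. It assumes that two names count as a match when any key of one equals any key of the other, which is a common way to use the pair but may not be exactly how WorldVitalRecords.com compares them.

```python
# Primary and secondary keys copied from the article's examples.
# They are hard-coded here, not computed by a Double Metaphone implementation.
KEYS = {
    "Smith": ("SM0", "XMT"),
    "Schmidt": ("XMT", "SMT"),
    "Kuczewski": ("KSSK", "KXFS"),
}


def shared_keys(name_a: str, name_b: str) -> set[str]:
    """Return the keys two names have in common (an empty set means no match)."""
    return set(KEYS[name_a]) & set(KEYS[name_b])


print(shared_keys("Smith", "Schmidt"))    # {'XMT'}
print(shared_keys("Smith", "Kuczewski"))  # set()
```

Comparing both keys of each name, rather than only the primary keys, is what lets “Smith” (primary SM0) find “Schmidt” (primary XMT).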

Double Metaphone tries to account for myriad irregularities in English of Slavic, Germanic, Celtic, Greek, French, Italian, Spanish, Chinese, and other origin. Thus it uses a much more complex ruleset for coding than its predecessor; for example, it tests for approximately 100 different contexts of the use of the letter C alone.

That is why [he] decided to give back two keys for words and names that can be plausibly pronounced more than one way, and that’s why the new version is called Double Metaphone. In the case of Kuczewski, there are two ambiguous sounds, so in the second key returned I make both of the changes: “Kuczewski” now comes back as KSSK for the American version, “Kuhzooski,” as well as KXFS for “Kutchefski.” (I use ‘X’ to represent the “sh” sound, and ‘0’, zero, to represent “th,” as in original Metaphone.) Both versions are likely to be heard in the United States, so it is necessary to try both to be really sure! In the end, however, I find that only about 10% of [the] sample database of the 100,000 most common American surnames come out with more than one key…

The current version of Double Metaphone [used at WorldVitalRecords.com] accounts for alternate pronunciations of names from Italian, Spanish, and French, and from various Germanic and Slavic languages. A few exceptions from English names and words, such as “Thames” and “sugar,” are accounted for also.

The appropriate contexts for some alternate pronunciations, such as “Zivis-ky” for “Ziwicki,” and “Ro-jay” for a French pronunciation of “Roger,” are too difficult to calculate, and so are not given.

[Philips] tried to give back the pronunciation most likely to be heard in the U.S. in the first key, and the native sound in the second key. However, for names like “Artois,” most Americans will correctly drop the ‘s,’ so it comes back as ART first. Most [A]mericans are also likely to give a correct Spanish reading for “Jose” at first glance.

Since many families have changed their last names to a more anglicized spelling, [he has] included “etymological” variations, especially useful for genealogical research, a common application of both Soundex and Metaphone. To this end, consonant groups such as “Schm-” and “-wicz” are automatically given back with the common anglicizations, although they do not really sound the same according to the usual standards of phonetic similarity. But they will allow you to match “Smith” to “Schmidt,” “Filipowicz” to “Philipowitz,” and “Jablonski” to “Yablonsky.”

The idea behind using both Soundex and Double Metaphone in matching surnames is to allow for more thorough searching of the WorldVitalRecords.com website. No genealogist can say that the surname(s) they are searching for have been spelled the same way since the surname first came into use. English spelling was not standardized until Noah Webster’s A Compendious Dictionary of the English Language in 1806, and through various volumes of that dictionary, English has further evolved.

Using the Soundex and Double Metaphone algorithms for searching allows for the changes to American English over time, including the adoption of words from other languages, so that WorldVitalRecords.com can take a truly international approach to searching.

 

Footnotes:

Lawrence Philips, “The Double Metaphone Search Algorithm,” Dr. Dobb’s Portal: C++, 15 Apr 2003, 2. [Accessed online 2 Mar 2007.]
“Soundex,” Wikipedia.com (Wikimedia Foundation: St. Petersburg, Florida), 18 Jan 2007. [Accessed online 2 Mar 2007.]
Ibid.
Lawrence Philips, “The Double Metaphone Search Algorithm,” Dr. Dobb’s Portal: C++.
