|
|
Below you will find a set of SoundEx functions that produce identical results
based on identical inputs. They are currently provided in the following
languages:
These functions are provided
under Berkeley styled
licensing (included above each function). If you'd like to contribute this
function in your own favorite language under your own similar licensing,
please let me know.
The function they perform will return SoundEx codes as produced in the census,
or enhanced by techniques documented in this article. The JavaScript version
of the function is identical to the one used in the example SoundEx converter
form provided below.
Note on internationalization:
These functions are only for words that use characters
from the first seven bits of utf-8 (traditionally
called 7-bit ASCII).
Beware of SoundEx functions on the web that claim to be
international. Some will claim, and truly believe, that
their code is international simply because they've used
the standard (internationalized) library
functions to handle strings. This assumption is
often simply a mistake made by
a real programmer who is an El Niño
(stretching a little for the pun there :-) sorry).
|
|
/*
* v 1.0e NEEDS TESTING
* -----------------------
*
* The following SoundEx function is:
*
* (C) Copyright 2002 - 2013, Creativyst, Inc.
* ALL RIGHTS RESERVED
*
* For more information go to:
*
* or email:
* Support@Creativyst.com
*
* Redistribution and use in source and binary
* forms, with or without modification, are
* permitted provided that the following conditions
* are met:
*
* 1. Redistributions of source code must
* retain the above copyright notice, this
* list of conditions and the following
* disclaimer.
*
* 2. Redistributions in binary form must
* reproduce the above copyright notice,
* this list of conditions and the
* following disclaimer in the
* documentation and/or other materials
* provided with the distribution.
*
* 3. All advertising materials mentioning
* features or use of this software must
* display the following acknowledgement:
* This product includes software developed
* by Creativyst, Inc.
*
* 4. The name of Creativyst, Inc. may not be
* used to endorse or promote products
* derived from this software without
* specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY CREATIVYST CORPORATION
*`AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES,
* INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
* PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
* THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
* INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO,
* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS
* OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
* WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY
* WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
* ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
*
* ------------------
* ------------------
* FUNCTION NOTES:
* 1. To avoid all possibility of overwrites make
* sure *SoundEx points to a buffer with at least
* 11 bytes of storage.
*
* 2. This function is for 7/8-bit ASCII characters.
* Modifications are required for UTF16/32, or for
* anything other than the first 7-bits of utf-8.
*
* 3. For those embedded guys who will understand this:
* This is a true library-grade (i.e. re-usable) function,
* meaning it has no dependencies on outside functions
* and requires no non-standard libraries be linked in
* order for it to work. In this case, since it doesn't
* even require the standard C library, it is what C99
* (I think) calls a: strictly conforming freestanding
* function.
*
*/
int SoundEx(char *SoundEx,
char *WordString,
int LengthOption,
int CensusOption)
{
int InSz = 31;
char WordStr[32]; /* one bigger than InSz */
int SoundExLen, WSLen, i;
char FirstLetter, *p, *p2;
SoundExLen = WSLen = 0;
SoundEx[0] = 0;
if(CensusOption) {
LengthOption = 4;
}
if(LengthOption) {
SoundExLen = LengthOption;
}
if(SoundExLen > 10) {
SoundExLen = 10;
}
if(SoundExLen < 4) {
SoundExLen = 4;
}
if(!WordString) {
return(0);
}
/* Copy WordString to WordStr
* without using funcs from other
* libraries.
*/
for(p = WordString,p2 = WordStr,i = 0;(*p);p++,p2++,i++) {
if(i >= InSz) break;
(*p2) = (*p);
}
(*p2) = 0;
/* Convert WordStr to
* upper-case, without using funcs
* from other libraries
*/
for(p = WordStr;(*p);p++) {
if( (*p) >= 'a' && (*p) <= 'z' ) {
(*p) -= 0x20;
}
}
/* convert all non-alpha
* chars to spaces
*/
for(p = WordStr;(*p);p++) {
if( (*p) < 'A' || (*p) > 'Z' ) {
(*p) = ' ';
}
}
/* Remove leading spaces
*/
for(i = 0, p = p2 = WordStr;(*p);p++) {
if(!i) {
if( (*p) != ' ' ) {
(*p2) = (*p);
p2++;
i++;
}
}
else {
(*p2) = (*p);
p2++;
}
}
(*p2) = 0;
/* Get length of WordStr
*/
for(i = 0,p = WordStr;(*p);p++) i++;
/* Remove trailing spaces
*/
for(;i;i--) {
if(WordStr[i] == ' ') {
WordStr[i] = 0;
}
else {
break;
}
}
/* Get length of WordStr
*/
for(WSLen = 0,p = WordStr;(*p);p++) WSLen++;
if(!WSLen) {
return(0);
}
/* Perform our own multi-letter
* improvements
*
* underscore placeholders (_) will be
* removed below.
*/
if(!CensusOption) {
if(WordStr[0] == 'P' && WordStr[1] == 'S') {
WordStr[0] = '_';
}
if(WordStr[0] == 'P' && WordStr[1] == 'F') {
WordStr[0] = '_';
}
/* v1.0e gh is G-sound at start of word
*/
if(WordStr[0] == 'G' && WordStr[1] == 'H') {
WordStr[1] = '_';
}
for(i = 0;i < WSLen;i++) {
if(WordStr[i] == 'D' && WordStr[i+1] == 'G') {
WordStr[i] = '_';
i++;
continue;
}
if( WordStr[i] == 'G' && WordStr[i+1] == 'H') {
WordStr[i] = '_';
i++;
continue;
}
if(WordStr[i] == 'K' && WordStr[i+1] == 'N') {
WordStr[i] = '_';
i++;
continue;
}
if(WordStr[i] == 'G' && WordStr[i+1] == 'N') {
WordStr[i] = '_';
i++;
continue;
}
if(WordStr[i] == 'M' && WordStr[i+1] == 'B') {
WordStr[i+1] = '_';
i++;
continue;
}
if(WordStr[i] == 'P' && WordStr[i+1] == 'H') {
WordStr[i] = 'F';
WordStr[i+1] = '_';
i++;
continue;
}
if(WordStr[i] == 'T' &&
WordStr[i+1] == 'C' &&
WordStr[i+2] == 'H'
) {
WordStr[i] = '_';
i++; i++;
continue;
}
if(WordStr[i] == 'M' && WordStr[i+1] == 'P'
&& (WordStr[i+2] == 'S' ||
WordStr[i+2] == 'T' ||
WordStr[i+2] == 'Z')
) {
WordStr[i+1] = '_';
i++;
}
}
} /* end if(!CensusOption) */
/* squeeze out underscore characters
* added as a byproduct of above process
* (only needed in c styled replace)
*/
for(p = p2 = WordStr;(*p);p++) {
(*p2) = (*p);
if( (*p2) != '_' ) {
p2++;
}
}
(*p2) = 0;
/* This must be done AFTER our
* multi-letter replacements
* since they could change
* the first letter
*/
FirstLetter = WordStr[0];
/* In case we're in CensusOption
* 1 and the word starts with
* an 'H' or 'W'
* (v1.0c djr: add test for H or W)
*/
if(FirstLetter == 'H' || FirstLetter == 'W') {
WordStr[0] = '-';
}
/* In properly done census
* SoundEx, the H and W will
* be squeazed out before
* performing the test
* for adjacent digits
* (this differs from how
* 'real' vowels are handled)
*/
if(CensusOption == 1) {
for(p = &(WordStr[1]);(*p);p++) {
if((*p) == 'H' || (*p) == 'W') {
(*p) = '.';
}
}
}
/* Perform classic SoundEx
* replacements.
*/
for(p = WordStr;(*p);p++) {
if( (*p) == 'A' ||
(*p) == 'E' ||
(*p) == 'I' ||
(*p) == 'O' ||
(*p) == 'U' ||
(*p) == 'Y' ||
(*p) == 'H' ||
(*p) == 'W'
){
(*p) = '0'; /* zero */
}
if( (*p) == 'B' ||
(*p) == 'P' ||
(*p) == 'F' ||
(*p) == 'V'
){
(*p) = '1';
}
if( (*p) == 'C' ||
(*p) == 'S' ||
(*p) == 'G' ||
(*p) == 'J' ||
(*p) == 'K' ||
(*p) == 'Q' ||
(*p) == 'X' ||
(*p) == 'Z'
){
(*p) = '2';
}
if( (*p) == 'D' ||
(*p) == 'T'
){
(*p) = '3';
}
if( (*p) == 'L' ) {
(*p) = '4';
}
if( (*p) == 'M' ||
(*p) == 'N'
){
(*p) = '5';
}
if( (*p) == 'R' ) {
(*p) = '6';
}
}
/* soundex replacement loop done */
/* In properly done census
* SoundEx, the H and W will
* be squezed out before
* performing the test
* for adjacent digits
* (this differs from how
* 'real' vowels are handled)
*/
if(CensusOption == 1) {
/* squeeze out dots
*/
for(p = p2 = &WordStr[1];(*p);p++) {
(*p2) = (*p);
if( (*p2) != '.' ) {
p2++;
}
}
(*p2) = 0;
}
/* squeeze out extra equal adjacent digits
* (don't include first letter)
* v1.0c djr (now includes first letter)
*/
for(p = p2 = &(WordStr[0]);(*p);p++) {
(*p2) = (*p);
if( (*p2) != p[1] ) {
p2++;
}
}
(*p2) = 0;
/* squeeze out spaces and zeros
* Leave the first letter code
* to be replaced below.
* (In case it made a zero)
*/
for(p = p2 = &WordStr[1];(*p);p++) {
(*p2) = (*p);
if( (*p2) != ' ' && (*p2) != '0' ) {
p2++;
}
}
(*p2) = 0;
/* Get length of WordStr
*/
for(WSLen = 0,p = WordStr;(*p);p++) WSLen++;
/* Right pad with zero characters
*/
for(i = WSLen;i < SoundExLen;i++ ) {
WordStr[i] = '0';
}
/* Size to taste
*/
WordStr[SoundExLen] = 0;
/* Replace first digit with
* first letter.
*/
WordStr[0] = FirstLetter;
/* Copy WordStr to SoundEx
*/
for(p2 = SoundEx,p = WordStr;(*p);p++,p2++) {
(*p2) = (*p);
}
(*p2) = 0;
return(SoundExLen);
}
|
|
/*
* v 1.0e NEEDS TESTING
* -----------------------
*
* The following SoundEx function is:
*
* (C) Copyright 2002 - 2013, Creativyst, Inc.
* ALL RIGHTS RESERVED
*
* For more information go to:
*
* or email:
* Support@Creativyst.com
*
* Redistribution and use in source and binary
* forms, with or without modification, are
* permitted provided that the following conditions
* are met:
*
* 1. Redistributions of source code must
* retain the above copyright notice, this
* list of conditions and the following
* disclaimer.
*
* 2. Redistributions in binary form must
* reproduce the above copyright notice,
* this list of conditions and the
* following disclaimer in the
* documentation and/or other materials
* provided with the distribution.
*
* 3. All advertising materials mentioning
* features or use of this software must
* display the following acknowledgement:
* This product includes software developed
* by Creativyst, Inc.
*
* 4. The name of Creativyst, Inc. may not be
* used to endorse or promote products
* derived from this software without
* specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY CREATIVYST CORPORATION
* ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES,
* INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
* PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
* THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
* INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO,
* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS
* OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
* WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY
* WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
* ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
*/
function SoundEx(WordString, LengthOption, CensusOption)
{
var TmpStr;
var WordStr = "";
var CurChar;
var LastChar;
var SoundExLen = 10;
var WSLen;
var FirstLetter;
if(CensusOption) {
LengthOption = 4;
}
if(LengthOption != undefined) {
SoundExLen = LengthOption;
}
if(SoundExLen > 10) {
SoundExLen = 10;
}
if(SoundExLen < 4) {
SoundExLen = 4;
}
if(!WordString) {
return("");
}
WordString = WordString.toUpperCase();
/* Clean and tidy
*/
WordStr = WordString;
WordStr = WordStr.replace(/[^A-Z]/gi, " "); // rpl non-chars w space
WordStr = WordStr.replace(/^\s*/g, ""); // remove leading space
WordStr = WordStr.replace(/\s*$/g, ""); // remove trailing space
/* Some of our own improvements
*/
if(!CensusOption) {
/* v1.0e: GH at begining of word has G-sound (e.g., ghost)
*/
WordStr = WordStr.replace(/^GH/g, "G"); // Chng leadng GH to G
WordStr = WordStr.replace(/DG/g, "G"); // Change DG to G
WordStr = WordStr.replace(/GH/g, "H"); // Change GH to H
WordStr = WordStr.replace(/GN/g, "N"); // Change GN to N
WordStr = WordStr.replace(/KN/g, "N"); // Change KN to N
WordStr = WordStr.replace(/PH/g, "F"); // Change PH to F
WordStr =
WordStr.replace(/MP([STZ])/g, "M$1"); // MP if fllwd by ST|Z
WordStr = WordStr.replace(/^PS/g, "S"); // Chng leadng PS to S
WordStr = WordStr.replace(/^PF/g, "F"); // Chng leadng PF to F
WordStr = WordStr.replace(/MB/g, "M"); // Chng MB to M
WordStr = WordStr.replace(/TCH/g, "CH"); // Chng TCH to CH
}
/* The above improvements may
* have changed this first letter
*/
FirstLetter = WordStr.substr(0,1);
/* in case 1st letter is
* an H or W and we're in
* CensusOption = 1
*/
if(FirstLetter == "H" || FirstLetter == "W") {
TmpStr = WordStr.substr(1);
WordStr = "-";
WordStr += TmpStr;
}
/* In properly done census
* SoundEx the H and W will
* be squezed out before
* performing the test
* for adjacent digits
* (this differs from how
* 'real' vowels are handled)
*/
if(CensusOption == 1) {
WordStr = WordStr.replace(/[HW]/g, ".");
}
/* Begin Classic SoundEx
*/
WordStr = WordStr.replace(/[AEIOUYHW]/g, "0");
WordStr = WordStr.replace(/[BPFV]/g, "1");
WordStr = WordStr.replace(/[CSGJKQXZ]/g, "2");
WordStr = WordStr.replace(/[DT]/g, "3");
WordStr = WordStr.replace(/[L]/g, "4");
WordStr = WordStr.replace(/[MN]/g, "5");
WordStr = WordStr.replace(/[R]/g, "6");
/* Properly done census:
* squeze H and W out
* before doing adjacent
* digit removal.
*/
if(CensusOption == 1) {
WordStr = WordStr.replace(/\./g, "");
}
/* Remove extra equal adjacent digits
*/
WSLen = WordStr.length;
LastChar = "";
TmpStr = "";
// removed v10c djr: TmpStr = "-"; /* rplcng skipped first char */
for(i = 0; i < WSLen; i++) {
CurChar = WordStr.charAt(i);
if(CurChar == LastChar) {
TmpStr += " ";
}
else {
TmpStr += CurChar;
LastChar = CurChar;
}
}
WordStr = TmpStr;
WordStr = WordStr.substr(1); /* Drop first letter code */
WordStr = WordStr.replace(/\s/g, ""); /* remove spaces */
WordStr = WordStr.replace(/0/g, ""); /* remove zeros */
WordStr += "0000000000"; /* pad with zeros on right */
WordStr = FirstLetter + WordStr; /* Add first letter of word */
WordStr = WordStr.substr(0,SoundExLen); /* size to taste */
return(WordStr);
}
|
|
# v 1.0e NEEDS TESTING
# -----------------------
# The following SoundEx function is:
#
# (C) Copyright 2002 - 2013, Creativyst, Inc.
# ALL RIGHTS RESERVED
#
# For more information go to:
#
# or email:
# Support@Creativyst.com
#
# Redistribution and use in source and binary
# forms, with or without modification, are
# permitted provided that the following conditions
# are met:
#
# 1. Redistributions of source code must
# retain the above copyright notice, this
# list of conditions and the following
# disclaimer.
#
# 2. Redistributions in binary form must
# reproduce the above copyright notice,
# this list of conditions and the
# following disclaimer in the
# documentation and/or other materials
# provided with the distribution.
#
# 3. All advertising materials mentioning
# features or use of this software must
# display the following acknowledgement:
# This product includes software developed
# by Creativyst, Inc.
#
# 4. The name of Creativyst, Inc. may not be
# used to endorse or promote products
# derived from this software without
# specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY CREATIVYST CORPORATION
# ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES,
# INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
# PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
# THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
# INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS
# OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
# HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
# WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY
# WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
# ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#
sub SoundEx
{
my($WordString, $LengthOption, $CensusOption) = @_;
my($WordStr, $CurChar, $LastChar, $SoundExLen);
my($WSLen, $FirstLetter, $TmpStr);
if($CensusOption) {
$LengthOption = 4;
}
if($LengthOption) {
$SoundExLen = $LengthOption;
}
if($SoundExLen > 10) {
$SoundExLen = 10;
}
if($SoundExLen < 4) {
$SoundExLen = 4;
}
if(!$WordString) {
return("");
}
$WordString = uc($WordString);
# Clean and tidy
#
$WordStr = $WordString;
$WordStr =~ s/[^A-Z]/ /sig; # replace non-chars with space
$WordStr =~ s/^\s*//sg; # remove leading space
$WordStr =~ s/\s*$//sg; # remove trailing space
# Some of our own improvements
#
if(!$CensusOption) {
# v1.0e: GH at start of word is G-sound (e.g., ghost)
#
$WordStr =~ s/^GH/G/sg; # Change leading GS to G
$WordStr =~ s/DG/G/sg; # Change DG to G
$WordStr =~ s/GH/H/sg; # Change GH to H
$WordStr =~ s/KN/N/sg; # Change KN to N
$WordStr =~ s/GN/N/sg; # Change GN to N
$WordStr =~ s/MB/M/sg; # Change MB to M
$WordStr =~ s/PH/F/sg; # Change PH to F
$WordStr =~ s/TCH/CH/sg; # Change TCH to CH
$WordStr =~ s/MP([STZ])/M$1/sg; # MP if follwd by S|T|Z
$WordStr =~ s/^PS/S/sg; # Change leading PS to S
$WordStr =~ s/^PF/F/sg; # Change leading PF to F
}
# Done here because the
# above improvements could
# change this first letter
#
$FirstLetter = substr($WordStr,0,1);
# in case 1st letter is
# an H or W and we're in
# CensusOption = 1
# (add test for 'H'/'W' v1.0c djr)
#
if($FirstLetter eq "H" || $FirstLetter eq "W") {
$TmpStr = substr($WordStr,1);
$WordStr = "-$TmpStr";
}
# In properly done census
# SoundEx: the H and W will
# be squeezed out before
# performing the test for
# adjacent digits
# (this differs from how
# the 'real' vowels are
# handled)
#
if($CensusOption == 1) {
$WordStr =~ s/[HW]/\./sg;
}
# Begin Classic SoundEx
#
$WordStr =~ s/[AEIOUYHW]/0/sg;
$WordStr =~ s/[BPFV]/1/sg;
$WordStr =~ s/[CSGJKQXZ]/2/sg;
$WordStr =~ s/[DT]/3/sg;
$WordStr =~ s/L/4/sg;
$WordStr =~ s/[MN]/5/sg;
$WordStr =~ s/R/6/sg;
# Properly done census:
# squeeze H and W out
# before doing adjacent
# digit removal.
#
if($CensusOption == 1) {
$WordStr =~ s/\.//sg;
}
# Remove extra equal adjacent digits
#
$WSLen = length($WordStr);
$LastChar = "";
# v1.0c rmv: $TmpStr = "-"; # rplc skipped 1st char
$TmpStr = "";
for($i = 0; $i < $WSLen;$i++) { # v1.0c now org-0
$CurChar = substr($WordStr,$i,1);
if($CurChar eq $LastChar) {
$TmpStr .= " ";
}
else {
$TmpStr .= $CurChar;
$LastChar = $CurChar;
}
}
$WordStr = $TmpStr;
$WordStr = substr($WordStr,1); # Drop first ltr code
$WordStr =~ s/\s//sg; # remove spaces
$WordStr =~ s/0//sg; # remove zeros
$WordStr .= "0000000000"; # pad w/0s on rght
$WordStr = "$FirstLetter$WordStr"; # Add 1st ltr of wrd
$WordStr = substr($WordStr,0,$SoundExLen); # size to taste
return($WordStr);
}
|
|
'
' v 1.0d NEEDS TESTING
' -----------------------
'
' The following SoundEx function is:
'
' (C) Copyright 2002 - 2013, Creativyst, Inc.
' ALL RIGHTS RESERVED
'
' For more information go to:
'
' or email:
' Support@Creativyst.com
'
' Redistribution and use in source and binary
' forms, with or without modification, are
' permitted provided that the following conditions
' are met:
'
' 1. Redistributions of source code must
' retain the above copyright notice, this
' list of conditions and the following
' disclaimer.
'
' 2. Redistributions in binary form must
' reproduce the above copyright notice,
' this list of conditions and the
' following disclaimer in the
' documentation and/or other materials
' provided with the distribution.
'
' 3. All advertising materials mentioning
' features or use of this software must
' display the following acknowledgement:
' This product includes software developed
' by Creativyst, Inc.
'
' 4. The name of Creativyst, Inc. may not be
' used to endorse or promote products
' derived from this software without
' specific prior written permission.
'
' THIS SOFTWARE IS PROVIDED BY CREATIVYST CORPORATION
' ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES,
' INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
' WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
' PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
' THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
' INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
' DAMAGES (INCLUDING, BUT NOT LIMITED TO,
' PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS
' OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
' HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
' WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
' (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY
' WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
' ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
'
'
'
Function SoundEx( _
ByVal WordString As String, _
ByVal LengthOption As Integer, _
ByVal CensusOption As Integer _
) As String
Dim WordStr As String
Dim b, b2, b3, SoundExLen, FirstLetter As String
Dim i As Integer
' Sanity
'
If (CensusOption > 0) Then
LengthOption = 4
End If
If (LengthOption > 0) Then
SoundExLen = LengthOption
End If
If (SoundExLen > 10) Then
SoundExLen = 10
End If
If (SoundExLen < 4) Then
SoundExLen = 4
End If
If (Len(WordString) < 1) Then
Exit Function
End If
' Copy to WordStr
' and UpperCase
'
WordStr = UCase(WordString)
' Convert all non-alpha
' chars to spaces. (thanks John)
'
For i = 1 To Len(WordStr)
b = Mid(WordStr, i, 1)
If (Not (b Like "[A-Z]")) Then
WordStr = Replace(WordStr, b, " ")
End If
Next i
' Remove leading and
' trailing spaces
'
WordStr = Trim(WordStr)
' sanity
'
If (Len(WordStr) < 1) Then
Exit Function
End If
' Perform our own multi-letter
' improvements
'
' double letters will be effectively
' removed in a later step.
'
If (CensusOption < 1) Then
b = Mid(WordStr, 1, 1)
b2 = Mid(WordStr, 2, 1)
If (b = "P" And b2 = "S") Then
WordStr = Replace(WordStr, "PS", "S", 1, 1)
End If
If (b = "P" And b2 = "F") Then
WordStr = Replace(WordStr, "PF", "F", 1, 1)
End If
' v1.0e if gh at start of word G-sound (e.g., ghost)
'
If (b = "G" And b2 = "H") Then
WordStr = Replace(WordStr, "GH", "G", 1, 1)
End If
WordStr = Replace(WordStr, "DG", "_G")
WordStr = Replace(WordStr, "GH", "_H")
WordStr = Replace(WordStr, "KN", "_N")
WordStr = Replace(WordStr, "GN", "_N")
WordStr = Replace(WordStr, "MB", "M_")
WordStr = Replace(WordStr, "PH", "F_")
WordStr = Replace(WordStr, "TCH", "_CH")
WordStr = Replace(WordStr, "MPS", "M_S")
WordStr = Replace(WordStr, "MPT", "M_T")
WordStr = Replace(WordStr, "MPZ", "M_Z")
End If
' end if(Not CensusOption)
'
' Sqeeze out the extra _ letters
' from above (not strictly needed
' in VB but used in C code)
'
WordStr = Replace(WordStr, "_", "")
' This must be done AFTER our
' multi-letter replacements
' since they could change
' the first letter
'
FirstLetter = Mid(WordStr, 1, 1)
' in case first letter is
' an h, a w ...
' we'll change it to something
' that doesn't match anything
'
If (FirstLetter = "H" Or FirstLetter = "W") Then
b = Mid(WordStr, 2)
WordStr = "-" + b
End If
' In properly done census
' SoundEx, the H and W will
' be squezed out before
' performing the test
' for adjacent digits
' (this differs from how
' 'real' vowels are handled)
'
If (CensusOption = 1) Then
WordStr = Replace(WordStr, "H", ".")
WordStr = Replace(WordStr, "W", ".")
End If
' Perform classic SoundEx
' replacements
' Here, we use ';' instead of zero '0'
' because of MS strangeness with leading
' zeros in some applications.
'
WordStr = Replace(WordStr, "A", ";")
WordStr = Replace(WordStr, "E", ";")
WordStr = Replace(WordStr, "I", ";")
WordStr = Replace(WordStr, "O", ";")
WordStr = Replace(WordStr, "U", ";")
WordStr = Replace(WordStr, "Y", ";")
WordStr = Replace(WordStr, "H", ";")
WordStr = Replace(WordStr, "W", ";")
WordStr = Replace(WordStr, "B", "1")
WordStr = Replace(WordStr, "P", "1")
WordStr = Replace(WordStr, "F", "1")
WordStr = Replace(WordStr, "V", "1")
WordStr = Replace(WordStr, "C", "2")
WordStr = Replace(WordStr, "S", "2")
WordStr = Replace(WordStr, "G", "2")
WordStr = Replace(WordStr, "J", "2")
WordStr = Replace(WordStr, "K", "2")
WordStr = Replace(WordStr, "Q", "2")
WordStr = Replace(WordStr, "X", "2")
WordStr = Replace(WordStr, "Z", "2")
WordStr = Replace(WordStr, "D", "3")
WordStr = Replace(WordStr, "T", "3")
WordStr = Replace(WordStr, "L", "4")
WordStr = Replace(WordStr, "M", "5")
WordStr = Replace(WordStr, "N", "5")
WordStr = Replace(WordStr, "R", "6")
'
' End Clasic SoundEx replacements
'
' In properly done census
' SoundEx, the H and W will
' be squezed out before
' performing the test
' for adjacent digits
' (this differs from how
' 'real' vowels are handled)
'
If (CensusOption = 1) Then
WordStr = Replace(WordStr, ".", "")
End If
' squeeze out extra equal adjacent digits
' (don't include first letter)
'
b = ""
b2 = ""
' remove from v1.0c djr: b3 = Mid(WordStr, 1, 1)
b3 = ""
For i = 1 To Len(WordStr) ' i=1 (not 2) in v1.0c
b = Mid(WordStr, i, 1)
b2 = Mid(WordStr, (i + 1), 1)
If (Not (b = b2)) Then
b3 = b3 + b
End If
Next i
WordStr = b3
If (Len(WordStr) < 1) Then
Exit Function
End If
' squeeze out spaces and zeros (;)
' Leave the first letter code
' to be replaced below.
' (In case it made a zero)
'
WordStr = Replace(WordStr, " ", "")
b = Mid(WordStr, 1, 1)
WordStr = Replace(WordStr, ";", "")
If (b = ";") Then ' only if it got removed above
WordStr = b + WordStr
End If
' Right pad with zero characters
'
b = String(SoundExLen, "0")
WordStr = WordStr + b
' Replace first digit with
' first letter
'
WordStr = Mid(WordStr, 2)
WordStr = FirstLetter + WordStr
' Size to taste
'
WordStr = Mid(WordStr, 1, SoundExLen)
' Copy WordStr to SoundEx
'
SoundEx = WordStr
End Function
|
The following links provide resources for those who'd like to learn more about
the SoundEx algorithm, it's limitations, enhancements, and uses.
The following form uses the JavaScript SoundEx function presented above to
produce a SoundEx code for any word you enter into the field labeled
"Input Word:". It returns the SoundEx codes for three different
algorithms in the fields below. You may also enter a size for the output code
within limits (4 to 10 characters).
To test differences in the enhanced SoundEx option, use words and names like
knightridder, psychology, and Pflanders. The word
Knight will demonstrate two different enhancements as you type
it. To see the differences between the two census options, try
Ashcroft.
If you've written a SoundEx function in another
language and would like to share it as part of this article, please contact me. If used,
your code will be attributed to you with a link to your site.
|
|