[Date Prev][Date Next] [Chronological] [Thread] [Top]

approx match at CMU



  There was recently a thread of inquiry on the ITS mailing list about 
getting Approximate Matching working in OpenLDAP. I had taken an interest
in this a while ago, and implemented something that was rather specific to
what we needed here at Carnegie Mellon. I'd like to get this submitted in a
more official way, so I'll lay out here a presentation of what I did.


  In simplified terms, slapd keeps an index database of the form "key=>DN".
When attributes are added to the database via ldapadd or ldapmodify, the
new values are run through the phonetic() function to produce a key.  (The
key string is the equivalent of someone mumbling the word, with different
inflections depending on the soundex algorithm used. Thus "Adamson" will
be mumbled as "edmsm".) The key is added to the index database, and the
value for the key is the DN that got added or modified.  Now when a search
is run to approximately match a search value, that search value is run
through the same phonetic() function, and the key produced is sought in the
index database. The index database returns the DN(s) that will match the
approximate search.

  Here are two ways that approximate matching can be done:

  At the simplest level, all values are treated as a single inseparable
string of characters. A value is run through the phonetic() function which
produces exactly one key. Thus the string "Carnegie Mellon" becomes the key
"krmsmlm".  When a search is being performed by a user, they must search on
something close to that entire string, such as "Crimson Molon". If the
search is for just "Mellon", the search will fail, since "Mellon" becomes
"mlm", which does not match "krmsmlm". The same failure will occur for just
"Carnegie".

  At a higher level of complexity, all values are broken up on word boundaries
and produce multiple keys. The string "Carnegie Mellon" will become the two
keys "krms" and "mlm". When a search is performed, the search string is
also broken into multiple words, and a comparison is made between the pieces.
The search string must have at least one of the pieces from the value in
the database, and if the search string has multiple words, they must all be
present in the database value AND appear in the same order. Thus the
database value "Carnegie Mellon" will hit a match when the search string is
any of "Molon", "Crimson Molon", or "Crimson". But it would NOT match
"Molon Crimson" because of ordering, or "Tom Mellon" because "tom" does not
match any part of "Carnegie Mellon".


  We wanted approximate matching here at CMU for Email routing. Mail addressed
to "mark_adamson@cmu.edu" is supposed to be routed to me, because my "cn" 
attribute matches the address (note that this is after sendmail fails several
other searches, like uid=mark_adamson.). Thus the implementation of
approximate matching will break strings up on underscore, a period, or a
space. Also, approximate matching was made more "cn" specific by accounting
for initials: we wanted mail addressed to "m.adamson" to be delivered. For
this case, any single letter word only needs to match the first letter of a
word in a database string.

  To summarize, any of the following will approx match "Mark A. Adamson" :

	STRING      		Notes
	-------------		------
	mark adamson		exact match
	mark_adamson		underscore separator
	mark.adamson		period separator
	mark        		missing one part
	adamson     
	merk adamson		phonetically close
	mark edemsen
	merk edemsen
	edemsen     		phonetically close and missing parts
	m.adamson   		initials 
	m_adamson   
	m.a.adamson 
	a.adamson   
	a.edemsen    

  and the following are examples of non-matching strings:

	STRING          	Notes
	-------------		-----------
	adamson mark		wrong order
	tom adamson  		non matching part


	
This method of allowing pieces to be missing is very useful for cn type
attributes. Values with small words like "the", and names with and without 
initials, can be found.

  Do other people think this is a useful way of doing approx matching? What
is broken, what is missing?   There is space in schema_init.c for two types
of approx matching (directoryString and IA5String), and I could fill in the
two algorithms (strict match and missing pieces & initials match).
  Because I already wrote the two algorithms and have them running here at
CMU, I wrote this message more as a presentation than a proposal. But I am
willing to try to shim the code into what most people need.


  -Mark Adamson
   Carnegie Mellon


P.S.  We had a phonetic() function in the system that LDAP is replacing, so
I took that soundex algorithm and imported into my copy of OpenLDAP. Thus the
phonetic keys shown in this message are from that algorithm, and NOT from
either of those in servers/slapd/phonetic.c   Sorry.