[Date Prev][Date Next] [Chronological] [Thread] [Top]

Re: substring filter/index algorithms

Hallvard B Furuseth wrote:

Does someone on this list know how to implement and tune substring
indexing, or know where to ask?  It looks to me like OpenLDAP's
substring filtering algorithm, and partly the indexing algorithm, can be
significantly improved.  I'd appreciate advice on how to do this - or
advice not to bother:-)

Currently it works like this (thanks to Howard for the explanation):

 A subany index will store a hash of each 4-byte substring of an
 indexed string, e.g. "abcdef" gets index-values for *abcd*, *bcde* and
 *cdef*.  A subinitial index gets initial substrings of 2-4 characters:
 ab*, abc* and abcd*.  Similar for subfinal.  A search for
 (cn=*123456789*) is treated as (&(cn=*1234*)(cn=*3456*)(cn=*5678*))
 for indexing purposes, stepping 2 characters at a time.
 The numbers 4 and 2 can now be configured.

I can see several things to improve with that, but I'd rather not
reinvent an outdated approach:

Well, at this point I'm curious about whether trying something like String B-Trees might work better. Since that's a full-text indexing method, it may be expensive for storage, but it may be on par with what we have today. Also since it uses suffix trees it would provide all of subinit, subfinal, and subany indexing in one mechanism, for any length of string. And, it would provide equality and inequality indexing at the same time. I haven't fully digested the papers on the method yet, but it sounds intriguing.

 -- Howard Chu
 Chief Architect, Symas Corp.       Director, Highland Sun
 http://www.symas.com               http://highlandsun.com/hyc
 Symas: Premier OpenSource Development and Support