[Date Prev][Date Next] [Chronological] [Thread] [Top]

Re: substring filter/index algorithms



At 08:31 AM 10/15/2004, Hallvard B Furuseth wrote:
>Does someone on this list know how to implement and tune substring
>indexing, or know where to ask?  It looks to me like OpenLDAP's
>substring filtering algorithm, and partly the indexing algorithm, can be
>significantly improved.  I'd appreciate advice on how to do this - or
>advice not to bother:-)

My advice would be, in the voice of Elmur Fudd, "vewy vewy carefully" :-).

The problem, aside from being very use (data and query) dependent,
there are numerous factors involved and the interplay between
these factors can be very subtle and quite complex.

>Currently it works like this (thanks to Howard for the explanation):
>
>  A subany index will store a hash of each 4-byte substring of an
>  indexed string, e.g. "abcdef" gets index-values for *abcd*, *bcde* and
>  *cdef*.  A subinitial index gets initial substrings of 2-4 characters:
>  ab*, abc* and abcd*.  Similar for subfinal.  A search for
>  (cn=*123456789*) is treated as (&(cn=*1234*)(cn=*3456*)(cn=*5678*))
>  for indexing purposes, stepping 2 characters at a time.
>  The numbers 4 and 2 can now be configured.
>
>I can see several things to improve with that, but I'd rather not
>reinvent an outdated approach:
>
>- Unless I've missed something, the final "9" in the filter above
>  is not used for indexing.
>
>  It would be better to include (cn=*6789*) (&...) filter above, in
>  addition to or instead of the final term.

First, "better" is to vague.

Including (cn=*6789*) lookup may further reduce the candidate set,
but at the expense of doing another index read.  Whether that lookup
is worth doing, well, depends.  I would guess that, in many cases,
it would be worth doing.

>- With a short filter like (cn=*abcde*), one might want to squeeze
>  all info one can get out of it, and use (&(cn=*abcd*)(cn=*bcde*)).
>  With a long filter like (cn=*Hallvard*Furuseth*), maybe
>  (&(cn=*Hall*)(cn=*vard*)(cn=*Furu*)(cn=*seth*)) is quite enough.

Just "llva" and "urse" could be enough.

>  So perhaps instead of configuring that filtering splits substrings by
>  steps of e.g. 2 characters, maybe it would be more useful to configure
>  some sort of measure of how much information to try to extract from
>  the index.

The candidate lookup code could try harder to avoid
subsequent lookups which would likely not further
reduce the candidate set.

>- If one wants indices to work for 2- or 3-character strings, one cannot
>  retain 4-character hashes.  So the above filter must either check
>  search for (&(cn=*123*)(cn=*345*)(cn=*567*)(cn=*789*)), which gives
>  less overlap and thus more false positives, or it must check
>  (&(cn=*123*)(cn=*234*)(cn=*345*)..........(cn=*789*)), which gives a
>  lot more index lookups.
>
>  So it would be nice to be able to configure that e.g. both 3- and
>  4-character substrings are indexed, though that does use more disk
>  space and more hash values.

Though I can see possibly configuring different attributes
(or different context) for different lengths, I doubt it generally
worth maintaining keys at 2 different lengths for the same set of
attributes in context.

>- Some of the above would be nice to configure per index instead of
>  globally, and that 'how much information is wanted' bit might
>  depend on how much information to expect from the other filter
>  components, but I'm not going there at this time unless someone
>  tells me it's easy:-)

It's not easy.