[Date Prev][Date Next]
Re: substring filter/index algorithms
At 08:31 AM 10/15/2004, Hallvard B Furuseth wrote:
>Does someone on this list know how to implement and tune substring
>indexing, or know where to ask? It looks to me like OpenLDAP's
>substring filtering algorithm, and partly the indexing algorithm, can be
>significantly improved. I'd appreciate advice on how to do this - or
>advice not to bother:-)
My advice would be, in the voice of Elmur Fudd, "vewy vewy carefully" :-).
The problem, aside from being very use (data and query) dependent,
there are numerous factors involved and the interplay between
these factors can be very subtle and quite complex.
>Currently it works like this (thanks to Howard for the explanation):
> A subany index will store a hash of each 4-byte substring of an
> indexed string, e.g. "abcdef" gets index-values for *abcd*, *bcde* and
> *cdef*. A subinitial index gets initial substrings of 2-4 characters:
> ab*, abc* and abcd*. Similar for subfinal. A search for
> (cn=*123456789*) is treated as (&(cn=*1234*)(cn=*3456*)(cn=*5678*))
> for indexing purposes, stepping 2 characters at a time.
> The numbers 4 and 2 can now be configured.
>I can see several things to improve with that, but I'd rather not
>reinvent an outdated approach:
>- Unless I've missed something, the final "9" in the filter above
> is not used for indexing.
> It would be better to include (cn=*6789*) (&...) filter above, in
> addition to or instead of the final term.
First, "better" is to vague.
Including (cn=*6789*) lookup may further reduce the candidate set,
but at the expense of doing another index read. Whether that lookup
is worth doing, well, depends. I would guess that, in many cases,
it would be worth doing.
>- With a short filter like (cn=*abcde*), one might want to squeeze
> all info one can get out of it, and use (&(cn=*abcd*)(cn=*bcde*)).
> With a long filter like (cn=*Hallvard*Furuseth*), maybe
> (&(cn=*Hall*)(cn=*vard*)(cn=*Furu*)(cn=*seth*)) is quite enough.
Just "llva" and "urse" could be enough.
> So perhaps instead of configuring that filtering splits substrings by
> steps of e.g. 2 characters, maybe it would be more useful to configure
> some sort of measure of how much information to try to extract from
> the index.
The candidate lookup code could try harder to avoid
subsequent lookups which would likely not further
reduce the candidate set.
>- If one wants indices to work for 2- or 3-character strings, one cannot
> retain 4-character hashes. So the above filter must either check
> search for (&(cn=*123*)(cn=*345*)(cn=*567*)(cn=*789*)), which gives
> less overlap and thus more false positives, or it must check
> (&(cn=*123*)(cn=*234*)(cn=*345*)..........(cn=*789*)), which gives a
> lot more index lookups.
> So it would be nice to be able to configure that e.g. both 3- and
> 4-character substrings are indexed, though that does use more disk
> space and more hash values.
Though I can see possibly configuring different attributes
(or different context) for different lengths, I doubt it generally
worth maintaining keys at 2 different lengths for the same set of
attributes in context.
>- Some of the above would be nice to configure per index instead of
> globally, and that 'how much information is wanted' bit might
> depend on how much information to expect from the other filter
> components, but I'm not going there at this time unless someone
> tells me it's easy:-)
It's not easy.