[Date Prev][Date Next] [Chronological] [Thread] [Top]

Re: substring filter/index algorithms



Kurt D. Zeilenga writes:
>At 08:31 AM 10/15/2004, Hallvard B Furuseth wrote:
>
>> Does someone on this list know how to implement and tune substring
>> indexing, or know where to ask?  It looks to me like OpenLDAP's
>> substring filtering algorithm, and partly the indexing algorithm, can be
>> significantly improved.  I'd appreciate advice on how to do this - or
>> advice not to bother:-)
> 
> My advice would be, in the voice of Elmur Fudd, "vewy vewy carefully" :-).
> 
> The problem, aside from being very use (data and query) dependent,
> there are numerous factors involved and the interplay between
> these factors can be very subtle and quite complex.

If anyone wants to explain some of that, or say what to look for or what
to read up on, I'd appreciate it.

>>(...)
>>  A search for
>>  (cn=*123456789*) is treated as (&(cn=*1234*)(cn=*3456*)(cn=*5678*))
>>  for indexing purposes, stepping 2 characters at a time.
>>(...)
>>
>>- Unless I've missed something, the final "9" in the filter above
>>  is not used for indexing.
>>
>>  It would be better to include (cn=*6789*) (&...) filter above, in
>>  addition to or instead of the final term.
> 
> First, "better" is to vague.
> 
> Including (cn=*6789*) lookup may further reduce the candidate set,
> but at the expense of doing another index read.  Whether that lookup
> is worth doing, well, depends.

Yes, (cn=*6789*) in _addition_ to the current terms is a mixed blessing.
OTOH, (cn=*6789*) _instead_ of the final term should usually be a win,
since there is less overlap with text which has already been matched.
That's "better" as a global option.

>> (...)
>>  So perhaps instead of configuring that filtering splits substrings by
>>  steps of e.g. 2 characters, maybe it would be more useful to configure
>>  some sort of measure of how much information to try to extract from
>>  the index.
> 
> The candidate lookup code could try harder to avoid
> subsequent lookups which would likely not further
> reduce the candidate set.

Something like, keep checking filter components as long as the list of
candidates is either shrinking or still "too large", or weigh those
factors against each other somehow?  That sounds nice, but also still
seems rather hard to tune.

And then I suppose I'd reduce the substring length to 3 and step to 1,
instead of having two substring lengths?

BTW, another thing which would help is to rearrange the order in which
the substrings are checked - first strings that do not overlap, then
the rest.

>>- If one wants indices to work for 2- or 3-character strings, one cannot
>>  retain 4-character hashes.  So the above filter must either check
>>  search for (&(cn=*123*)(cn=*345*)(cn=*567*)(cn=*789*)), which gives
>>  less overlap and thus more false positives, or it must check
>>  (&(cn=*123*)(cn=*234*)(cn=*345*)..........(cn=*789*)), which gives a
>>  lot more index lookups.
>>
>>  So it would be nice to be able to configure that e.g. both 3- and
>>  4-character substrings are indexed, though that does use more disk
>>  space and more hash values.
> 
> Though I can see possibly configuring different attributes
> (or different context) for different lengths, I doubt it generally
> worth maintaining keys at 2 different lengths for the same set of
> attributes in context.

Well, at our site we do feel it's a problem that 3-byte substrings are
not indexed.  However, if we reduce the substring length to 3, we'll
need a step of 1, which more than doubles the number of substrings to
check.  The increased time to update the index and the extra disk
storage for a 2-length index doesn't seem like much of a problem, but I
don't know how much a larger index slows down database lookups.

OTOH, maybe your above suggestion fixes the problem, so a length-3
index will be OK.

-- 
Hallvard