[Date Prev][Date Next] [Chronological] [Thread] [Top]

substring filter/index algorithms

Does someone on this list know how to implement and tune substring
indexing, or know where to ask?  It looks to me like OpenLDAP's
substring filtering algorithm, and partly the indexing algorithm, can be
significantly improved.  I'd appreciate advice on how to do this - or
advice not to bother:-)

Currently it works like this (thanks to Howard for the explanation):

  A subany index will store a hash of each 4-byte substring of an
  indexed string, e.g. "abcdef" gets index-values for *abcd*, *bcde* and
  *cdef*.  A subinitial index gets initial substrings of 2-4 characters:
  ab*, abc* and abcd*.  Similar for subfinal.  A search for
  (cn=*123456789*) is treated as (&(cn=*1234*)(cn=*3456*)(cn=*5678*))
  for indexing purposes, stepping 2 characters at a time.
  The numbers 4 and 2 can now be configured.

I can see several things to improve with that, but I'd rather not
reinvent an outdated approach:

- Unless I've missed something, the final "9" in the filter above
  is not used for indexing.

  It would be better to include (cn=*6789*) (&...) filter above, in
  addition to or instead of the final term.

- With a short filter like (cn=*abcde*), one might want to squeeze
  all info one can get out of it, and use (&(cn=*abcd*)(cn=*bcde*)).
  With a long filter like (cn=*Hallvard*Furuseth*), maybe
  (&(cn=*Hall*)(cn=*vard*)(cn=*Furu*)(cn=*seth*)) is quite enough.

  So perhaps instead of configuring that filtering splits substrings by
  steps of e.g. 2 characters, maybe it would be more useful to configure
  some sort of measure of how much information to try to extract from
  the index.

- If one wants indices to work for 2- or 3-character strings, one cannot
  retain 4-character hashes.  So the above filter must either check
  search for (&(cn=*123*)(cn=*345*)(cn=*567*)(cn=*789*)), which gives
  less overlap and thus more false positives, or it must check
  (&(cn=*123*)(cn=*234*)(cn=*345*)..........(cn=*789*)), which gives a
  lot more index lookups.

  So it would be nice to be able to configure that e.g. both 3- and
  4-character substrings are indexed, though that does use more disk
  space and more hash values.

- Some of the above would be nice to configure per index instead of
  globally, and that 'how much information is wanted' bit might
  depend on how much information to expect from the other filter
  components, but I'm not going there at this time unless someone
  tells me it's easy:-)