
Re: substring index oddity



John Madden wrote:
"uid=*0371*" dn
# numResponses: 125
# numEntries: 124
real 0m0.052s

Further research on the "allidsthreshold" concept mentioned in the old list thread lead me to SLAPD_LDBM_MIN_MAXIDS, which, at 8192-4, is likely too low for a million objects that were created sequentially. Unfortunately, I'm running Debian for a reason -- going back to compiling from source (as I do now) is a last resort.

(Since I'm using bdb, is the #define even relevant?)

No. Nothing about LDBM has any relevance to back-bdb.
> The, uh, "good news" is that the numEntries I'm seeing for the "bad" query is far
> below 8188, just 1111.  So perhaps this isn't an allids problem?  For reference,
> the searches with numEntries:
>
> uid=*222* : 29 seconds
> # numEntries: 3700

As already noted, by default a substring term must have at least 4 characters in order for indexing to have any effect.
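
Roughly: the indexer only generates keys from fixed-length chunks of the asserted value, so a shorter "any" term yields no keys at all and slapd falls back to scanning candidates. Here is a toy sketch of the idea -- not the actual slapd code; the names are made up, the default chunk length of 4 is assumed, and it emits a key at every offset where the real code uses a configurable step:

    #include <stdio.h>
    #include <string.h>

    #define SUBSTR_ANY_LEN 4   /* assumed default minimum for "*foo*" terms */

    /* print the index keys a substring-any assertion would generate */
    static void any_keys(const char *val)
    {
        size_t len = strlen(val);
        if (len < SUBSTR_ANY_LEN) {
            printf("\"*%s*\": too short, no index keys -> candidate scan\n", val);
            return;
        }
        for (size_t i = 0; i + SUBSTR_ANY_LEN <= len; i++)
            printf("\"*%s*\": key \"%.*s\"\n", val, SUBSTR_ANY_LEN, val + i);
    }

    int main(void)
    {
        any_keys("222");    /* like uid=*222* above: unindexed, 29 seconds */
        any_keys("0371");   /* like uid=*0371* above: one 4-char key, 0.052s */
        return 0;
    }
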
> uid=*222 : 0.063 seconds
> # numEntries: 1000
> uid=*2*22 : 0.14 seconds
> # numEntries: 3439
>
> And then, just for fun I did:
>
> uid=*2 : 29 seconds
> # numEntries: 100000
subinitial and subfinal default to a minimum of two characters, so this search didn't use the index either.
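
The same idea applies to the initial/final (foo* / *foo) case, just with a lower floor. A minimal sketch under the same caveats (illustrative names, default minimum of 2 assumed):

    #include <stdio.h>
    #include <string.h>

    #define SUBSTR_IF_MINLEN 2   /* assumed default floor for initial/final terms */

    /* does an initial (foo*) or final (*foo) term qualify for the index? */
    static void if_key(const char *kind, const char *val)
    {
        if (strlen(val) < SUBSTR_IF_MINLEN)
            printf("%s \"%s\": too short, unindexed -> candidate scan\n", kind, val);
        else
            printf("%s \"%s\": index key usable\n", kind, val);
    }

    int main(void)
    {
        if_key("final", "2");    /* uid=*2  above: 29 seconds   */
        if_key("final", "22");   /* uid=*22 above: 0.41 seconds */
        return 0;
    }
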
> uid=*22 : 0.41 seconds
> # numEntries: 10000
>
> ...So 10,000 entries can be returned off an index search, well over the 8188. Is
> there another allids-like limit someplace?
There is something like that limit; the default in back-bdb is around 131072 (131072 for in-memory ID lists, 65536 for the on-disk index slots), but it doesn't mean quite the same thing as allids meant in back-ldbm. That is, when back-ldbm hit its limit, it would return ALLIDS, which literally meant "all entries in the database." When back-bdb hits its limit, the ID list turns into a range, with a low end at the first/lowest of the ~131,000 IDs and a high end at the last/highest of them.

Once you get into operating as a range, you may get degraded performance, but not quite as bad as back-ldbm: the performance degrades depending on how discontiguous the original list of IDs was. If they were all contiguous, there is no degradation; if there were large gaps in the ID list, performance degrades.

E.g., if the threshold for converting to a range were only 4 IDs and you had the list 1,2,3,4, then adding ID 5 gives the range 1-5 and there is no performance penalty. But if you had the list 1,6,10,15 and add ID 20, you get the range 1-20, and performance suffers because the search will examine 20 entries instead of just 5. That's still better than examining every entry in the DB the way back-ldbm does, but people running larger DBs should probably still increase these limits.
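
To make the arithmetic concrete, here is a toy model of that conversion -- not back-bdb's actual idl code; the names are invented and the threshold is artificially tiny (back-bdb's in-memory limit is ~131072):

    #include <stdio.h>

    #define IDL_MAX 4   /* toy threshold; back-bdb's is ~131072 in memory */

    typedef struct {
        int is_range;                /* 0: explicit list, 1: collapsed range */
        size_t count;
        unsigned long ids[IDL_MAX];  /* list form: sorted entry IDs */
        unsigned long lo, hi;        /* range form: first and last ID */
    } idl_t;

    /* insert an ID, collapsing to a range when the list would overflow */
    static void idl_insert(idl_t *idl, unsigned long id)
    {
        if (idl->is_range) {
            if (id < idl->lo) idl->lo = id;
            if (id > idl->hi) idl->hi = id;
        } else if (idl->count == IDL_MAX) {
            idl->lo = id < idl->ids[0] ? id : idl->ids[0];
            idl->hi = id > idl->ids[idl->count - 1] ? id : idl->ids[idl->count - 1];
            idl->is_range = 1;
        } else {
            idl->ids[idl->count++] = id;   /* assumes IDs arrive in order */
        }
    }

    /* how many entries a search over this list must now examine */
    static unsigned long idl_cost(const idl_t *idl)
    {
        return idl->is_range ? idl->hi - idl->lo + 1 : (unsigned long)idl->count;
    }

    int main(void)
    {
        idl_t a = {0}, b = {0};
        unsigned long contiguous[] = {1, 2, 3, 4, 5};
        unsigned long gappy[]      = {1, 6, 10, 15, 20};

        for (int i = 0; i < 5; i++) idl_insert(&a, contiguous[i]);
        for (int i = 0; i < 5; i++) idl_insert(&b, gappy[i]);

        printf("contiguous 1-5: examine %lu entries\n", idl_cost(&a)); /* 5  */
        printf("gappy 1..20:    examine %lu entries\n", idl_cost(&b)); /* 20 */
        return 0;
    }

Same five matching entries in both cases, but the gappy range makes the search examine four times as many entries, which is why sequentially created IDs fare better than scattered ones.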

--
 -- Howard Chu
 Chief Architect, Symas Corp.  http://www.symas.com
 Director, Highland Sun        http://highlandsun.com/hyc
 OpenLDAP Core Team            http://www.openldap.org/project/