
Re: need suggestion on indexing big directory



Quanah Gibson-Mount wrote:
> Note that in *repeated* tests I've done, it was always quicker to
> "slapcat" the entire database and then "slapadd" it back in than to
> run slapindex. There was some work done at one point to fix this
> problem; I don't recall if it made it into 2.2 or not. IIRC there
> were some unintended side effects, and it was put off for now.

Yes, a few different approaches were tried, none with any positive effect. Testing for the existence of an entry (so that a redundant add can be skipped) took as much execution time as blindly adding it and catching the error code when the entry already exists. It's clear that adding an item that already exists in BDB is not a no-op; in several cases the size of the underlying database changed even though the transaction was aborted. That is possibly a BDB bug, but the real cause is hard to trace.
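For illustration, here is a minimal sketch of the two patterns being compared, written against the raw BDB C API rather than slapd's actual index code; the function names and the use of DB_NOOVERWRITE are my own choices for the example:

#include <string.h>
#include <db.h>

/* Sketch only: "db" is an open DB handle, "txn" an active transaction;
 * the key/value DBTs stand in for an index key and its data. */

/* Pattern 1: probe first, then write only if the key is absent.
 * The probe costs a full lookup even when the key already exists. */
int add_if_absent(DB *db, DB_TXN *txn, DBT *key, DBT *data)
{
	DBT probe;
	memset(&probe, 0, sizeof probe);
	int rc = db->get(db, txn, key, &probe, 0);
	if (rc == DB_NOTFOUND)
		rc = db->put(db, txn, key, data, 0);
	else if (rc == 0)
		rc = DB_KEYEXIST;	/* already present */
	return rc;
}

/* Pattern 2: write blindly and catch the duplicate-key error.
 * As noted above, this turned out to be no cheaper in practice. */
int add_blind(DB *db, DB_TXN *txn, DBT *key, DBT *data)
{
	int rc = db->put(db, txn, key, data, DB_NOOVERWRITE);
	if (rc == DB_KEYEXIST)
		rc = 0;		/* treat "already there" as success */
	return rc;
}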


Some comparisons on a 330,000-entry database:

running slapindex after changing a single attribute's indexing from "eq" to "eq,sub": over 26 hours

running "slapcat" then "slapadd" for the same DB with memory cache: approx. 2 hours

running "slapcat" then "slapadd" for the same DB with disk cache: approx. 6 hours

There's also the fact that slapindex places a doubled demand on the BDB cache: it requires the entry information in the database to be loaded, crunched into index data, and then written out to the index databases. In the slapadd case, the entry information is read as plain text, so the demand on the BDB cache is much lower; the entire cache can be used for deferred writes, whereas for slapindex it must serve both reads and writes.

Some of the overhead can be avoided if the entry information is mapped in directly from its database files, instead of being copied into the BDB cache. However, BDB generally will not memory-map files over a certain size, and it won't do it at all if the file has already been used with the main cache. So to take advantage of memory mapping, you would first need to raise the size limit (in DB_CONFIG) to accommodate your id2entry database, and then run db_recover to flush out all the current id2entry pages, before running slapindex.
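As a sketch, assuming the database directory is /var/lib/ldap and id2entry.bdb is under 1GB (adjust the values to your own environment), DB_CONFIG would get a line like:

# allow BDB to memory-map files up to 1GB
set_mp_mmapsize 1073741824

and then, with slapd stopped:

db_recover -h /var/lib/ldap
slapindex -f /etc/openldap/slapd.conf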
--
-- Howard Chu
Chief Architect, Symas Corp. Director, Highland Sun
http://www.symas.com http://highlandsun.com/hyc
Symas: Premier OpenSource Development and Support