[Date Prev][Date Next] [Chronological] [Thread] [Top]

RE: back-bdb performance

> -----Original Message-----
> From: owner-openldap-devel@OpenLDAP.org
> [mailto:owner-openldap-devel@OpenLDAP.org]On Behalf Of Marijn Meijles

> You wrote:
> > nodes/pointers (4 bytes per pointer). Plus 400MB in RDN
> storage. 560MB, not
> > 5GB. Sure, I only have 256MB physical RAM in my build system,
> but 560MB is
> > not a ridiculous requirement for a database of this size. To
> put this into
> > perspective, how much dbcache would you need to configure to get good
> > performance out of a standard backend for this size of database?

> My mistake ;) 560mb is reasonable. However, to read 10 million entries at
> startup is still silly in my point of view.
> 560mb is also without locks which you will need for a bit concurrency.

Technically, it is only reading 10 million RDNs, not 10 million entries.
That would certainly be going overboard to completely parse every entry.

> In openldap2 this is indeed the case. We changed the scheme a
> long time ago,
> Subtree idls suck big time. our dn2id and id2children are just O(log n)

> True, but then again, it's all about tradeoffs. To be honest, I think
> there's a better solution to this than either of us has thought of.
> Maybe I'll think about it during christmas ;)

Yes, it's about tradeoffs. Look at it this way - with the current dn2id
index scheme, you can retrieve a base entry with only 2 database
operations - one to lookup in dn2id, and one to lookup in id2entry. When you
replace dn2id with a hierarchical structure, it takes (H/2) database lookups
to find the ID (where H is the height of your tree). Without some really
good caching you actually lose performance when looking for a base entry.
Then to perform a subtree search, the dn2id database needs only one database
read to obtain the subtree IDL, while the hierarchical approach needs to
recursively extract the id2children info. Again, the hierarchical approach
loses due to increased database I/O. Assuming the entire database is
relatively small, the dn2id approach works very well.

For the hierarchical approach to be generally worthwhile you have to be able
to meet or beat that 3-operation lookup rate. By keeping the entire tree
layout in RAM the hierarchical approach requires only 1 database operation
to obtain an entry (id2entry lookup). I suppose you could call it dumb
caching, because reading the whole tree means you've read things regardless
of whether they will be requested and used. But this approach gives the
absolute minimum per-operation I/O cost. Without it, you always have (H)
database I/Os per search in addition to the id2entry lookup. This overhead
will exceed the "dumb caching" overhead after only a few searches have been

The point of using id2parent on disk instead of id2children is again, you
minimize database I/Os for updates. I designed this approach a couple of
years ago, before I had started using BDB and transactions (I used gdbm).
This approach requires only one database operation to move a subtree, so it
can be treated as atomic from the app level. Using id2children structures
requires unlinking from one parent structure and relinking into another
parent structure, which opens the door for inconsistencies to arise,
particularly if you get a system crash in the middle. I guess in the context
of transactions this resilience is less crucial, but using the minimum
number of database I/Os is still a performance gain.

I'm sure you can come up with a design that has a smaller memory footprint
than this, but you cannot shave off any more I/Os.

  -- Howard Chu
  Chief Architect, Symas Corp.       Director, Highland Sun
  http://www.symas.com               http://highlandsun.com/hyc
  Symas: Premier OpenSource Development and Support