RE: bdb corruption
> -----Original Message-----
> From: owner-openldap-software@OpenLDAP.org
> [mailto:owner-openldap-software@OpenLDAP.org]On Behalf Of Quanah
> --On Thursday, November 20, 2003 12:38 PM -0900 Matthew Schumacher
> <email@example.com> wrote:
> > From the reading I have been doing on this list, this is a bdb bug that
> > has no known timeline for a fix. Some people say it's avoidable by
> > setting a large cache size (big enough for your entire db) in your
> > DB_CONFIG file. Others avoid the problem by using solaris.
> > List: Please correct me if I'm wrong, but that is the conclusion I have
> > come to dealing with this problem.
> > Anyone know exactly what is going wrong in bdb, are they aware of the
> > problem?
Yes, SleepyCat is aware of the problems, and the 4.2 release addresses them.
> I use solaris AND put everything into a cache. I believe BDB 4.2 is
> supposed to have fixes to the problem, but not having a proper cache size
> will still cause performance issues.
This is a significant point and bears repeating:
Not having a proper cache size will cause performance issues.
Having a proper cache size will avoid these issues.
There is no actual corruption occurring in the database. It is merely the
fact that the cache is thrashing itself that causes performance/response time
to slow down. When you take the time to actually read the documentation,
measure the library performance using db_stat, and tune your settings, you
will not run into these problems.
It is not absolutely necessary to configure a BDB cache equal in size to your
entire database. All that you need is a cache that's large enough for your
"working set": large enough to hold all of the most frequently accessed
data, plus a few of the less-frequently accessed items.
Let me spell out what that really means here, in detail:
Start with the most obvious - the back-bdb database lives in two main files,
dn2id.bdb and id2entry.bdb. These are B-tree databases. We have never
documented the back-bdb internal layout before, because it didn't seem like
something anyone should have to worry about, nor was it necessarily cast in
stone. But here's how it works today, in OpenLDAP 2.1 and 2.2.
A B-tree is a balanced tree; it stores data in its leaf nodes and bookkeeping
data in its interior nodes. (If you don't know what tree data structures look
like in general, Google for some references, because that's getting far too
elementary for the purposes of this discussion.)
For decent performance, you need enough cache memory to contain all the nodes
along the path from the root of the tree down to the particular data item
you're accessing. That's enough cache for a single search. For the general
case, you want enough cache to contain all the internal nodes in the
database. "db_stat -d" will tell you how many internal pages are present in a
database. You should check this number for both dn2id and id2entry.
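As a sketch, the two numbers you need can be pulled out of the db_stat -d
output with a few lines of Python. The label substrings used here are
assumptions based on BDB 4.x btree statistics output, so check them against
what your db_stat actually prints:

```python
import re

def btree_stats(text):
    """Parse (page_size, internal_pages) out of `db_stat -d` output.

    The label substrings below are assumptions based on BDB 4.x btree
    statistics output; adjust them if your db_stat prints differently.
    """
    def field(label):
        # db_stat prints lines of the form "<value>\t<description>"
        m = re.search(r"^(\d+)\s.*" + re.escape(label), text, re.MULTILINE)
        return int(m.group(1)) if m else 0
    return field("page size"), field("internal pages")
```

Feed it the captured output of "db_stat -d dn2id.bdb", and again for
id2entry.bdb.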
Also note that id2entry always uses 16KB per "page", while dn2id uses
whatever the underlying filesystem uses, typically 4 or 8KB. To avoid
thrashing the cache and triggering these infinite hang bugs in BDB 4.1.25,
your cache must be at least as large as the number of internal pages in both
the dn2id and id2entry databases, plus some extra space to accommodate the
actual leaf data pages.
For example, in my OpenLDAP 2.2 test database, I have an input LDIF file
that's about 360MB. With the back-hdb backend this creates a dn2id.bdb that's
68MB, and an id2entry that's 800MB. db_stat tells me that dn2id uses 4KB
pages, has 433 internal pages, and 6378 leaf pages. The id2entry uses 16KB
pages, has 52 internal pages, and 45912 leaf pages. In order to efficiently
retrieve any single entry in this database, the cache should be at least
(433+1) * 4KB + (52+1) * 16KB in size: 1736KB + 848KB =~ 2.5MB. This doesn't
take into account other library overhead, so this is even lower than the
barest minimum. The default cache size, when nothing is configured, is only
256KB. If you tried to do much of anything with this database using only the
default settings, BDB 4.1.25 would lock up in an infinite loop.
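The arithmetic above can be checked with a few lines of Python. This is just
a sketch of the rule described here (all internal pages plus one leaf page
per tree), and the helper name is mine, not anything in BDB:

```python
def min_btree_cache(internal_pages, page_size):
    # every internal page on the root-to-leaf path, plus at least
    # one leaf page to hold the data item itself
    return (internal_pages + 1) * page_size

dn2id = min_btree_cache(433, 4 * 1024)      # 1736 KB
id2entry = min_btree_cache(52, 16 * 1024)   # 848 KB
print((dn2id + id2entry) / 1024)            # 2584.0 KB, roughly 2.5MB
```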
This 2.5MB number also doesn't take indexing into account. Each indexed
attribute uses another database file of its own, using a Hash structure.
Unlike the B-trees, where you only need to touch one data page to find an
entry of interest, doing an index lookup generally touches multiple keys, and
the point of a hash structure is that the keys are evenly distributed across
the data space. That means there's no convenient compact subset of the
database that you can keep in the cache to ensure quick operation; you can
pretty much expect references to be scattered across the whole thing. My
strategy here would be to provide enough cache for at least 50% of all of the
hash data. (Number of hash buckets + number of overflow pages + number of
duplicate pages) * page size / 2.
The objectClass index for my example database is 5.9MB and uses 3 hash
buckets and 656 duplicate pages. So ( 3 + 656 ) * 4KB / 2 =~ 1.3MB.
With only this index enabled, I'd figure at least a 4MB cache for this
backend. (Of course you're using a single cache shared among all of the
database files, so the cache pages will most likely get used for something
other than what you accounted for, but this gives you a fighting chance.)
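For reference, the cache size is set with the set_cachesize directive in the
DB_CONFIG file in the database directory; its three arguments are gigabytes,
bytes, and the number of cache segments. A 4MB cache in a single segment
would look something like:

```
# DB_CONFIG: 0 GB + 4194304 bytes (4MB), in one contiguous segment
set_cachesize 0 4194304 1
```

Note that BDB only reads DB_CONFIG when the environment is created, so the
change takes effect on the next (re)creation of the environment.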
With this 4MB cache I can slapcat this entire database on my 1.3GHz PIII in 1
minute, 40 seconds. With the cache doubled to 8MB, it still takes the same
1:40. Once you've got enough cache to fit the B-tree internal pages,
increasing it further won't have any effect until the cache really is large
enough to hold 100% of the data pages. I don't have enough free RAM to hold
all the 800MB id2entry data, so 4MB is good enough.
And *that* is my definition of "how big a cache is big enough?"
-- Howard Chu
Chief Architect, Symas Corp. Director, Highland Sun
Symas: Premier OpenSource Development and Support