
Re: slapd/bdb stability problem



On Thu, 2005-06-02 at 12:42 +0200, Steffen Hansen wrote:
> Hi.
> 
> We use OpenLDAP in the Kolab project, but after switching to the bdb 
> backend there have been several reports about stability problems. Slapd 
> sometimes seems to hang when someone tries to write to the database 
> (for example with ldapadd).
> 
> The complete description is available at 
> https://intevation.de/roundup/kolab/issue707
> 
> Currently we use openldap-2.2.23 and db-4.2.52.2. Do you have any 
> suggestions on how I can get to the bottom of this problem? Anyone else 
> having similar problems? I'm out of ideas here, so any kind of help or 
> suggestion is greatly appreciated.

I've been seeing odd bdb hangs too. If you strace -f -p the slapd (or
whichever process is stuck), you see it sitting in a futex lock call.
There have been odd mutterings on the list but no definite example. A
repeated ps listing showing CPU% will show it tending to zero, and as far
as I can see the process really isn't doing anything. I could well be
wrong, as my analysis isn't great.
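
For anyone who wants to script the "CPU% tends to zero" check rather than
eyeball ps, here is a rough Python sketch. It assumes a Linux /proc
filesystem; the function names are my own invention, not part of any
OpenLDAP or Berkeley DB tool. It only compares accumulated CPU ticks
between two samples, so an idle but healthy slapd looks the same as a hung
one. It is only meaningful while a write is known to be outstanding.

```python
import time


def cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may contain spaces,
    # so split on the last ')' before splitting the remaining fields.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is the state (field 3), so utime (field 14) is rest[11]
    # and stime (field 15) is rest[12].
    return int(rest[11]) + int(rest[12])


def looks_stalled(ticks_before, ticks_after):
    """True if the process used no CPU time between the two samples."""
    return ticks_after == ticks_before


def sample_twice(pid, interval=5.0):
    """Read the tick counter for pid twice, interval seconds apart."""
    samples = []
    for _ in range(2):
        with open("/proc/%d/stat" % pid) as handle:
            samples.append(cpu_ticks(handle.read()))
        if len(samples) < 2:
            time.sleep(interval)
    return samples[0], samples[1]
```

Usage would be something like looks_stalled(*sample_twice(pid)) with the
slapd pid, run while an ldapadd is waiting on the server.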

What I find you have to do is kill -9 the hung process (any other signal
isn't strong enough). Then run db_verify on each db file (I have to
supply -o). One of these will hang, and you'll have to kill -9 that too.
Then run db_recover, which should work. Run db_verify again just to make
sure; it should pass. Now you can restart your slapd process.
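
The sequence above could be scripted roughly as follows. This is a sketch
only: the function names, the timeout value and the example directory are
assumptions of mine, and it presumes the hung slapd has already been
killed. The -h option names the database environment directory for both
db_verify and db_recover. A db_verify that exceeds the timeout is killed
by subprocess.run, which stands in for the manual kill -9.

```python
import subprocess


def recovery_steps(db_home, db_files):
    """Build the command list: verify each file, recover, verify again."""
    verify = [["db_verify", "-o", "-h", db_home, name] for name in db_files]
    recover = [["db_recover", "-h", db_home]]
    return verify + recover + verify


def run_step(cmd, timeout=300):
    """Run one command and report 'ok', 'failed' or 'hung'."""
    try:
        # On timeout, subprocess.run kills the child before raising.
        result = subprocess.run(cmd, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "hung"
    return "ok" if result.returncode == 0 else "failed"


def recover_database(db_home, db_files, timeout=300):
    """Run every step in order and return (command, outcome) pairs."""
    # A hang or failure in the first verify pass is expected here, so the
    # loop records the outcome and carries on instead of stopping.
    return [(cmd, run_step(cmd, timeout))
            for cmd in recovery_steps(db_home, db_files)]
```

For example, recover_database("/var/lib/ldap", ["id2entry.bdb",
"dn2id.bdb"]) with your own directory and file list. Only restart slapd if
every step after db_recover comes back "ok".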

As for why, I do not know. Because our imports take so long (even on
sunfire z20s), pulling and pushing data takes ages, and the hangs occur
at some point later, usually when I chop out a portion of the tree and
ldapadd a new one in. The chop will lock up, I'll kill it, and the result
is a broken db, fixable by my process above. This might be because of our
data: it's a huge mess, with some directories containing lots of entries
(>9000). I wouldn't have thought this to be a problem though. We're also
using OpenLDAP with a rootless tree, something else which many others
aren't doing.

All this and it's not even in a live environment. I'm winging it right
now, because when live I won't be doing this brutal surgery on the tree,
and I don't have any other option right now as 2.3.3 isn't ready yet. We
have a 2.2.24 on stock 4.2.52 on rh9 in production which performs
faultlessly, but we aren't touching it.

I would love to be able to spend time on investigating but I'm being
pulled in several different directions right now, such is life.

We're using OpenLDAP 2.2.26 with DB 4.2.53 with 3 patches (lock, lock2
and db_transactions). This is on Fedora Core 3 x86_64 on Opteron.


-- 
Rob Fielding
rob@dsvr.net

www.dsvr.co.uk                                             Development
Designer Servers                                    Business Serve Plc
