
Re: Dealing with BDB Crash

To: ldap@mm.st
Subject: Re: Dealing with BDB Crash
From: Aaron Richton <richton@nbcs.rutgers.edu>
Date: Wed, 30 Mar 2011 18:43:48 -0400 (EDT)
Cc: openldap-technical@openldap.org
In-reply-to: <1301523028.19376.1435697865@webmail.messagingengine.com>
References: <1301523028.19376.1435697865@webmail.messagingengine.com>
User-agent: Alpine 2.00 (SOC 1167 2008-08-23)

So, the best defense is a good offense in this case, and if you were running 2.4.25 with the appropriate BerkeleyDB library you'd likely not see an issue of this manner.

With that said, there was a time (with earlier releases of OpenLDAP) when we had this issue (one bdb go down, with the service apparently working via an overly simple smoke test). Not being fans of being bitten by the same failure mode twice, we wrote up a Nagios check that searches a known-present-on-disk entry that is in each of our databases. (You can either create one, or (ab)use "ou=People" if you're RFC2307 or use "cn=Manager" or what have you...) If any database doesn't return a hit, time for us to get a call.
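A minimal sketch of a check along these lines, in Python. This is an illustration only, not the actual check described above. It assumes the OpenLDAP ldapsearch client is on the PATH and that an anonymous simple bind may read the probe entries; the function names and the 10-second timeout are hypothetical choices.

```python
import subprocess

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def ldapsearch_hit(uri, dn, timeout=10):
    """Return True if a base-scope search for dn returns the entry.

    Assumption: anonymous simple bind (-x) is allowed to read dn.
    """
    cmd = ["ldapsearch", "-x", "-LLL", "-H", uri,
           "-b", dn, "-s", "base", "(objectClass=*)", "1.1"]
    try:
        out = subprocess.run(cmd, capture_output=True, text=True,
                             timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        # Client missing, or the server accepted the connection but hung.
        return False
    # With -LLL a hit is printed as LDIF beginning with "dn:" (or "dn::").
    return out.returncode == 0 and out.stdout.lower().startswith("dn:")


def check(uri, dns, search=ldapsearch_hit):
    """Probe one known entry per database; return (exit_code, status_line)."""
    missing = [dn for dn in dns if not search(uri, dn)]
    if missing:
        return CRITICAL, "CRITICAL: no hit for " + "; ".join(missing)
    return OK, "OK: %d database(s) answered" % len(dns)
```

A Nagios wrapper would call check() with one DN per configured database (for example "ou=People,o=example.com"), print the status line, and exit with the returned code. Because each database is probed separately, a single failed backend behind a listening slapd still produces a CRITICAL result.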

As an aside, I find this thoroughly fascinating timing. Not that it'll make you feel any better in the present case, but I was just considering writing something up for the next LDAPCon on how we do monitoring (there are ~10 angles we check from, many of them due to real life situations similar to yours). They're all relatively simple ideas like the above, but I suppose cleaning up our code to the point where it's world-safe and getting something written up on it may be useful. They've proven occasionally useful for slapd(8) code issues and also, more frequently, useful in the face of human factors.


On Wed, 30 Mar 2011, ldap@mm.st wrote:

A while ago I posted that we were having what we thought were random bdb
backend crashes with the following in our log:

bdb(o=example.com): PANIC: fatal region error detected; run recovery.

This was on our RH5 openldap servers (2.3.43) that we were
rebuilding.

It appears that the crashes were caused by a vulnerability scanner that
was hitting the server (still testing), even though it was supposed to be
safe.  We'll have to investigate what is causing it, maybe we will need
an acl to stop whatever the scanner is doing.  Once we stopped the
automated scan, the servers seem to be running as expected.

But, this brought up another issue.  When the bdb backend failed, the
slapd process continued to run and listen on the ldap ports and clients
still tried to connect to the failed server for authentication.  The
server accepted and established the connection with the client.  Of
course the client could not authenticate since the backend db was down.
The client will not fail over to the other server that is listed in its
ldap.conf file since it thinks it has a valid connection.  If the slapd
process is not running then the fail over works fine since no ports are
there for the client to connect to.

I'm thinking that bdb failures will be rare once we solve the scanner
issue, but on a network that relies heavily on ldap, a failed bdb
backend with a running slapd would cause significant issues.

Just trying to restart the slapd service doesn't fix the issue, a manual
recovery is required (slapd_db_recover).  I was curious if anyone has
put something in place to deal with this potential issue?  Maybe run
slapd_db_status via cron and if it errors due to a bdb corruption, just
stop slapd and let the admin know.  At least the clients would be able
to failover to the other ldap servers.  I guess an automated recovery is
possible via a script, but I'm not sure if that's a good idea.  Maybe
dealing with this type of failure is not really required, I was hoping
that some of you that have been doing this for a while would have some
insight.
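
The cron idea above (check the database, and on failure stop slapd and tell the admin) might be sketched roughly as follows. This is a sketch under assumptions, not a tested procedure: the status and stop commands are supplied by the caller because their exact names and arguments differ between installations, the function name is hypothetical, and it relies on the status command exiting non-zero when the environment needs recovery. It deliberately does not attempt recovery.

```python
import subprocess


def watchdog(status_cmd, stop_cmd, notify, run=subprocess.run):
    """Stop slapd and notify the admin if the status command fails.

    status_cmd, stop_cmd: argument lists chosen by the caller (assumptions).
    notify: callable taking one message string, e.g. a mail wrapper.
    Returns True if the database looked healthy, False otherwise.
    """
    result = run(status_cmd, capture_output=True, text=True)
    if result.returncode == 0:
        return True
    # Stopping slapd closes the listening ports so clients fail over.
    run(stop_cmd, capture_output=True, text=True)
    notify("bdb status check failed (exit %d); slapd stopped, "
           "manual recovery needed" % result.returncode)
    return False
```

Run from cron every few minutes, this would turn a half-dead server into a fully stopped one, which is the state in which client failover was observed to work.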
