
Re: Dealing with BDB Crash



Interesting and good information.  We happen to use Big Brother/Xymon
for monitoring and have multiple scripts to check things like cache,
locks, etc.  We get notified when these sense a problem, but getting
notified at 1 AM on a Saturday and fixing the issue before all of those
services are impacted is a little scary.  That's why we were
contemplating whether it would be wise to "hit it with a hammer"
until we are able to intervene and repair.
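
(For the curious, the cache/lock checks amount to something like the Python
sketch below. It's simplified, not our exact scripts, and the tool path and
database directory are assumptions for a stock RHEL install, so adjust to
taste.)

#!/usr/bin/env python
# Rough sketch of a cache/lock sanity check against a BDB-backed slapd.
# Assumptions (not from the original post): the RHEL-packaged tool name
# slapd_db_stat and the default /var/lib/ldap database directory; the
# plain db_stat from the matching BerkeleyDB build behaves the same way.
import subprocess
import sys

DB_STAT = "/usr/sbin/slapd_db_stat"   # assumed RHEL openldap-servers path
DB_DIR = "/var/lib/ldap"              # assumed database directory

def run_stat(flag):
    # db_stat -m prints cache (memory pool) stats, -c prints lock stats
    proc = subprocess.Popen([DB_STAT, flag, "-h", DB_DIR],
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    out, _ = proc.communicate()
    return proc.returncode, out.decode("utf-8", "replace")

problems = []
for flag, label in (("-m", "cache"), ("-c", "locks")):
    rc, out = run_stat(flag)
    if rc != 0 or "PANIC" in out:
        problems.append("%s check failed (rc=%d)" % (label, rc))

if problems:
    # report however your Xymon/BB client expects; non-zero exit for cron
    print("red: " + "; ".join(problems))
    sys.exit(1)
print("green: bdb cache/lock stats look sane")
sys.exit(0)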

On Wed, 30 Mar 2011 18:43 -0400, "Aaron Richton"
<richton@nbcs.rutgers.edu> wrote:
> So, the best defense is a good offense in this case: if you were
> running 2.4.25 with the appropriate BerkeleyDB library, you'd likely not
> see an issue of this nature.
> 
> With that said, there was a time (with earlier releases of OpenLDAP) when 
> we had this issue (one bdb down, with the service apparently working 
> via an overly simple smoke test). Not being fans of being bitten by the 
> same failure mode twice, we wrote up a Nagios check that searches a 
> known-present-on-disk entry that is in each of our databases. (You can 
> either create one, or (ab)use "ou=People" if you're RFC2307 or use 
> "cn=Manager" or what have you...) If any database doesn't return a hit, 
> time for us to get a call.
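
(That known-entry probe is easy to script. The sketch below is my rough
reading of the idea, not Aaron's actual plugin; the URI and base DNs are
placeholders, so substitute entries that really exist in each of your
databases.)

#!/usr/bin/env python
# Nagios-style probe: base-scope search of one known-present entry per
# database, CRITICAL if any of them fails to come back.  The URI and the
# DNs are placeholders, not anyone's real configuration.
import subprocess
import sys

LDAP_URI = "ldap://ldap.example.com"      # placeholder
KNOWN_ENTRIES = [
    "ou=People,o=example.com",            # placeholder: one entry per database
    "cn=Manager,o=example.com",
]

failed = []
for dn in KNOWN_ENTRIES:
    cmd = ["ldapsearch", "-x", "-LLL",
           "-H", LDAP_URI,
           "-b", dn, "-s", "base",
           "-l", "10",                    # search time limit in seconds
           "(objectClass=*)", "1.1"]      # 1.1 = request no attributes
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    out, _ = proc.communicate()
    if proc.returncode != 0 or "dn:" not in out.decode("utf-8", "replace"):
        failed.append(dn)

if failed:
    print("CRITICAL: no answer for " + ", ".join(failed))
    sys.exit(2)   # Nagios CRITICAL
print("OK: all known entries returned")
sys.exit(0)       # Nagios OK

Exit code 2 is what makes Nagios flag the check CRITICAL and page someone.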
> 
> As an aside, I find the timing of this thoroughly fascinating. Not that
> it'll make you feel any better in the present case, but I was just
> considering writing something up for the next LDAPCon on how we do
> monitoring (there are ~10 angles we check from, many of them due to
> real-life situations similar to yours). They're all relatively simple
> ideas like the above, but I suppose cleaning up our code to the point
> where it's world-safe and getting something written up on it may be
> useful. They've proven occasionally useful for slapd(8) code issues and
> also, more frequently, useful in the face of human factors.
> 
> On Wed, 30 Mar 2011, ldap@mm.st wrote:
> 
> > A while ago I posted that we were having what we thought were random bdb
> > backend crashes with the following in our log:
> >
> > bdb(o=example.com): PANIC: fatal region error detected; run recovery.
> >
> > This was on our RH5 OpenLDAP servers (2.3.43) that we were
> > rebuilding.
> >
> > It appears that the crashes were caused by a vulnerability scanner that
> > was hitting the server (still testing), even though it was supposed to be
> > safe.  We'll have to investigate what is causing it; maybe we will need
> > an ACL to stop whatever the scanner is doing.  Once we stopped the
> > automated scan, the servers seem to be running as expected.
> >
> > But, this brought up another issue.  When the bdb backend failed, the
> > slapd process continued to run and listen on the ldap ports, and clients
> > still tried to connect to the failed server for authentication.  The
> > server accepted and established the connection with the client.  Of
> > course the client could not authenticate since the backend db was down.
> > The client will not fail over to the other server that is listed in its
> > ldap.conf file since it thinks it has a valid connection.  If the slapd
> > process is not running, then the failover works fine since no ports are
> > there for the client to connect to.
> >
> > I'm thinking that bdb failures will be rare once we solve the scanner
> > issue, but on a network that relies heavily on ldap, a failed bdb
> > backend with a running slapd would cause significant issues.
> >
> > Just trying to restart the slapd service doesn't fix the issue; a manual
> > recovery is required (slapd_db_recover).  I was curious whether anyone has
> > put something in place to deal with this potential issue.  Maybe run
> > slapd_db_status via cron and, if it errors due to bdb corruption, just
> > stop slapd and let the admin know.  At least the clients would be able
> > to fail over to the other ldap servers.  I guess an automated recovery is
> > possible via a script, but I'm not sure if that's a good idea.  Maybe
> > dealing with this type of failure is not really required; I was hoping
> > that some of you who have been doing this for a while would have some
> > insight.
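
(Roughly what I had in mind for the cron job is sketched below. It's
untested, and the init script name, probe DN, and alerting hook are guesses
for our RHEL5 boxes, so treat it as an outline rather than a recipe.)

#!/usr/bin/env python
# Cron sketch: probe the local slapd with a real search; if the port
# answers but the search fails (the dead-backend case), stop slapd so
# clients fail over, log it, and leave slapd_db_recover to a human.
# Assumptions: "service ldap" is the init script name on RHEL5 and the
# probe DN exists in the database; adjust both for your setup.
import subprocess
import sys
import syslog

PROBE_DN = "ou=People,o=example.com"   # assumed known-present entry

def probe_ok():
    cmd = ["ldapsearch", "-x", "-LLL", "-H", "ldap://127.0.0.1",
           "-b", PROBE_DN, "-s", "base", "-l", "10",
           "(objectClass=*)", "1.1"]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    out, _ = proc.communicate()
    return proc.returncode == 0 and "dn:" in out.decode("utf-8", "replace")

if probe_ok():
    sys.exit(0)

# Backend looks dead: take slapd off the network so clients fail over,
# then alert (hook Xymon/Nagios/mail in here as appropriate).
syslog.syslog(syslog.LOG_ERR,
              "slapd probe failed; stopping slapd pending manual recovery")
subprocess.call(["service", "ldap", "stop"])   # assumed RHEL5 init script name
sys.exit(1)

The point is just to get slapd off the network quickly so the clients fail
over; the actual slapd_db_recover run stays manual.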
> >
>