Re: Dealing with BDB Crash
Interesting and good information. We happen to use Big Brother/Xymon
for monitoring and have multiple scripts to check things like cache,
locks, etc. We will get notified when these sense a problem, but at 1AM
on a Saturday, getting notified and fixing the issue before all those
services get impacted is a little scary. That's why we were
contemplating that maybe it would be wise to "hit it with a hammer"
until we are able to intervene and repair.
On Wed, 30 Mar 2011 18:43 -0400, "Aaron Richton" wrote:
> So, the best defense is a good offense in this case, and if you were
> running 2.4.25 with the appropriate BerkeleyDB library you'd likely not
> see an issue of this nature.
> With that said, there was a time (with earlier releases of OpenLDAP) when
> we had this issue (one bdb going down, with the service apparently working
> via an overly simple smoke test). Not being fans of being bitten by the
> same failure mode twice, we wrote up a Nagios check that searches for a
> known-present-on-disk entry in each of our databases. (You can
> either create one, or (ab)use "ou=People" if you're RFC2307, or use
> "cn=Manager" or what have you...) If any database doesn't return a hit,
> it's time for us to get a call.
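A probe along the lines Aaron describes might be sketched roughly as below. This is an illustrative sketch, not his actual check; the host URL and the canary DN are hypothetical placeholders, and the exit codes follow the usual Nagios convention (0 = OK, 2 = CRITICAL).

```shell
#!/bin/sh
# Sketch of a known-entry liveness probe of the kind described above.
# The host URL and canary DN below are made-up placeholders.
check_canary() {
    host="$1"; base="$2"
    # -x: simple bind; -s base: fetch exactly the canary entry;
    # "1.1" requests no attributes -- existence is all we care about.
    if ldapsearch -x -H "$host" -b "$base" -s base "(objectClass=*)" 1.1 \
            >/dev/null 2>&1; then
        echo "OK: $base present on $host"
        return 0     # Nagios OK
    else
        echo "CRITICAL: no hit for $base on $host"
        return 2     # Nagios CRITICAL -- time for a call
    fi
}

# Example invocation (hypothetical values), one per database suffix:
# check_canary "ldap://ldap1.example.com" "ou=People,o=example.com"
```

Running one such probe per database suffix catches the "one backend down, slapd still listening" failure mode that a simple port check misses.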
> As an aside, I find the timing thoroughly fascinating. Not that it'll
> make you feel any better in the present case, but I was just considering
> writing something up for the next LDAPCon on how we do monitoring (there
> are ~10 angles we check from, many of them due to real-life situations
> similar to yours). They're all relatively simple ideas like the above;
> I suppose cleaning up our code to the point where it's world-safe and
> getting something written up on it may be worthwhile. The checks have
> proven occasionally useful for slapd(8) code issues and also, more
> frequently, in the face of human factors.
> On Wed, 30 Mar 2011, firstname.lastname@example.org wrote:
> > A while ago I posted that we were having what we thought were random bdb
> > backend crashes with the following in our log:
> > bdb(o=example.com): PANIC: fatal region error detected; run recovery.
> > This was on our RH5 OpenLDAP servers (2.3.43) that we were
> > rebuilding.
> > It appears that the crashes were caused by a vulnerability scanner that
> > was hitting the server (still testing), even though it was supposed to
> > be safe. We'll have to investigate what is causing it; maybe we will
> > need an ACL to stop whatever the scanner is doing. Once we stopped the
> > automated scan, the servers seem to be running as expected.
> > But this brought up another issue. When the bdb backend failed, the
> > slapd process continued to run and listen on the LDAP ports, and clients
> > still tried to connect to the failed server for authentication. The
> > server accepted and established the connection with the client. Of
> > course, the client could not authenticate since the backend db was down.
> > The client will not fail over to the other server that is listed in its
> > ldap.conf file, since it thinks it has a valid connection. If the slapd
> > process is not running, then the failover works fine since no ports are
> > there for the client to connect to.
> > I'm thinking that bdb failures will be rare once we solve the scanner
> > issue, but on a network that relies heavily on LDAP, a failed bdb
> > backend with a running slapd would cause significant issues.
> > Just trying to restart the slapd service doesn't fix the issue; a manual
> > recovery is required (slapd_db_recover). I was curious if anyone has
> > put something in place to deal with this potential issue. Maybe run
> > slapd_db_status via cron and, if it errors due to bdb corruption, just
> > stop slapd and let the admin know. At least the clients would be able
> > to fail over to the other LDAP servers. I guess an automated recovery is
> > possible via a script, but I'm not sure if that's a good idea. Maybe
> > dealing with this type of failure is not really required; I was hoping
> > that some of you who have been doing this for a while would have some
> > insight.
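The cron-based "stop slapd and let the admin know" idea floated above might look something like the sketch below. The database directory, the service name, and the reliance on db_stat(1) exiting nonzero on a panicked environment are all assumptions to adapt locally (RH systems ship slapd_db_* wrapper equivalents of the stock BDB tools).

```shell
#!/bin/sh
# Sketch of the cron safety net discussed above: if the BDB environment
# reports a panic, stop slapd so clients fail over to the other servers
# in their ldap.conf, and notify an admin. The db directory, db_stat(1)
# exit behavior, and service name are assumptions.

bdb_env_ok() {
    # "db_stat -e" prints environment statistics; on a panicked
    # environment it fails ("PANIC: fatal region error detected; run
    # recovery"), so a nonzero exit is treated as the corruption signal.
    db_stat -h "$1" -e >/dev/null 2>&1
}

hammer() {
    # Deliberately no automated slapd_db_recover here: take slapd down,
    # let client failover work, and leave recovery to a human.
    service ldap stop
    echo "BDB panic in $1; slapd stopped" | mail -s "slapd taken down" root
}

main() {
    dbdir="${1:-/var/lib/ldap}"   # hypothetical default path
    if bdb_env_ok "$dbdir"; then
        echo "bdb ok"
    else
        hammer "$dbdir"
        echo "bdb panic"
    fi
}

# Intended to run from cron, e.g.:
# */5 * * * * root /usr/local/sbin/bdb-watchdog.sh /var/lib/ldap
```

Stopping slapd rather than auto-running recovery matches the poster's instinct: failover keeps clients working, and a human decides when the database is safe to bring back.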