Re: Strange hang scenario, resumes after idletimeout, but plenty of FDs available



On 06/01/2011 08:43 AM, Kartik Subbarao wrote:
> I'm running into the following scenario. Shortly after slapd gets
> bombarded by a burst of operations (from several different clients) on
> existing connections (well under the max number of connections, about
> 3000 out of 16384), it suddenly hangs. It's not responsive to any new
> connections, and doesn't process operations on existing connections.
> Load average is near zero during this time, so it's not doing anything.
> After 20 minutes (idletimeout), slapd frees several connections (maybe
> say 1000), and resumes working again as if nothing happened.
>
> The load pattern that gets it into this state happens every hour, almost
> on the hour (most likely associated with nslcd and cron jobs, which
> we're looking to mitigate elsewise). Another strange thing is that slapd
> will survive one instance's worth of bombardment without hanging, but
> the *next* hour will go into a hang state.
>
> Are there any resources other than file descriptors that are freed up
> during the idletimeout processing? Are there any other parameters that
> can be tuned besides idletimeout here? Could it possibly be a case of
> deadlock somewhere, something grabbing all the locks? Would things like
> set_lk_max_locks be relevant to investigate here? Any log level settings
> that might reveal more of what's happening here?

I have noticed similar behavior on a handful of occasions with OpenLDAP 2.4.23 and BDB 4.7.25p4.

When this happens, the last log entry I typically see is a search that misses the indexes (e.g. (mail=*a*)).

The server has the default idletimeout (disabled).

I have not yet been able to reproduce the hang on demand, though I have not tried heavier loads with SLAMD. It has also been a while since I last saw it, so I do not have a stack trace handy.

I just wanted to add this anecdotal evidence of the hang. I hope at some point I'll be able to capture a usable stack trace. Of course, I should also try newer versions of OpenLDAP and BDB.