
Re: Strange hang scenario, resumes after idletimeout, but plenty of FDs available

Subject: Re: Strange hang scenario, resumes after idletimeout, but plenty of FDs available
From: David Hawes <dhawes@vt.edu>
Date: Thu, 02 Jun 2011 12:02:50 -0400
Cc: openldap-technical@openldap.org
In-reply-to: <4DE633E1.4090508@computer.org>
References: <4DE633E1.4090508@computer.org>
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.17) Gecko/20110424 Thunderbird/3.1.10

On 06/01/2011 08:43 AM, Kartik Subbarao wrote:

> I'm running into the following scenario. Shortly after slapd gets
> bombarded by a burst of operations (from several different clients) on
> existing connections (well under the max number of connections, about
> 3000 out of 16384), it suddenly hangs. It's not responsive to any new
> connections, and doesn't process operations on existing connections.
> Load average is near zero during this time, so it's not doing anything.
> After 20 minutes (idletimeout), slapd frees several connections (maybe
> say 1000), and resumes working again as if nothing happened.
>
> The load pattern that gets it into this state happens every hour, almost
> on the hour (most likely associated with nslcd and cron jobs, which
> we're looking to mitigate elsewise). Another strange thing is that slapd
> will survive one instance's worth of bombardment without hanging, but
> the *next* hour will go into a hang state.
>
> Are there any resources other than file descriptors that are freed up
> during the idletimeout processing? Are there any other parameters that
> can be tuned besides idletimeout here? Could it possibly be a case of
> deadlock somewhere, something grabbing all the locks? Would things like
> set_lk_max_locks be relevant to investigate here? Any log level settings
> that might reveal more of what's happening here?

I have noticed similar behavior on a handful of occasions with 2.4.23 and bdb-4.7.25p4.

When this happens, the last log entry I typically see is a search that misses the indexes (e.g. (mail=*a*)).

The server has the default idletimeout (disabled).

I have as yet been unable to force the hang, though I have not tried heavier loads with SLAMD. It has also been a while since I have seen this, so I do not have a stack trace handy.

I just wanted to add this anecdotal evidence of the hang. I hope at some point I'll be able to get a working stack trace. Of course, I should also try newer versions of OpenLDAP and BDB.

References:
- Strange hang scenario, resumes after idletimeout, but plenty of FDs available
  - From: Kartik Subbarao <subbarao@computer.org>
