
Strange hang scenario, resumes after idletimeout, but plenty of FDs available



I'm running into the following scenario. Shortly after slapd is hit by a burst of operations from several different clients on existing connections (about 3000 open, well under the maximum of 16384), it suddenly hangs. It does not respond to new connections and does not process operations on existing ones. Load average is near zero during this time, so it does not appear to be doing any work. After 20 minutes (the idletimeout), slapd frees a number of connections (roughly 1000) and resumes working as if nothing had happened.

The load pattern that gets it into this state occurs every hour, almost exactly on the hour (most likely caused by nslcd and cron jobs, which we're looking to mitigate separately). Another strange thing is that slapd survives one such burst without hanging, but then hangs on the *next* hour's burst.
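
To pin down exactly when each hang starts and ends relative to the hourly burst and the idletimeout, a small external probe along these lines could be run next to slapd. This is only a sketch: the host, port, timeout, polling interval, and the function names (probe, hang_windows, monitor) are assumptions for illustration, not part of the setup described above. It sends a raw anonymous LDAPv3 bind rather than just opening a TCP connection, since a hung server may still have connections accepted by the kernel's listen queue.

```python
import socket
import time

# Anonymous LDAPv3 simple bind request, BER-encoded (messageID 1).
ANON_BIND = bytes.fromhex("300c020101600702010304008000")


def probe(host="localhost", port=389, timeout=5.0):
    """Return True if the server answers an anonymous bind within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            sock.sendall(ANON_BIND)
            reply = sock.recv(64)
            # Every LDAPMessage starts with a BER SEQUENCE tag (0x30).
            return len(reply) > 0 and reply[0] == 0x30
    except OSError:
        # Covers refused connections, resets, and timeouts.
        return False


def hang_windows(samples):
    """Collapse (timestamp, ok) samples into (start, end) spans of failed probes."""
    windows = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                      # first failed probe opens a window
        elif ok and start is not None:
            windows.append((start, ts))     # first good probe closes it
            start = None
    if start is not None:
        # Still failing at the end of the run: close at the last sample.
        windows.append((start, samples[-1][0]))
    return windows


def monitor(duration, interval=10.0, host="localhost", port=389):
    """Probe every `interval` seconds for `duration` seconds; return hang windows."""
    samples = []
    end = time.time() + duration
    while time.time() < end:
        samples.append((time.time(), probe(host, port)))
        time.sleep(interval)
    return hang_windows(samples)
```

Calling something like monitor(2 * 3600) across two hourly bursts would return a list of (start, end) timestamps that can be lined up against the cron schedule and the 20-minute idletimeout.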

Are there any resources other than file descriptors that are freed up during idletimeout processing? Are there any other parameters that can be tuned here besides idletimeout? Could this be a deadlock somewhere, with something grabbing all the locks? Would settings like set_lk_max_locks be relevant to investigate? Are there any log level settings that might reveal more of what's happening?

Thanks for any suggestions on things to look at and try.

	-Kartik