8669 – Slapd service becomes unresponsive intermittently

Status:	VERIFIED SUSPENDED

Alias:	None

Product:	OpenLDAP
Classification:	Unclassified
Component:	slapd
Version:	2.4.39
Hardware:	All All

Importance:	--- normal
Target Milestone:	---
Assignee:	OpenLDAP project

Reported:	2017-06-06 19:39 UTC by jmestrada69@gmail.com
Modified:	2020-03-23 15:00 UTC
CC List:	0 users

Description jmestrada69@gmail.com 2017-06-06 19:39:39 UTC

Full_Name: JM Estrada
Version: 2.4.39
OS: RHEL Linux
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (209.136.235.13)


I am trying to determine if a problem we are having is a bug or some other issue
with OpenLDAP 2.4.39. 

We have two servers configured as a Master/Slave using syncrepl. Both servers
are running 2.4.39 and at random times, sometimes weeks apart, we are having
issues where the slapd service becomes unresponsive for a period of about 10 to
15 minutes.

When the problem occurs, we see numerous entries in the logs which show an
UNBIND and then a close entry. We then find that slapd is unresponsive and
will not accept any requests. At the same time, the CPU load for slapd skyrockets
to about 100% or very close to it. This lasts for about 10-15 minutes, and then
the server recovers by itself and begins responding to requests again.
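
One way to pin down when these UNBIND/close bursts start is to count such log entries per minute. The following is only a minimal sketch: it assumes syslog-style slapd lines of the form shown in the comment (as produced with stats logging), and the function names and threshold are hypothetical, not part of OpenLDAP.

```python
import re
from collections import Counter

# Assumed line format (syslog prefix + slapd stats logging), for example:
#   "Jun  6 13:39:01 ldap1 slapd[2211]: conn=1001 op=2 UNBIND"
#   "Jun  6 13:39:01 ldap1 slapd[2211]: conn=1001 fd=12 closed"
LINE_RE = re.compile(
    r"^(?P<minute>\w{3}\s+\d+\s+\d{2}:\d{2}):\d{2}\s+\S+\s+slapd\[\d+\]:\s+"
    r"conn=\d+\s+(?:op=\d+\s+UNBIND|fd=\d+\s+closed)"
)


def events_per_minute(lines):
    """Count UNBIND and 'closed' entries, keyed by 'Mon DD HH:MM'."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[match.group("minute")] += 1
    return counts


def bursts(lines, threshold):
    """Return the minutes (in first-seen order) with at least `threshold` events."""
    return [minute for minute, n in events_per_minute(lines).items() if n >= threshold]
```

Feeding the log file's lines to `bursts()` with a threshold well above the normal rate would list the minutes worth comparing against the CPU spikes.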

The problem is intermittent and doesn't seem to coincide with periods of heavy
use versus lower usage.

Is there a known bug with this version that could be causing this?

Comment 1 Quanah Gibson-Mount 2017-06-07 00:57:13 UTC

--On Tuesday, June 06, 2017 8:39 PM +0000 jmestrada69@gmail.com wrote:

> Is there a known bug with this version that could be causing this?

Hard to say.  It is 3.5 years old and 6 releases behind.  You don't state 
which backend you're using, which may be relevant as well.  There were 
known fragmentation issues with back-mdb in that release, for example, that 
could cause extensive pauses.  Without knowing significantly more about 
your system configuration, there's only a ton of speculation that can ensue.

You may want to see about using the builds from the LTB project 
(<http://ltb-project.org/wiki/download#openldap>), or if you require 
support for your deployment, Symas (my employer) offers packaged builds and 
various support options.

Regards,
Quanah

--

Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
<http://www.symas.com>

Comment 2 jmestrada69@gmail.com 2017-06-07 12:07:06 UTC

Quanah,

We are running the Berkeley DB back-end, "back-bdb" in the slapd.conf file.

Our server vendor did the upgrade to version 2.4.39 last year in April. In
asking them about upgrading to a newer version, as a potential fix, I was
told the last version in the RHEL repository that they can upgrade to is
2.4.40. I'm not certain that our vendor will support our choice to upgrade
to a newer version than what RHEL provides them in the repository, but if
it will fix our problem, I'll have to push the envelope on that matter.

Were there any known fragmentation issues with back-bdb in the 2.4.39
version that could also be causing these pauses?

Initially, when we started having problems with the pausing, the server
would go offline for about 15-20 minutes and then recover by itself. The
developers had initially set the idletimeout to 8 minutes (480), and we
also noted that rsyslogd was constantly logging entries about the slapd
service, stating that messages from the slapd PID were being lost
due to rate-limiting. Rate limiting is enabled by
default for rsyslog, so our vendor recommended turning it off. At the
same time, we scaled the idletimeout period back to 5
minutes (300). This seemed to aggravate the problem. With the original
settings, we would encounter this "pause" problem maybe once or twice in a
3 month period; since these changes were made we're seeing it
more frequently, although when it does pause it seems to last only about
10 minutes, where it was pausing for 15-20 before. We currently have the
logging level set to the recommended "256", but we're also considering
lowering the logging level.

Is it possible we have the idletimeout set too high and it should be
lowered? I'm wondering if there is some sweet-spot value for this
particular setting.

The reason our developers had it set so high was because, in the past they
used to run some really long reports. I'm pretty sure they do not run
these any longer.

I appreciate your feedback.

Thanks

Comment 3 Michael Ströder 2017-06-07 12:16:48 UTC

jmestrada69@gmail.com wrote:
> Our server vendor did the upgrade to version 2.4.39 last year in April. In
> asking them about upgrading to a newer version, as a potential fix, I was
> told the last version in the RHEL repository that they can upgrade to is
> 2.4.40.

They seem to recommend what is easiest for them, not what
would be the recommended choice for *you*. RHEL packages are heavily patched by Red Hat
and generally not recommended. The upstream developers cannot keep track of the current
patch state of RHEL packages.

=> You should kick out your server vendor from doing the OpenLDAP support.

Ciao, Michael.

Comment 4 jmestrada69@gmail.com 2017-06-07 14:44:00 UTC

Yes, I've reached out to our vendor about this. I am hoping we can sidestep the RHEL releases. Thanks for the info on this. 

Comment 5 Quanah Gibson-Mount 2017-06-07 16:27:02 UTC

--On Wednesday, June 07, 2017 7:07 AM -0600 Joaquin Estrada 
<jmestrada69@gmail.com> wrote:

> Quanah,
>
> We are running the Berkeley DB back-end, "back-bdb" in the slapd.conf
> file.
>
> Our server vendor did the upgrade to version 2.4.39 last year in April. In
> asking them about upgrading to a newer version, as a potential fix, I was
> told the last version in the RHEL repository that they can upgrade to is
> 2.4.40. I'm not certain that our vendor will support our choice to
> upgrade to a newer version than what RHEL provides them in the
> repository, but if it will fix our problem, I'll have to push the
> envelope on that matter.
>
> Were there any known fragmentation issues with back-bdb in the 2.4.39
> version that could also be causing these pauses?

No, back-bdb is not remotely the same as back-mdb.  However, I've no idea 
what options Red Hat compiles their BDB library with, and there were specific 
options that had an effect on OpenLDAP.  Generally, I would note that the 
back-bdb and back-hdb backends are deprecated at this point.

> Is it possible we have the idletimeout set too high and it should be
> lowered? I'm wondering if there is some sweet-spot value for this
> particular setting.

I generally leave it unset unless one is encountering an issue of running 
out of connections.  Generally, it would be fairly strange for idletimeout 
to affect things this way at all.  It simply drops idle connections based 
on the timer.  Disabling rate throttling in rsyslogd is a good idea, 
but may be unrelated as well. We've also seen cases with RHEL7 where Red Hat 
has set things up so that journald also gets all the syslog messages, which 
causes severe performance degradation.

You could spend some time seeing if you can isolate an exact cause.  For 
example, set loglevel to 0 and see if you still encounter the issue.  If 
you do, it is unrelated to syslog activity.

Another test would be to set idletimeout to 0.  If you still encounter the 
issue, it is unrelated to idle connections being dropped.  etc.
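
Before each of these tests it helps to confirm what the running configuration actually says. The sketch below lists every `loglevel` and `idletimeout` line found in slapd.conf-style text. It is illustrative only: the function name is made up, it assumes keywords are matched case-insensitively, and it treats lines starting with whitespace as continuations and skips them. It reports values as written and does not try to model how slapd combines repeated directives.

```python
def directive_values(conf_text, names=("loglevel", "idletimeout")):
    """Collect the argument strings of the named directives, in file order."""
    found = {name: [] for name in names}
    for raw in conf_text.splitlines():
        # Skip blank lines, comments, and continuation lines (leading whitespace).
        if not raw.strip() or raw[:1].isspace() or raw.startswith("#"):
            continue
        parts = raw.split(None, 1)
        key = parts[0].lower()
        if key in found:
            found[key].append(parts[1].strip() if len(parts) > 1 else "")
    return found
```

An empty list for a directive means it is not set in the text that was scanned, so the built-in default applies.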

As Michael noted, Red Hat builds are somewhat questionable as they make 
various changes to the code base that the OpenLDAP project has not 
reviewed.  Your issues may or may not be related to such a change; it's 
generally impossible to know.

Hope that helps.

--Quanah

Comment 6 Michael Ströder 2017-06-07 17:44:31 UTC

quanah@symas.com wrote:
>> Is it possible we have the idletimeout set too high and it should be
>> lowered? I'm wondering if there is some sweet-spot value for this
>> particular setting.
> 
> I generally leave it unset unless one is encountering an issue of running
> out of connections.  Generally, it would be fairly strange for idletimeout
> to affect things this way at all.

I generally recommend setting idletimeout, even somewhat tight, in case you don't have a
strictly defined set of clients, because a client application which does not use its LDAP
connection for ~5 min. is most of the time simply not closing connections. And running out of
file handles can affect all file creation on your system (e.g. creating BDB's transaction
log files).

Only the original poster can find out with monitoring.

One can find stale connections via back-monitor in the subtree
cn=Connections,cn=Monitor. IIRC the attribute 'monitorConnectionActivityTime' contains the last
client access time on this connection.
(Ummh, I have to add this to my own monitoring script...)
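
As a rough sketch of that check: assuming the monitorConnectionActivityTime values have already been fetched from cn=Connections,cn=Monitor by some other means, and assuming they are UTC timestamps of the form YYYYmmddHHMMSSZ, stale connections could be flagged as below. The function names and the five-minute cutoff in the usage note are illustrative.

```python
from datetime import datetime, timedelta, timezone


def parse_generalized_time(value):
    """Parse 'YYYYmmddHHMMSSZ' (UTC, no fractional seconds) into an aware datetime."""
    return datetime.strptime(value, "%Y%m%d%H%M%SZ").replace(tzinfo=timezone.utc)


def stale_connections(entries, now, max_idle):
    """entries maps a connection DN to its activity-time string.

    Returns (dn, idle_duration) pairs idle for longer than max_idle, longest first.
    """
    stale = []
    for dn, activity in entries.items():
        idle = now - parse_generalized_time(activity)
        if idle > max_idle:
            stale.append((dn, idle))
    return sorted(stale, key=lambda item: item[1], reverse=True)
```

Calling `stale_connections(entries, datetime.now(timezone.utc), timedelta(minutes=5))` would list the connections that an idletimeout of 300 would be expected to drop.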

And of course normal system monitoring of file handles would also be helpful.
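
For the file-handle side, a minimal Linux-only sketch is to count the entries under /proc/<pid>/fd for the slapd process. The helper name is hypothetical, and it returns None when the directory cannot be read (no such process, insufficient permissions, or no /proc at all).

```python
import os


def count_open_fds(pid):
    """Return the number of open file descriptors for pid, or None if unreadable."""
    try:
        return len(os.listdir(f"/proc/{pid}/fd"))
    except OSError:
        # Process gone, permission denied, or /proc not available on this system.
        return None
```

Sampling this value periodically and comparing it against the process's open-files limit would show whether slapd approaches descriptor exhaustion before a pause.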

Ciao, Michael.

(Keep repeating this mantra: monitoring, monitoring, monitoring, monitoring…)

Comment 7 Quanah Gibson-Mount 2017-06-07 17:58:28 UTC

--On Wednesday, June 07, 2017 8:44 PM +0200 Michael Ströder 
<michael@stroeder.com> wrote:

> quanah@symas.com wrote:
>>> Is it possible we have the idletimeout set too high and it should be
>>> lowered? I'm wondering if there is some sweet-spot value for this
>>> particular setting.
>>
>> I generally leave it unset unless one is encountering an issue of running
>> out of connections.  Generally, it would be fairly strange for
>> idletimeout to affect things this way at all.
>
> I generally recommend to set idletimeout even somewhat tight in case you
> don't have a strictly defined set of clients. Because a client
> application which does not use its LDAP connection for ~5 min. is most
> times simply not closing connections. And running out of file handles can
> affect all file creation on your system (e.g. creating BDB's transaction
> log files).

Yep, there can be poorly written clients out there.  I'd expect idletimeout 
to be completely unrelated, given its long-standing existence and use. ;)

--Quanah

Comment 8 Quanah Gibson-Mount 2020-03-23 15:00:22 UTC

back-bdb is deprecated.
Further information is needed to pursue this.