
(ITS#4387) slapd-ldap backend leaks descriptors on closed connections on x86_64



Full_Name: Aleksander Adamowski
Version: 2.2.29, 2.3.19
OS: Fedora Core 4
URL: http://infra.altkom.pl/openldap/slapd-ldap_descriptor_leak-configs.tar.gz
Submission from: (NULL) (85.128.15.81)


In our company, we use OpenLDAP for our main mail server (they are closely
integrated).

Some time ago, to work around BDB stability problems in OpenLDAP 2.1, we
devised a scheme in which 4 instances of slapd run on 2 physical machines,
2 instances per machine.

The BDB instance would store the actual database on disk using the slapd-bdb
backend, and would listen on non-standard ports (different than 389/636).

The LDAP instance would proxy the BDB instance and listen on the standard
ports (389 and 636). It would forward all queries to the local BDB instance
and, if that one doesn't respond, to another BDB instance on a backup machine
(using two values for the "uri" configuration attribute, separated with a
space).

The backup machine would have a similar scheme (an LDAP instance proxying a BDB
instance listening on a non-standard port).

Primary BDB would be replicated to the secondary BDB instance using slurpd.

All worked well, but after migrating the whole configuration from Fedora Core 1
running on x86 (an SMP Xeon system) to Fedora Core 4 running on x86_64 (an SMP
dual core Opteron system), we found that the LDAP instance, which uses only
the slapd-ldap backend, starts leaking descriptors from its client connections
to the BDB instance. When we switch to using the BDB instance directly, all is
OK (slapd-bdb doesn't leak any descriptors).

Once the slapd-ldap process reaches the 1024-descriptor limit (a static
per-process limit compiled into the kernel), it stops working and logs the
following errors to syslog:

slapd[31619]: daemon: 1026 beyond descriptor table size 1024
.....
slapd[31619]: daemon: 1027 beyond descriptor table size 1024
... etc.
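
To see when and how often the limit is hit, the syslog messages above can be
extracted mechanically. The following is only a minimal sketch: the function
name descriptor_overflows and the regular expression are illustrative
assumptions based on the message format quoted above, not part of slapd.

```python
import re

# Pattern modelled on the syslog lines quoted above (an assumption, not an
# official log format specification).
PATTERN = re.compile(
    r"slapd\[(\d+)\]: daemon: (\d+) beyond descriptor table size (\d+)"
)

def descriptor_overflows(log_lines):
    """Yield (pid, fd, table_size) for each overflow message found."""
    for line in log_lines:
        match = PATTERN.search(line)
        if match:
            yield tuple(int(group) for group in match.groups())

sample = [
    "slapd[31619]: daemon: 1026 beyond descriptor table size 1024",
    "slapd[31619]: daemon: 1027 beyond descriptor table size 1024",
    "slapd[16859]: conn=710 fd=999 closed",  # unrelated line, ignored
]
print(list(descriptor_overflows(sample)))
```

Lines that don't match the pattern are skipped, so a whole syslog file can be
fed in line by line.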

All global system limits on open file descriptors, as well as the per-user
limits, are set well above 100000 descriptors, but the 1024 limit is compiled
in. Besides, raising it makes no sense, since slapd-ldap will allocate as many
descriptors as it can.

Concerning a similar problem, I've read this thread on the mailing list:
http://www.openldap.org/lists/openldap-software/200303/msg00865.html

Following that thread, I've set the idle connection timeout for the slapd-ldap
instance to as low as 16 seconds, but something is still wrong. Long after all
clients have been switched to the BDB instance on the non-standard port, and
with no queries sent to the LDAP instance for several minutes, it still holds
the descriptors for those connections and doesn't free them. It should close
them after 16 seconds of inactivity, but it doesn't. Those connections are
still established, as shown in the output of "netstat -anp" (port 392 is the
listening port of the BDB instance):

tcp        0      0 127.0.0.1:43189     127.0.0.1:392     ESTABLISHED 17220/slapd
tcp        0      0 127.0.0.1:43188     127.0.0.1:392     ESTABLISHED 17220/slapd
tcp        0      0 127.0.0.1:43193     127.0.0.1:392     ESTABLISHED 17220/slapd
tcp        0      0 127.0.0.1:43195     127.0.0.1:392     ESTABLISHED 17220/slapd
tcp        0      0 127.0.0.1:43197     127.0.0.1:392     ESTABLISHED 17220/slapd
tcp        0      0 127.0.0.1:43196     127.0.0.1:392     ESTABLISHED 17220/slapd
tcp        0      0 127.0.0.1:43198     127.0.0.1:392     ESTABLISHED 17220/slapd
tcp        0      0 127.0.0.1:43137     127.0.0.1:392     ESTABLISHED 17220/slapd
tcp        0      0 127.0.0.1:43139     127.0.0.1:392     ESTABLISHED 17220/slapd
tcp        0      0 127.0.0.1:43138     127.0.0.1:392     ESTABLISHED 17220/slapd
tcp        0      0 127.0.0.1:43141     127.0.0.1:392     ESTABLISHED 17220/slapd
tcp        0      0 127.0.0.1:43140     127.0.0.1:392     ESTABLISHED 17220/slapd
tcp        0      0 127.0.0.1:43143     127.0.0.1:392     ESTABLISHED 17220/slapd
tcp        0      0 127.0.0.1:43142     127.0.0.1:392     ESTABLISHED 17220/slapd
tcp        0      0 127.0.0.1:43147     127.0.0.1:392     ESTABLISHED 17220/slapd
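
Counting these connections by hand gets tedious, so here is a rough sketch of
how the "netstat -anp" output could be tallied per process. The function name
count_established is hypothetical, the column layout is assumed to be the one
shown above, and the row with PID 17100 in the sample is made up to show that
connections in the opposite direction are not counted.

```python
from collections import Counter

def count_established(netstat_text, backend_port):
    """Count ESTABLISHED connections to backend_port, keyed by 'PID/program'."""
    # netstat may wrap long rows; glue continuation lines back onto the
    # preceding row, which starts with the protocol name.
    rows = []
    for line in netstat_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(("tcp", "udp")) or not rows:
            rows.append(line)
        else:
            rows[-1] += " " + line

    counts = Counter()
    for row in rows:
        fields = row.split()
        # Assumed columns: proto recv-q send-q local foreign state pid/program
        if len(fields) < 7 or fields[5] != "ESTABLISHED":
            continue
        foreign_port = fields[4].rsplit(":", 1)[-1]
        if foreign_port == str(backend_port):
            counts[fields[6]] += 1
    return counts

sample = """\
tcp        0      0 127.0.0.1:43189             127.0.0.1:392
ESTABLISHED 17220/slapd
tcp        0      0 127.0.0.1:43188     127.0.0.1:392     ESTABLISHED 17220/slapd
tcp        0      0 127.0.0.1:392       127.0.0.1:43188   ESTABLISHED 17100/slapd
"""
print(count_established(sample, 392))
```

Running it periodically against fresh netstat output would show whether the
number of connections held by the proxy ever goes down.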


This can also be seen by looking up the PID of the slapd process of the
slapd-ldap instance and looking into its /proc/<PID>/fd. There are 1024 entries
there, and they don't disappear even after several minutes, only after a slapd
restart.
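
The /proc check can be scripted as well. This is a minimal sketch assuming a
Linux /proc layout in which each entry of /proc/<PID>/fd is a symlink and
sockets appear as "socket:[inode]"; the function name count_descriptors and
the proc_root parameter are illustrative only. Reading another user's process
normally requires root.

```python
import os

def count_descriptors(pid, proc_root="/proc"):
    """Return (total, sockets) for the open descriptors of process `pid`."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    total = 0
    sockets = 0
    for name in os.listdir(fd_dir):
        total += 1
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # Descriptor was closed between listdir() and readlink().
            continue
        if target.startswith("socket:"):
            sockets += 1
    return total, sockets

# Demo on the current process, only where a Linux-style /proc is available.
if os.path.isdir("/proc/self/fd"):
    print(count_descriptors("self"))
```

For the case described here, it would be called with the PID of the slapd-ldap
instance, for example count_descriptors(17220).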

So I've concluded that there's something wrong with the slapd-ldap instance:

1) While operating, the slapd-ldap connection caching logic (described in the
slapd-ldap manpage) doesn't work properly: it opens many more connections to
the proxied slapd-bdb instance than would be needed if cached connections were
shared. It seems connections aren't reused. There definitely aren't more than
30 simultaneous queries being executed, and most of the time there are fewer
than 10.

2) After queries stop coming in to the slapd-ldap instance, it doesn't time out
idle connections to the proxied slapd-bdb instance after the configured 16
seconds. They are kept open for at least several minutes, apparently
indefinitely, until the instance is forcibly stopped.

3) If the number of connections hits the per-process 1024 descriptor limit and
tries to exceed it, there's also a problem when the slapd-ldap instance
terminates. When it receives the TERM signal, it gets stuck waiting for some
threads to terminate:

....
Jan 30 10:37:35 nmail slapd[16859]: conn=710 fd=999 closed 
Jan 30 10:37:35 nmail slapd[16859]: conn=714 fd=1003 closed 
Jan 30 10:37:35 nmail slapd[16859]: conn=717 fd=1008 closed 
Jan 30 10:37:35 nmail slapd[16859]: conn=723 fd=1010 closed 
Jan 30 10:37:35 nmail slapd[16859]: conn=727 fd=1015 closed 
Jan 30 10:37:35 nmail slapd[16859]: conn=728 fd=1017 closed 
Jan 30 10:37:35 nmail slapd[16859]: conn=730 fd=1019 closed 
Jan 30 10:37:35 nmail slapd[16859]: slapd shutdown: waiting for 45 threads to terminate

It doesn't exit until forcibly killed with SIGKILL.

A package with our config files (with anonymized passwords and addresses) is
available under the URL provided with this ITS.