6322 – slapd suddenly stops working, and starts using 100% CPU

Issue 6322 - slapd suddenly stops working, and starts using 100% CPU

Summary: slapd suddenly stops working, and starts using 100% CPU

Status:	VERIFIED FIXED

Alias:	None

Product:	OpenLDAP
Classification:	Unclassified
Component:	slapd (show other issues)
Version:	unspecified
Hardware:	All All

Importance:	--- normal
Target Milestone:	---
Assignee:	OpenLDAP project

URL:
Keywords:

Depends on:
Blocks:

Reported:	2009-10-02 09:50 UTC by helge@monsternett.no
Modified:	2017-03-27 23:51 UTC (History)
CC List:	0 users

See Also:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this issue.

Description helge@monsternett.no 2009-10-02 09:50:15 UTC

Full_Name: Helge Milde
Version: slapd 2.4.11
OS: Debian 5
URL: 
Submission from: (NULL) (81.175.2.164)


Hello,

In the last two days, slapd has randomly stopped working on our server, about
one time every two or three hours. Before this, slapd had been running without
any problems for about two months. slapd.conf has not been touched since then.

What happens is that slapd stops taking queries (postfix, dovecot, NSS etc.
starts getting timeouts), and it starts using exactly 100% of one core until we
restart it. 
After restart, everything works again (until this happens again).

It seems to only happen when slapd is under "stress" (we have an 8-core system
with 8G RAM, and so slapd is really never under much "stress"; the biggest CPU
load I've seen is about 40% on one core). The last ldap search before slapd
stops working is always the same:
Oct  2 10:57:38 oi-mail slapd[24550]: conn=1543 op=1 SRCH
base="ou=groups,dc=web,dc=xxxx,dc=net" scope=1 deref=0
filter="(&(objectClass=posixGroup))"
.. which should return about 1800 entries. Different connections might query
this simultaneously, and it's a query that happens every ~10 seconds on
average.

I've tried to reproduce this by doing ..
# while [ 1 ]; do ldapsearch -x -H ldaps://.. -b
'ou=groups,dc=web,dc=xxxx,dc=net' '(&(objectClass=posixAccount))'; done
.. for two minutes without any "luck" :). slapd just uses a bit more CPU, but
never stops working.

With "loglevel stats stats2", I get nothing out of the logs:
Oct  2 08:42:15 oi-mail slapd[15836]: conn=34798 op=1 ENTRY
dn="cn=xxxx,ou=groups,dc=web,dc=xxxx,dc=net"
... time of "crash" (we're mid-search at this point); nothing gets logged until
we do a restart: ... 
Oct  2 08:42:30 oi-mail slapd[15836]: daemon: shutdown requested and initiated.
Oct  2 08:42:30 oi-mail slapd[15836]: conn=34391 fd=31 closed (slapd shutdown)
... many entires like the one above, and then I seem to be getting the next
entry from the same connection, right before it closes: ... 
Oct  2 08:42:30 oi-mail slapd[15836]: conn=34798 op=1 ENTRY
dn="cn=xxxx,ou=groups,dc=web,dc=xxxx,dc=net"
Oct  2 08:42:30 oi-mail slapd[15836]: conn=34798 fd=81 closed (slapd shutdown)

I've tried logging more than just stats, but the interesting loglevels also
causes a big performance hit (trace and args), and I'm not really sure which of
the other loglevels could be helpful.
This is a live ldap server, on a server that also hosts other services, so I
simply can't take the performance hit.

I also did a strace -p <slapd pid>, which only gives "futex(0x41c499e0,
FUTEX_WAIT," until I restart slapd, but the "restarting routine" could be
useful, so here it is: http://www.pastie.org/639170

I understand that this probably isn't enough data to see what the problem is,
and so I'm hoping to get some help in debugging this properly. As I've said,
args/trace loglevel takes too much of a performance hit, but if this is the only
way, we might reconcider.

Interesting parts of slapd.conf below.
We use slapd.conf instead of cn=config, everything goes over TLS, we're AFAIK
only using LDAPv3 binding, and we only have one database entry in slapd.conf.

moduleload  back_hdb
sizelimit 99999
tool-threads 8
threads 16
concurrency 32
backend     hdb 
database        hdb
cachesize 60000
dbconfig set_cachesize 0 52428800 0
dbconfig set_lk_max_objects 1500
dbconfig set_lk_max_locks 1500
dbconfig set_lk_max_lockers 1500
lastmod         on
checkpoint      512 30

Best regards,
Helge Milde

Comment 1 Howard Chu 2009-10-02 20:54:13 UTC

helge@monsternett.no wrote:
> Full_Name: Helge Milde
> Version: slapd 2.4.11
> OS: Debian 5
> URL:
> Submission from: (NULL) (81.175.2.164)
>
>
> Hello,
>
> In the last two days, slapd has randomly stopped working on our server, about
> one time every two or three hours. Before this, slapd had been running without
> any problems for about two months. slapd.conf has not been touched since then.
>
> What happens is that slapd stops taking queries (postfix, dovecot, NSS etc.
> starts getting timeouts), and it starts using exactly 100% of one core until we
> restart it.
> After restart, everything works again (until this happens again).
>
> It seems to only happen when slapd is under "stress" (we have an 8-core system
> with 8G RAM, and so slapd is really never under much "stress"; the biggest CPU
> load I've seen is about 40% on one core). The last ldap search before slapd
> stops working is always the same:
> Oct  2 10:57:38 oi-mail slapd[24550]: conn=1543 op=1 SRCH
> base="ou=groups,dc=web,dc=xxxx,dc=net" scope=1 deref=0
> filter="(&(objectClass=posixGroup))"
> .. which should return about 1800 entries. Different connections might query
> this simultaneously, and it's a query that happens every ~10 seconds on
> average.
>
> I've tried to reproduce this by doing ..
> # while [ 1 ]; do ldapsearch -x -H ldaps://.. -b
> 'ou=groups,dc=web,dc=xxxx,dc=net' '(&(objectClass=posixAccount))'; done
> .. for two minutes without any "luck" :). slapd just uses a bit more CPU, but
> never stops working.
>
> With "loglevel stats stats2", I get nothing out of the logs:
> Oct  2 08:42:15 oi-mail slapd[15836]: conn=34798 op=1 ENTRY
> dn="cn=xxxx,ou=groups,dc=web,dc=xxxx,dc=net"
> ... time of "crash" (we're mid-search at this point); nothing gets logged until
> we do a restart: ...
> Oct  2 08:42:30 oi-mail slapd[15836]: daemon: shutdown requested and initiated.
> Oct  2 08:42:30 oi-mail slapd[15836]: conn=34391 fd=31 closed (slapd shutdown)
> ... many entires like the one above, and then I seem to be getting the next
> entry from the same connection, right before it closes: ...
> Oct  2 08:42:30 oi-mail slapd[15836]: conn=34798 op=1 ENTRY
> dn="cn=xxxx,ou=groups,dc=web,dc=xxxx,dc=net"
> Oct  2 08:42:30 oi-mail slapd[15836]: conn=34798 fd=81 closed (slapd shutdown)
>
> I've tried logging more than just stats, but the interesting loglevels also
> causes a big performance hit (trace and args), and I'm not really sure which of
> the other loglevels could be helpful.
> This is a live ldap server, on a server that also hosts other services, so I
> simply can't take the performance hit.
>
> I also did a strace -p<slapd pid>, which only gives "futex(0x41c499e0,
> FUTEX_WAIT," until I restart slapd, but the "restarting routine" could be
> useful, so here it is: http://www.pastie.org/639170

strace is useless. Use gdb and get a trace of all running threads when this 
occurs.

http://www.openldap.org/faq/data/cache/59.html

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Comment 2 Quanah Gibson-Mount 2009-10-02 21:27:54 UTC

--On Friday, October 02, 2009 8:54 PM +0000 hyc@symas.com wrote:


> strace is useless. Use gdb and get a trace of all running threads when
> this  occurs.
>
> http://www.openldap.org/faq/data/cache/59.html


You also need to test against a current release of OpenLDAP, like OpenLDAP 
2.4.18.

--Quanah


--

Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc
--------------------
Zimbra ::  the leader in open source messaging and collaboration

Comment 3 Howard Chu 2009-11-11 00:08:23 UTC

changed notes
changed state Open to Closed

Comment 4 helge@monsternett.no 2009-12-10 09:44:38 UTC

I have now installed openldap 2.4.19 from source (default configuration except for --enable-crypt=yes).

# slapd -V
@(#) $OpenLDAP: slapd 2.4.19 (Dec  9 2009 22:46:15) $

But unfortunately, the bug is still there. Nothing has changed at all..
It has crashed 10 times since I did the migration 10 hours ago. It usually crashes 1-2 times at night, and 10-20 times during work hours (when the servers have more load).

When I migrated, I recreated the database by typing:
# oldbin/slapcat > db.ldif
# newbin/slapadd < db.ldif # (under the new openldap/slapd user)
So I can't imagine the database can be corrupt in any way, which I initially thought when I first sent this bug report.

It still dies on the same queries as before, in the middle of iterating through ou=users,.. or ou=groups,.., which both have ~1.8k entries. These queries comes from proftpd, which does a "getent passwd" and "getent group" every time a customer logs in.
I tried to reproduce this manually again, launching 8 processes, constantly querying ou=users and ou=groups, 2 using a unix socket locally and 6 using ldaps://hostname/ remotely, but it still won't break down when I do that. This causes *alot* more load than proftpd does, but the crashes seem to only happen "randomly".

Any suggestions on what I can do to figure out this problem? The LDAP server is live and in use in two servers with 2-3k users, so I can't mess with it *too* much.

On Fri, Oct 02, 2009 at 09:28:25PM +0000, quanah@zimbra.com wrote:
>--On Friday, October 02, 2009 8:54 PM +0000 hyc@symas.com wrote:
>
>
>> strace is useless. Use gdb and get a trace of all running threads when
>> this  occurs.
>>
>> http://www.openldap.org/faq/data/cache/59.html
>
>
>You also need to test against a current release of OpenLDAP, like OpenLDAP
>2.4.18.
>
>--Quanah
>
>
>--
>
>Quanah Gibson-Mount
>Principal Software Engineer
>Zimbra, Inc
>--------------------
>Zimbra ::  the leader in open source messaging and collaboration
>
>

-- 
Helge Milde, 69701808
www.monsternett.no

Comment 5 Quanah Gibson-Mount 2009-12-10 16:21:20 UTC

--On Thursday, December 10, 2009 10:44 AM +0100 Helge Milde 
<helge@monsternett.no> wrote:

> I have now installed openldap 2.4.19 from source (default configuration
> except for --enable-crypt=yes).
>
># slapd -V
> @(#) $OpenLDAP: slapd 2.4.19 (Dec  9 2009 22:46:15) $
>
> But unfortunately, the bug is still there. Nothing has changed at all..
> It has crashed 10 times since I did the migration 10 hours ago. It
> usually crashes 1-2 times at night, and 10-20 times during work hours
> (when the servers have more load).

Well, now 2.4.20 is current (just to note).

What we'll need for resolving your issue is:

(a) Your configuration files (minus passwords)
(b) A backtrace from gdb of a crash.  Make sure you build OpenLDAP with 
debugging symbols (using CFLAGS of -g) and when you run make install, make 
sure you use STRIP="", like:

make install DESTDIR=/my/path/to/install STRIP=""

Thanks,
Quanah


--

Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc
--------------------
Zimbra ::  the leader in open source messaging and collaboration

Comment 6 helge@monsternett.no 2009-12-11 10:46:46 UTC

On Thu, Dec 10, 2009 at 08:21:20AM -0800, Quanah Gibson-Mount wrote:
>Well, now 2.4.20 is current (just to note).

Right. I'm now running this version.
The results below are with 2.4.20, and not 2.4.19.

>What we'll need for resolving your issue is:
>
>(a) Your configuration files (minus passwords)

# egrep -v '^#|^$' etc/openldap/slapd.conf
TLSCertificateFile <removed>.crt
TLSCertificateKeyFile <removed>.key
TLSCACertificateFile <removed>.ca
include /usr/local/ldap/etc/openldap/schema/core.schema
include /usr/local/ldap/etc/openldap/schema/cosine.schema
include /usr/local/ldap/etc/openldap/schema/nis.schema
include /usr/local/ldap/etc/openldap/schema/qmail.schema
include /usr/local/ldap/etc/openldap/schema/inetorgperson.schema
include /usr/local/ldap/etc/openldap/schema/mod_vhost_ldap.schema
include /usr/local/ldap/etc/openldap/schema/monsternett_vhost_ldap.schema
include /usr/local/ldap/etc/openldap/schema/sql_user.schema
include /usr/local/ldap/etc/openldap/schema/using_space.schema
pidfile         /usr/local/ldap/var/run/slapd/slapd.pid
argsfile        /usr/local/ldap/var/run/slapd/slapd.args
loglevel stats
modulepath      /usr/local/ldap/lib
moduleload      back_hdb
sizelimit 99999
tool-threads 8
threads 16
concurrency 32
backend         hdb
database        hdb
suffix          "dc=<removed>"
directory       "/usr/local/ldap/var/openldap-data"
cachesize 60000
dbconfig set_cachesize 0 52428800 0
dbconfig set_lk_max_objects 1500
dbconfig set_lk_max_locks 1500
dbconfig set_lk_max_lockers 1500
index objectclass,sn eq
index cn,uniqueMember,memberUid,uidNumber,gidNumber pres,eq
index uid,mail,associatedDomain,mailAlternateAddress,apacheServerName,apacheServerAlias,apacheSuexecUid pres,eq,sub
lastmod         on
checkpoint      512 30
access to attrs=userPassword,shadowLastChange
         by dn="cn=admin,dc=<removed>" write
         by anonymous auth
         by self write
         by * none
access to dn.base="" by * read
access to *
         by dn="cn=admin,dc=<removed>" write
         by * read


>(b) A backtrace from gdb of a crash.

Since this is a *live* server, I don't want to run it through gdb if I don't have to. Also, the server doesn't really crash, it just hangs, using 100% of one CPU core.
We currently have a script that checks if ldapsearch succeeds, and if not, it restarts slapd. AFAIK, if we would have done this via gdb directly, the ports would still be open, and slapd wouldn't be able to start, which means I would have to sit there for an hour to quickly be able to run a backtrace, so slapd could be restarted instantly :)
Anyway, instead, I made it dump a core (by sending killall -11 slapd in my script). I hope this is sufficient.

--- gdb core bt start ---
(gdb) thread apply all bt

Thread 18 (process 7691):
#0  0x00007ff212f77b78 in epoll_wait () from /lib/libc.so.6
#1  0x000000000041c618 in slapd_daemon_task (ptr=<value optimized out>) at daemon.c:2465
#2  0x00007ff214c4ffc7 in start_thread () from /lib/libpthread.so.0
#3  0x00007ff212f775ad in clone () from /lib/libc.so.6
#4  0x0000000000000000 in ?? ()

Thread 17 (process 7692):
#0  0x00007ff214c53d29 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x000000000042efc5 in send_ldap_ber (op=<value optimized out>, ber=0x410f6670) at result.c:225
#2  0x0000000000430d00 in slap_send_search_entry (op=0x2732490, rs=0x41277ca0) at result.c:1260
#3  0x000000000049df00 in hdb_search (op=0x2732490, rs=0x41277ca0) at search.c:961
#4  0x0000000000421491 in fe_op_search (op=0x2732490, rs=0x41277ca0) at search.c:366
#5  0x0000000000421c8c in do_search (op=0x2732490, rs=0x41277ca0) at search.c:217
#6  0x000000000041efff in connection_operation (ctx=0x41277df0, arg_v=<value optimized out>) at connection.c:1109
#7  0x000000000041fcdd in connection_read_thread (ctx=0x41277df0, argv=<value optimized out>) at connection.c:1245
#8  0x00000000004eae20 in ldap_int_thread_pool_wrapper (xpool=<value optimized out>) at tpool.c:685
#9  0x00007ff214c4ffc7 in start_thread () from /lib/libpthread.so.0
#10 0x00007ff212f775ad in clone () from /lib/libc.so.6
#11 0x0000000000000000 in ?? ()

Thread 16 (process 7694):
#0  0x00007ff214c53d29 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x000000000042efc5 in send_ldap_ber (op=<value optimized out>, ber=0x418f7670) at result.c:225
#2  0x0000000000430d00 in slap_send_search_entry (op=0x285ba50, rs=0x41a78ca0) at result.c:1260
#3  0x000000000049df00 in hdb_search (op=0x285ba50, rs=0x41a78ca0) at search.c:961
#4  0x0000000000421491 in fe_op_search (op=0x285ba50, rs=0x41a78ca0) at search.c:366
#5  0x0000000000421c8c in do_search (op=0x285ba50, rs=0x41a78ca0) at search.c:217
#6  0x000000000041efff in connection_operation (ctx=0x41a78df0, arg_v=<value optimized out>) at connection.c:1109
#7  0x000000000041fcdd in connection_read_thread (ctx=0x41a78df0, argv=<value optimized out>) at connection.c:1245
#8  0x00000000004eae20 in ldap_int_thread_pool_wrapper (xpool=<value optimized out>) at tpool.c:685
#9  0x00007ff214c4ffc7 in start_thread () from /lib/libpthread.so.0
#10 0x00007ff212f775ad in clone () from /lib/libc.so.6
#11 0x0000000000000000 in ?? ()

Thread 15 (process 7695):
#0  0x00007ff214c53d29 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x000000000042efc5 in send_ldap_ber (op=<value optimized out>, ber=0x42947670) at result.c:225
#2  0x0000000000430d00 in slap_send_search_entry (op=0x7ff1e40163b0, rs=0x42ac8ca0) at result.c:1260
#3  0x000000000049df00 in hdb_search (op=0x7ff1e40163b0, rs=0x42ac8ca0) at search.c:961
#4  0x0000000000421491 in fe_op_search (op=0x7ff1e40163b0, rs=0x42ac8ca0) at search.c:366
#5  0x0000000000421c8c in do_search (op=0x7ff1e40163b0, rs=0x42ac8ca0) at search.c:217
#6  0x000000000041efff in connection_operation (ctx=0x42ac8df0, arg_v=<value optimized out>) at connection.c:1109
#7  0x000000000041fcdd in connection_read_thread (ctx=0x42ac8df0, argv=<value optimized out>) at connection.c:1245
#8  0x00000000004eae20 in ldap_int_thread_pool_wrapper (xpool=<value optimized out>) at tpool.c:685
#9  0x00007ff214c4ffc7 in start_thread () from /lib/libpthread.so.0
#10 0x00007ff212f775ad in clone () from /lib/libc.so.6
#11 0x0000000000000000 in ?? ()

Thread 14 (process 7699):
#0  0x00007ff214c53d29 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x000000000042efc5 in send_ldap_ber (op=<value optimized out>, ber=0x43148670) at result.c:225
#2  0x0000000000430d00 in slap_send_search_entry (op=0x7ff1f42babf0, rs=0x432c9ca0) at result.c:1260
#3  0x000000000049df00 in hdb_search (op=0x7ff1f42babf0, rs=0x432c9ca0) at search.c:961
#4  0x0000000000421491 in fe_op_search (op=0x7ff1f42babf0, rs=0x432c9ca0) at search.c:366
#5  0x0000000000421c8c in do_search (op=0x7ff1f42babf0, rs=0x432c9ca0) at search.c:217
#6  0x000000000041efff in connection_operation (ctx=0x432c9df0, arg_v=<value optimized out>) at connection.c:1109
#7  0x000000000041fcdd in connection_read_thread (ctx=0x432c9df0, argv=<value optimized out>) at connection.c:1245
#8  0x00000000004eae20 in ldap_int_thread_pool_wrapper (xpool=<value optimized out>) at tpool.c:685
#9  0x00007ff214c4ffc7 in start_thread () from /lib/libpthread.so.0
#10 0x00007ff212f775ad in clone () from /lib/libc.so.6
#11 0x0000000000000000 in ?? ()

Thread 13 (process 7700):
#0  0x00007ff214c53d29 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x000000000042efc5 in send_ldap_ber (op=<value optimized out>, ber=0x43949670) at result.c:225
#2  0x0000000000430d00 in slap_send_search_entry (op=0x16e57b0, rs=0x43acaca0) at result.c:1260
#3  0x000000000049df00 in hdb_search (op=0x16e57b0, rs=0x43acaca0) at search.c:961
#4  0x0000000000421491 in fe_op_search (op=0x16e57b0, rs=0x43acaca0) at search.c:366
#5  0x0000000000421c8c in do_search (op=0x16e57b0, rs=0x43acaca0) at search.c:217
#6  0x000000000041efff in connection_operation (ctx=0x43acadf0, arg_v=<value optimized out>) at connection.c:1109
#7  0x000000000041fcdd in connection_read_thread (ctx=0x43acadf0, argv=<value optimized out>) at connection.c:1245
#8  0x00000000004eae20 in ldap_int_thread_pool_wrapper (xpool=<value optimized out>) at tpool.c:685
#9  0x00007ff214c4ffc7 in start_thread () from /lib/libpthread.so.0
#10 0x00007ff212f775ad in clone () from /lib/libc.so.6
#11 0x0000000000000000 in ?? ()

Thread 12 (process 7715):
#0  0x00007ff214c53d29 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x000000000042efc5 in send_ldap_ber (op=<value optimized out>, ber=0x4414a670) at result.c:225
#2  0x0000000000430d00 in slap_send_search_entry (op=0x7ff1f42e7260, rs=0x442cbca0) at result.c:1260
#3  0x000000000049df00 in hdb_search (op=0x7ff1f42e7260, rs=0x442cbca0) at search.c:961
#4  0x0000000000421491 in fe_op_search (op=0x7ff1f42e7260, rs=0x442cbca0) at search.c:366
#5  0x0000000000421c8c in do_search (op=0x7ff1f42e7260, rs=0x442cbca0) at search.c:217
#6  0x000000000041efff in connection_operation (ctx=0x442cbdf0, arg_v=<value optimized out>) at connection.c:1109
#7  0x000000000041fcdd in connection_read_thread (ctx=0x442cbdf0, argv=<value optimized out>) at connection.c:1245
#8  0x00000000004eae20 in ldap_int_thread_pool_wrapper (xpool=<value optimized out>) at tpool.c:685
#9  0x00007ff214c4ffc7 in start_thread () from /lib/libpthread.so.0
#10 0x00007ff212f775ad in clone () from /lib/libc.so.6
#11 0x0000000000000000 in ?? ()

Thread 11 (process 7864):
#0  0x00007ff214c53d29 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x000000000042efc5 in send_ldap_ber (op=<value optimized out>, ber=0x4494b670) at result.c:225
#2  0x0000000000430d00 in slap_send_search_entry (op=0x1b731b0, rs=0x44accca0) at result.c:1260
#3  0x000000000049df00 in hdb_search (op=0x1b731b0, rs=0x44accca0) at search.c:961
#4  0x0000000000421491 in fe_op_search (op=0x1b731b0, rs=0x44accca0) at search.c:366
#5  0x0000000000421c8c in do_search (op=0x1b731b0, rs=0x44accca0) at search.c:217
#6  0x000000000041efff in connection_operation (ctx=0x44accdf0, arg_v=<value optimized out>) at connection.c:1109
#7  0x000000000041fcdd in connection_read_thread (ctx=0x44accdf0, argv=<value optimized out>) at connection.c:1245
#8  0x00000000004eae20 in ldap_int_thread_pool_wrapper (xpool=<value optimized out>) at tpool.c:685
#9  0x00007ff214c4ffc7 in start_thread () from /lib/libpthread.so.0
#10 0x00007ff212f775ad in clone () from /lib/libc.so.6
#11 0x0000000000000000 in ?? ()

Thread 10 (process 8395):
#0  0x00007ff214c53d29 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x000000000042efc5 in send_ldap_ber (op=<value optimized out>, ber=0x4514c670) at result.c:225
#2  0x0000000000430d00 in slap_send_search_entry (op=0x7ff1e4038380, rs=0x452cdca0) at result.c:1260
#3  0x000000000049df00 in hdb_search (op=0x7ff1e4038380, rs=0x452cdca0) at search.c:961
#4  0x0000000000421491 in fe_op_search (op=0x7ff1e4038380, rs=0x452cdca0) at search.c:366
#5  0x0000000000421c8c in do_search (op=0x7ff1e4038380, rs=0x452cdca0) at search.c:217
#6  0x000000000041efff in connection_operation (ctx=0x452cddf0, arg_v=<value optimized out>) at connection.c:1109
#7  0x000000000041fcdd in connection_read_thread (ctx=0x452cddf0, argv=<value optimized out>) at connection.c:1245
#8  0x00000000004eae20 in ldap_int_thread_pool_wrapper (xpool=<value optimized out>) at tpool.c:685
#9  0x00007ff214c4ffc7 in start_thread () from /lib/libpthread.so.0
#10 0x00007ff212f775ad in clone () from /lib/libc.so.6
#11 0x0000000000000000 in ?? ()

Thread 9 (process 8808):
#0  0x00007ff214c53d29 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x000000000042efc5 in send_ldap_ber (op=<value optimized out>, ber=0x4594d670) at result.c:225
#2  0x0000000000430d00 in slap_send_search_entry (op=0x7ff1fcac89e0, rs=0x45aceca0) at result.c:1260
#3  0x000000000049df00 in hdb_search (op=0x7ff1fcac89e0, rs=0x45aceca0) at search.c:961
#4  0x0000000000421491 in fe_op_search (op=0x7ff1fcac89e0, rs=0x45aceca0) at search.c:366
#5  0x0000000000421c8c in do_search (op=0x7ff1fcac89e0, rs=0x45aceca0) at search.c:217
#6  0x000000000041efff in connection_operation (ctx=0x45acedf0, arg_v=<value optimized out>) at connection.c:1109
#7  0x000000000041fcdd in connection_read_thread (ctx=0x45acedf0, argv=<value optimized out>) at connection.c:1245
#8  0x00000000004eae20 in ldap_int_thread_pool_wrapper (xpool=<value optimized out>) at tpool.c:685
#9  0x00007ff214c4ffc7 in start_thread () from /lib/libpthread.so.0
#10 0x00007ff212f775ad in clone () from /lib/libc.so.6
#11 0x0000000000000000 in ?? ()

Thread 8 (process 9706):
#0  0x00007ff214c53d29 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x000000000042efc5 in send_ldap_ber (op=<value optimized out>, ber=0x4614e670) at result.c:225
#2  0x0000000000430d00 in slap_send_search_entry (op=0x414b4d0, rs=0x462cfca0) at result.c:1260
#3  0x000000000049df00 in hdb_search (op=0x414b4d0, rs=0x462cfca0) at search.c:961
#4  0x0000000000421491 in fe_op_search (op=0x414b4d0, rs=0x462cfca0) at search.c:366
#5  0x0000000000421c8c in do_search (op=0x414b4d0, rs=0x462cfca0) at search.c:217
#6  0x000000000041efff in connection_operation (ctx=0x462cfdf0, arg_v=<value optimized out>) at connection.c:1109
#7  0x000000000041fcdd in connection_read_thread (ctx=0x462cfdf0, argv=<value optimized out>) at connection.c:1245
#8  0x00000000004eae20 in ldap_int_thread_pool_wrapper (xpool=<value optimized out>) at tpool.c:685
#9  0x00007ff214c4ffc7 in start_thread () from /lib/libpthread.so.0
#10 0x00007ff212f775ad in clone () from /lib/libc.so.6
#11 0x0000000000000000 in ?? ()

Thread 7 (process 10771):
#0  0x00007ff214c53d29 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x000000000042efc5 in send_ldap_ber (op=<value optimized out>, ber=0x4694f670) at result.c:225
#2  0x0000000000430d00 in slap_send_search_entry (op=0x414c100, rs=0x46ad0ca0) at result.c:1260
#3  0x000000000049df00 in hdb_search (op=0x414c100, rs=0x46ad0ca0) at search.c:961
#4  0x0000000000421491 in fe_op_search (op=0x414c100, rs=0x46ad0ca0) at search.c:366
#5  0x0000000000421c8c in do_search (op=0x414c100, rs=0x46ad0ca0) at search.c:217
#6  0x000000000041efff in connection_operation (ctx=0x46ad0df0, arg_v=<value optimized out>) at connection.c:1109
#7  0x000000000041fcdd in connection_read_thread (ctx=0x46ad0df0, argv=<value optimized out>) at connection.c:1245
#8  0x00000000004eae20 in ldap_int_thread_pool_wrapper (xpool=<value optimized out>) at tpool.c:685
#9  0x00007ff214c4ffc7 in start_thread () from /lib/libpthread.so.0
#10 0x00007ff212f775ad in clone () from /lib/libc.so.6
#11 0x0000000000000000 in ?? ()

Thread 6 (process 11742):
#0  0x00007ff214c53d29 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x000000000042efc5 in send_ldap_ber (op=<value optimized out>, ber=0x47150670) at result.c:225
#2  0x0000000000430d00 in slap_send_search_entry (op=0x7ff1f42e8640, rs=0x472d1ca0) at result.c:1260
#3  0x000000000049df00 in hdb_search (op=0x7ff1f42e8640, rs=0x472d1ca0) at search.c:961
#4  0x0000000000421491 in fe_op_search (op=0x7ff1f42e8640, rs=0x472d1ca0) at search.c:366
#5  0x0000000000421c8c in do_search (op=0x7ff1f42e8640, rs=0x472d1ca0) at search.c:217
#6  0x000000000041efff in connection_operation (ctx=0x472d1df0, arg_v=<value optimized out>) at connection.c:1109
#7  0x000000000041fcdd in connection_read_thread (ctx=0x472d1df0, argv=<value optimized out>) at connection.c:1245
#8  0x00000000004eae20 in ldap_int_thread_pool_wrapper (xpool=<value optimized out>) at tpool.c:685
#9  0x00007ff214c4ffc7 in start_thread () from /lib/libpthread.so.0
#10 0x00007ff212f775ad in clone () from /lib/libc.so.6
#11 0x0000000000000000 in ?? ()

Thread 5 (process 11743):
#0  0x00007ff214c53d29 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x000000000042efc5 in send_ldap_ber (op=<value optimized out>, ber=0x47951670) at result.c:225
#2  0x0000000000430d00 in slap_send_search_entry (op=0x414c610, rs=0x47ad2ca0) at result.c:1260
#3  0x000000000049df00 in hdb_search (op=0x414c610, rs=0x47ad2ca0) at search.c:961
#4  0x0000000000421491 in fe_op_search (op=0x414c610, rs=0x47ad2ca0) at search.c:366
#5  0x0000000000421c8c in do_search (op=0x414c610, rs=0x47ad2ca0) at search.c:217
#6  0x000000000041efff in connection_operation (ctx=0x47ad2df0, arg_v=<value optimized out>) at connection.c:1109
#7  0x000000000041fcdd in connection_read_thread (ctx=0x47ad2df0, argv=<value optimized out>) at connection.c:1245
#8  0x00000000004eae20 in ldap_int_thread_pool_wrapper (xpool=<value optimized out>) at tpool.c:685
#9  0x00007ff214c4ffc7 in start_thread () from /lib/libpthread.so.0
#10 0x00007ff212f775ad in clone () from /lib/libc.so.6
#11 0x0000000000000000 in ?? ()

Thread 4 (process 18180):
#0  0x00007ff214c53d29 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x000000000042efc5 in send_ldap_ber (op=<value optimized out>, ber=0x48152670) at result.c:225
#2  0x0000000000430d00 in slap_send_search_entry (op=0x4376ee0, rs=0x482d3ca0) at result.c:1260
#3  0x000000000049df00 in hdb_search (op=0x4376ee0, rs=0x482d3ca0) at search.c:961
#4  0x0000000000421491 in fe_op_search (op=0x4376ee0, rs=0x482d3ca0) at search.c:366
#5  0x0000000000421c8c in do_search (op=0x4376ee0, rs=0x482d3ca0) at search.c:217
#6  0x000000000041efff in connection_operation (ctx=0x482d3df0, arg_v=<value optimized out>) at connection.c:1109
#7  0x000000000041fcdd in connection_read_thread (ctx=0x482d3df0, argv=<value optimized out>) at connection.c:1245
#8  0x00000000004eae20 in ldap_int_thread_pool_wrapper (xpool=<value optimized out>) at tpool.c:685
#9  0x00007ff214c4ffc7 in start_thread () from /lib/libpthread.so.0
#10 0x00007ff212f775ad in clone () from /lib/libc.so.6
#11 0x0000000000000000 in ?? ()

Thread 3 (process 18190):
#0  0x00007ff214c53d29 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x000000000042efc5 in send_ldap_ber (op=<value optimized out>, ber=0x48953670) at result.c:225
#2  0x0000000000430d00 in slap_send_search_entry (op=0x7ff1f42d4510, rs=0x48ad4ca0) at result.c:1260
#3  0x000000000049df00 in hdb_search (op=0x7ff1f42d4510, rs=0x48ad4ca0) at search.c:961
#4  0x0000000000421491 in fe_op_search (op=0x7ff1f42d4510, rs=0x48ad4ca0) at search.c:366
#5  0x0000000000421c8c in do_search (op=0x7ff1f42d4510, rs=0x48ad4ca0) at search.c:217
#6  0x000000000041efff in connection_operation (ctx=0x48ad4df0, arg_v=<value optimized out>) at connection.c:1109
#7  0x000000000041fcdd in connection_read_thread (ctx=0x48ad4df0, argv=<value optimized out>) at connection.c:1245
#8  0x00000000004eae20 in ldap_int_thread_pool_wrapper (xpool=<value optimized out>) at tpool.c:685
#9  0x00007ff214c4ffc7 in start_thread () from /lib/libpthread.so.0
#10 0x00007ff212f775ad in clone () from /lib/libc.so.6
#11 0x0000000000000000 in ?? ()

Thread 2 (process 18200):
#0  0x00007ff214c53d29 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x000000000042efc5 in send_ldap_ber (op=<value optimized out>, ber=0x49154670) at result.c:225
#2  0x0000000000430d00 in slap_send_search_entry (op=0x7ff1f42e78e0, rs=0x492d5ca0) at result.c:1260
#3  0x000000000049df00 in hdb_search (op=0x7ff1f42e78e0, rs=0x492d5ca0) at search.c:961
#4  0x0000000000421491 in fe_op_search (op=0x7ff1f42e78e0, rs=0x492d5ca0) at search.c:366
#5  0x0000000000421c8c in do_search (op=0x7ff1f42e78e0, rs=0x492d5ca0) at search.c:217
#6  0x000000000041efff in connection_operation (ctx=0x492d5df0, arg_v=<value optimized out>) at connection.c:1109
#7  0x000000000041fcdd in connection_read_thread (ctx=0x492d5df0, argv=<value optimized out>) at connection.c:1245
#8  0x00000000004eae20 in ldap_int_thread_pool_wrapper (xpool=<value optimized out>) at tpool.c:685
#9  0x00007ff214c4ffc7 in start_thread () from /lib/libpthread.so.0
#10 0x00007ff212f775ad in clone () from /lib/libc.so.6
#11 0x0000000000000000 in ?? ()

Thread 1 (process 7690):
#0  0x00007ff214c50715 in pthread_join () from /lib/libpthread.so.0
#1  0x0000000000419c4c in slapd_daemon () at daemon.c:2840
#2  0x0000000000407d8a in main (argc=7, argv=0x7fff1d3caa88) at main.c:961
--- gdb core bt end ---

I guess the last part in the treads are caused by me sending a SIGSEGV, i.e. every thread was inside clone () after slapd started hanging?
Well, I'll leave the analysis to the experts I think :-)

-- 
Helge Milde, 69701808
www.monsternett.no

Comment 7 helge@monsternett.no 2009-12-29 08:10:04 UTC

Would be nice to get some feedback on this one..
Were using a core dump sufficient? If so, was the information helpful at all?
If there's anything more I could do, please say so.

-- 
Helge Milde, 69701808
www.monsternett.no

Comment 8 Howard Chu 2009-12-29 19:00:25 UTC

helge@monsternett.no wrote:
> Would be nice to get some feedback on this one..
> Were using a core dump sufficient? If so, was the information helpful at all?
> If there's anything more I could do, please say so.
>
Sorry, your issue was closed since we hadn't received any feedback from you in 
a couple of months.

The last trace you sent shows nothing unusual, just a bunch of threads all 
waiting to write on their respective connections.

Have you tried the .21 release yet?

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Comment 9 Howard Chu 2009-12-29 19:00:35 UTC

changed notes
changed state Closed to Open

Comment 10 helge@monsternett.no 2009-12-30 08:04:12 UTC

On Tue, Dec 29, 2009 at 11:00:25AM -0800, Howard Chu wrote:
>Sorry, your issue was closed since we hadn't received any feedback 
>from you in a couple of months.
>
>The last trace you sent shows nothing unusual, just a bunch of 
>threads all waiting to write on their respective connections.

If the core dump backtrace wasn't helpful, then please tell me so; I thought I had given you sufficient feedback, and that you were working on it.

>Have you tried the .21 release yet?

No.. I might do that next week. Is there anything else that I can do?
I just enabled trace loglevel, perhaps that might give us a clue on what's going on.

>-- 
>  -- Howard Chu
>  CTO, Symas Corp.           http://www.symas.com
>  Director, Highland Sun     http://highlandsun.com/hyc/
>  Chief Architect, OpenLDAP  http://www.openldap.org/project/

-- 
Helge Milde, 69701808
www.monsternett.no

Comment 11 helge@monsternett.no 2009-12-30 13:08:00 UTC

On Wed, Dec 30, 2009 at 09:04:12AM +0100, Helge Milde wrote:
>I just enabled trace loglevel, perhaps that might give us a clue on what's going on.

Slapd just stopped working again. Here's the (annotated) log for viewing and/or downloading:
http://www.pastie.org/761198.txt

Does this give you any clues?

Tell me if I can do anything else to help.

-- 
Helge Milde, 69701808
www.monsternett.no

Comment 12 Quanah Gibson-Mount 2010-01-05 21:39:40 UTC

--On Wednesday, December 30, 2009 1:09 PM +0000 helge@monsternett.no wrote:

> On Wed, Dec 30, 2009 at 09:04:12AM +0100, Helge Milde wrote:
>> I just enabled trace loglevel, perhaps that might give us a clue on
>> what's going on.
>
> Slapd just stopped working again. Here's the (annotated) log for viewing
> and/or downloading: http://www.pastie.org/761198.txt
>
> Does this give you any clues?
>
> Tell me if I can do anything else to help.

A fix for this has been committed to HEAD.  You can apply it to the latest 
source (2.4.21) if you would like to try it out:

<http://www.openldap.org/devel/cvsweb.cgi/servers/slapd/daemon.c.diff?hideattic=1&r1=text&tr1=1.442&r2=text&tr2=1.443&f=u>

--Quanah


--

Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc
--------------------
Zimbra ::  the leader in open source messaging and collaboration

Comment 13 Quanah Gibson-Mount 2010-01-05 21:46:23 UTC

--On Tuesday, January 05, 2010 1:39 PM -0800 Quanah Gibson-Mount 
<quanah@zimbra.com> wrote:

> --On Wednesday, December 30, 2009 1:09 PM +0000 helge@monsternett.no
> wrote:
>
>> On Wed, Dec 30, 2009 at 09:04:12AM +0100, Helge Milde wrote:
>>> I just enabled trace loglevel, perhaps that might give us a clue on
>>> what's going on.
>>
>> Slapd just stopped working again. Here's the (annotated) log for viewing
>> and/or downloading: http://www.pastie.org/761198.txt
>>
>> Does this give you any clues?
>>
>> Tell me if I can do anything else to help.
>
> A fix for this has been committed to HEAD.  You can apply it to the
> latest source (2.4.21) if you would like to try it out:
>
> <http://www.openldap.org/devel/cvsweb.cgi/servers/slapd/daemon.c.diff?hid
> eattic=1&r1=text&tr1=1.442&r2=text&tr2=1.443&f=u>

Hm, correction, it fixes a minor shutdown issue, not your main issue. :/

--Quanah

--

Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc
--------------------
Zimbra ::  the leader in open source messaging and collaboration

Comment 14 helge@monsternett.no 2010-02-15 08:53:54 UTC

Hello again.
We randomly tried to switch to LDAP without SSL last Friday (we used ldaps://..:636/ before), and the server hasn't crashed since!
This is a good enough workaround for us, but I hope this information can help with fixing this bug, or help others with the same problem.

On Wed, Dec 30, 2009 at 02:08:00PM +0100, Helge Milde wrote:
>On Wed, Dec 30, 2009 at 09:04:12AM +0100, Helge Milde wrote:
>>I just enabled trace loglevel, perhaps that might give us a clue on what's going on.
>
>Slapd just stopped working again. Here's the (annotated) log for viewing and/or downloading:
>http://www.pastie.org/761198.txt
>
>Does this give you any clues?
>
>Tell me if I can do anything else to help.
>
>-- 
>Helge Milde, 69701808
>www.monsternett.no

-- 
Helge Milde, 69701808
www.monsternett.no

Comment 15 OpenLDAP project 2017-03-27 23:51:05 UTC

SSL library issue.  No idea which SSL library was in use.

Comment 16 Quanah Gibson-Mount 2017-03-27 23:51:05 UTC

changed notes
changed state Open to Closed