Logged in as guest
Viewing Incoming/6322 Full headers
Major security issue: yes no
Notes: Notification:
Date: Fri, 02 Oct 2009 09:50:15 +0000 From: helge@monsternett.no To: openldap-its@OpenLDAP.org Subject: slapd suddenly stops working, and starts using 100% CPU
Full_Name: Helge Milde Version: slapd 2.4.11 OS: Debian 5 URL: Submission from: (NULL) (81.175.2.164) Hello, In the last two days, slapd has randomly stopped working on our server, about one time every two or three hours. Before this, slapd had been running without any problems for about two months. slapd.conf has not been touched since then. What happens is that slapd stops taking queries (postfix, dovecot, NSS etc. starts getting timeouts), and it starts using exactly 100% of one core until we restart it. After restart, everything works again (until this happens again). It seems to only happen when slapd is under "stress" (we have an 8-core system with 8G RAM, and so slapd is really never under much "stress"; the biggest CPU load I've seen is about 40% on one core). The last ldap search before slapd stops working is always the same: Oct 2 10:57:38 oi-mail slapd[24550]: conn=1543 op=1 SRCH base="ou=groups,dc=web,dc=xxxx,dc=net" scope=1 deref=0 filter="(&(objectClass=posixGroup))" .. which should return about 1800 entries. Different connections might query this simultaneously, and it's a query that happens every ~10 seconds on average. I've tried to reproduce this by doing .. # while [ 1 ]; do ldapsearch -x -H ldaps://.. -b 'ou=groups,dc=web,dc=xxxx,dc=net' '(&(objectClass=posixAccount))'; done .. for two minutes without any "luck" :). slapd just uses a bit more CPU, but never stops working. With "loglevel stats stats2", I get nothing out of the logs: Oct 2 08:42:15 oi-mail slapd[15836]: conn=34798 op=1 ENTRY dn="cn=xxxx,ou=groups,dc=web,dc=xxxx,dc=net" ... time of "crash" (we're mid-search at this point); nothing gets logged until we do a restart: ... Oct 2 08:42:30 oi-mail slapd[15836]: daemon: shutdown requested and initiated. Oct 2 08:42:30 oi-mail slapd[15836]: conn=34391 fd=31 closed (slapd shutdown) ... many entires like the one above, and then I seem to be getting the next entry from the same connection, right before it closes: ... Oct 2 08:42:30 oi-mail slapd[15836]: conn=34798 op=1 ENTRY dn="cn=xxxx,ou=groups,dc=web,dc=xxxx,dc=net" Oct 2 08:42:30 oi-mail slapd[15836]: conn=34798 fd=81 closed (slapd shutdown) I've tried logging more than just stats, but the interesting loglevels also causes a big performance hit (trace and args), and I'm not really sure which of the other loglevels could be helpful. This is a live ldap server, on a server that also hosts other services, so I simply can't take the performance hit. I also did a strace -p <slapd pid>, which only gives "futex(0x41c499e0, FUTEX_WAIT," until I restart slapd, but the "restarting routine" could be useful, so here it is: http://www.pastie.org/639170 I understand that this probably isn't enough data to see what the problem is, and so I'm hoping to get some help in debugging this properly. As I've said, args/trace loglevel takes too much of a performance hit, but if this is the only way, we might reconcider. Interesting parts of slapd.conf below. We use slapd.conf instead of cn=config, everything goes over TLS, we're AFAIK only using LDAPv3 binding, and we only have one database entry in slapd.conf. moduleload back_hdb sizelimit 99999 tool-threads 8 threads 16 concurrency 32 backend hdb database hdb cachesize 60000 dbconfig set_cachesize 0 52428800 0 dbconfig set_lk_max_objects 1500 dbconfig set_lk_max_locks 1500 dbconfig set_lk_max_lockers 1500 lastmod on checkpoint 512 30 Best regards, Helge Milde
Date: Fri, 02 Oct 2009 13:54:13 -0700 From: Howard Chu <hyc@symas.com> To: helge@monsternett.no CC: openldap-its@openldap.org Subject: Re: (ITS#6322) slapd suddenly stops working, and starts using 100% CPU
helge@monsternett.no wrote: > Full_Name: Helge Milde > Version: slapd 2.4.11 > OS: Debian 5 > URL: > Submission from: (NULL) (81.175.2.164) > > > Hello, > > In the last two days, slapd has randomly stopped working on our server, about > one time every two or three hours. Before this, slapd had been running without > any problems for about two months. slapd.conf has not been touched since then. > > What happens is that slapd stops taking queries (postfix, dovecot, NSS etc. > starts getting timeouts), and it starts using exactly 100% of one core until we > restart it. > After restart, everything works again (until this happens again). > > It seems to only happen when slapd is under "stress" (we have an 8-core system > with 8G RAM, and so slapd is really never under much "stress"; the biggest CPU > load I've seen is about 40% on one core). The last ldap search before slapd > stops working is always the same: > Oct 2 10:57:38 oi-mail slapd[24550]: conn=1543 op=1 SRCH > base="ou=groups,dc=web,dc=xxxx,dc=net" scope=1 deref=0 > filter="(&(objectClass=posixGroup))" > .. which should return about 1800 entries. Different connections might query > this simultaneously, and it's a query that happens every ~10 seconds on > average. > > I've tried to reproduce this by doing .. > # while [ 1 ]; do ldapsearch -x -H ldaps://.. -b > 'ou=groups,dc=web,dc=xxxx,dc=net' '(&(objectClass=posixAccount))'; done > .. for two minutes without any "luck" :). slapd just uses a bit more CPU, but > never stops working. > > With "loglevel stats stats2", I get nothing out of the logs: > Oct 2 08:42:15 oi-mail slapd[15836]: conn=34798 op=1 ENTRY > dn="cn=xxxx,ou=groups,dc=web,dc=xxxx,dc=net" > ... time of "crash" (we're mid-search at this point); nothing gets logged until > we do a restart: ... > Oct 2 08:42:30 oi-mail slapd[15836]: daemon: shutdown requested and initiated. > Oct 2 08:42:30 oi-mail slapd[15836]: conn=34391 fd=31 closed (slapd shutdown) > ... many entires like the one above, and then I seem to be getting the next > entry from the same connection, right before it closes: ... > Oct 2 08:42:30 oi-mail slapd[15836]: conn=34798 op=1 ENTRY > dn="cn=xxxx,ou=groups,dc=web,dc=xxxx,dc=net" > Oct 2 08:42:30 oi-mail slapd[15836]: conn=34798 fd=81 closed (slapd shutdown) > > I've tried logging more than just stats, but the interesting loglevels also > causes a big performance hit (trace and args), and I'm not really sure which of > the other loglevels could be helpful. > This is a live ldap server, on a server that also hosts other services, so I > simply can't take the performance hit. > > I also did a strace -p<slapd pid>, which only gives "futex(0x41c499e0, > FUTEX_WAIT," until I restart slapd, but the "restarting routine" could be > useful, so here it is: http://www.pastie.org/639170 strace is useless. Use gdb and get a trace of all running threads when this occurs. http://www.openldap.org/faq/data/cache/59.html -- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/
Date: Fri, 02 Oct 2009 14:27:54 -0700 From: Quanah Gibson-Mount <quanah@zimbra.com> To: helge@monsternett.no, openldap-its@openldap.org Subject: Re: (ITS#6322) slapd suddenly stops working, and starts using 100% CPU
--On Friday, October 02, 2009 8:54 PM +0000 hyc@symas.com wrote: > strace is useless. Use gdb and get a trace of all running threads when > this occurs. > > http://www.openldap.org/faq/data/cache/59.html You also need to test against a current release of OpenLDAP, like OpenLDAP 2.4.18. --Quanah -- Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
Date: Thu, 10 Dec 2009 10:44:38 +0100 From: Helge Milde <helge@monsternett.no> To: quanah@zimbra.com Cc: openldap-its@openldap.org Subject: Re: (ITS#6322) slapd suddenly stops working, and starts using 100% CPU
I have now installed openldap 2.4.19 from source (default configuration except for --enable-crypt=yes). # slapd -V @(#) $OpenLDAP: slapd 2.4.19 (Dec 9 2009 22:46:15) $ But unfortunately, the bug is still there. Nothing has changed at all.. It has crashed 10 times since I did the migration 10 hours ago. It usually crashes 1-2 times at night, and 10-20 times during work hours (when the servers have more load). When I migrated, I recreated the database by typing: # oldbin/slapcat > db.ldif # newbin/slapadd < db.ldif # (under the new openldap/slapd user) So I can't imagine the database can be corrupt in any way, which I initially thought when I first sent this bug report. It still dies on the same queries as before, in the middle of iterating through ou=users,.. or ou=groups,.., which both have ~1.8k entries. These queries comes from proftpd, which does a "getent passwd" and "getent group" every time a customer logs in. I tried to reproduce this manually again, launching 8 processes, constantly querying ou=users and ou=groups, 2 using a unix socket locally and 6 using ldaps://hostname/ remotely, but it still won't break down when I do that. This causes *alot* more load than proftpd does, but the crashes seem to only happen "randomly". Any suggestions on what I can do to figure out this problem? The LDAP server is live and in use in two servers with 2-3k users, so I can't mess with it *too* much. On Fri, Oct 02, 2009 at 09:28:25PM +0000, quanah@zimbra.com wrote: >--On Friday, October 02, 2009 8:54 PM +0000 hyc@symas.com wrote: > > >> strace is useless. Use gdb and get a trace of all running threads when >> this occurs. >> >> http://www.openldap.org/faq/data/cache/59.html > > >You also need to test against a current release of OpenLDAP, like OpenLDAP >2.4.18. > >--Quanah > > >-- > >Quanah Gibson-Mount >Principal Software Engineer >Zimbra, Inc >-------------------- >Zimbra :: the leader in open source messaging and collaboration > > -- Helge Milde, 69701808 www.monsternett.no
Date: Thu, 10 Dec 2009 08:21:20 -0800 From: Quanah Gibson-Mount <quanah@zimbra.com> To: Helge Milde <helge@monsternett.no> cc: openldap-its@openldap.org Subject: Re: (ITS#6322) slapd suddenly stops working, and starts using 100% CPU
--On Thursday, December 10, 2009 10:44 AM +0100 Helge Milde <helge@monsternett.no> wrote: > I have now installed openldap 2.4.19 from source (default configuration > except for --enable-crypt=yes). > ># slapd -V > @(#) $OpenLDAP: slapd 2.4.19 (Dec 9 2009 22:46:15) $ > > But unfortunately, the bug is still there. Nothing has changed at all.. > It has crashed 10 times since I did the migration 10 hours ago. It > usually crashes 1-2 times at night, and 10-20 times during work hours > (when the servers have more load). Well, now 2.4.20 is current (just to note). What we'll need for resolving your issue is: (a) Your configuration files (minus passwords) (b) A backtrace from gdb of a crash. Make sure you build OpenLDAP with debugging symbols (using CFLAGS of -g) and when you run make install, make sure you use STRIP="", like: make install DESTDIR=/my/path/to/install STRIP="" Thanks, Quanah -- Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
Date: Fri, 11 Dec 2009 11:46:46 +0100 From: Helge Milde <helge@monsternett.no> To: Quanah Gibson-Mount <quanah@zimbra.com> Cc: openldap-its@openldap.org Subject: Re: (ITS#6322) slapd suddenly stops working, and starts using 100% CPU
On Thu, Dec 10, 2009 at 08:21:20AM -0800, Quanah Gibson-Mount wrote: >Well, now 2.4.20 is current (just to note). Right. I'm now running this version. The results below are with 2.4.20, and not 2.4.19. >What we'll need for resolving your issue is: > >(a) Your configuration files (minus passwords) # egrep -v '^#|^$' etc/openldap/slapd.conf TLSCertificateFile <removed>.crt TLSCertificateKeyFile <removed>.key TLSCACertificateFile <removed>.ca include /usr/local/ldap/etc/openldap/schema/core.schema include /usr/local/ldap/etc/openldap/schema/cosine.schema include /usr/local/ldap/etc/openldap/schema/nis.schema include /usr/local/ldap/etc/openldap/schema/qmail.schema include /usr/local/ldap/etc/openldap/schema/inetorgperson.schema include /usr/local/ldap/etc/openldap/schema/mod_vhost_ldap.schema include /usr/local/ldap/etc/openldap/schema/monsternett_vhost_ldap.schema include /usr/local/ldap/etc/openldap/schema/sql_user.schema include /usr/local/ldap/etc/openldap/schema/using_space.schema pidfile /usr/local/ldap/var/run/slapd/slapd.pid argsfile /usr/local/ldap/var/run/slapd/slapd.args loglevel stats modulepath /usr/local/ldap/lib moduleload back_hdb sizelimit 99999 tool-threads 8 threads 16 concurrency 32 backend hdb database hdb suffix "dc=<removed>" directory "/usr/local/ldap/var/openldap-data" cachesize 60000 dbconfig set_cachesize 0 52428800 0 dbconfig set_lk_max_objects 1500 dbconfig set_lk_max_locks 1500 dbconfig set_lk_max_lockers 1500 index objectclass,sn eq index cn,uniqueMember,memberUid,uidNumber,gidNumber pres,eq index uid,mail,associatedDomain,mailAlternateAddress,apacheServerName,apacheServerAlias,apacheSuexecUid pres,eq,sub lastmod on checkpoint 512 30 access to attrs=userPassword,shadowLastChange by dn="cn=admin,dc=<removed>" write by anonymous auth by self write by * none access to dn.base="" by * read access to * by dn="cn=admin,dc=<removed>" write by * read >(b) A backtrace from gdb of a crash. Since this is a *live* server, I don't want to run it through gdb if I don't have to. Also, the server doesn't really crash, it just hangs, using 100% of one CPU core. We currently have a script that checks if ldapsearch succeeds, and if not, it restarts slapd. AFAIK, if we would have done this via gdb directly, the ports would still be open, and slapd wouldn't be able to start, which means I would have to sit there for an hour to quickly be able to run a backtrace, so slapd could be restarted instantly :) Anyway, instead, I made it dump a core (by sending killall -11 slapd in my script). I hope this is sufficient. --- gdb core bt start --- (gdb) thread apply all bt Thread 18 (process 7691): #0 0x00007ff212f77b78 in epoll_wait () from /lib/libc.so.6 #1 0x000000000041c618 in slapd_daemon_task (ptr=<value optimized out>) at daemon.c:2465 #2 0x00007ff214c4ffc7 in start_thread () from /lib/libpthread.so.0 #3 0x00007ff212f775ad in clone () from /lib/libc.so.6 #4 0x0000000000000000 in ?? () Thread 17 (process 7692): #0 0x00007ff214c53d29 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0 #1 0x000000000042efc5 in send_ldap_ber (op=<value optimized out>, ber=0x410f6670) at result.c:225 #2 0x0000000000430d00 in slap_send_search_entry (op=0x2732490, rs=0x41277ca0) at result.c:1260 #3 0x000000000049df00 in hdb_search (op=0x2732490, rs=0x41277ca0) at search.c:961 #4 0x0000000000421491 in fe_op_search (op=0x2732490, rs=0x41277ca0) at search.c:366 #5 0x0000000000421c8c in do_search (op=0x2732490, rs=0x41277ca0) at search.c:217 #6 0x000000000041efff in connection_operation (ctx=0x41277df0, arg_v=<value optimized out>) at connection.c:1109 #7 0x000000000041fcdd in connection_read_thread (ctx=0x41277df0, argv=<value optimized out>) at connection.c:1245 #8 0x00000000004eae20 in ldap_int_thread_pool_wrapper (xpool=<value optimized out>) at tpool.c:685 #9 0x00007ff214c4ffc7 in start_thread () from /lib/libpthread.so.0 #10 0x00007ff212f775ad in clone () from /lib/libc.so.6 #11 0x0000000000000000 in ?? () Thread 16 (process 7694): #0 0x00007ff214c53d29 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0 #1 0x000000000042efc5 in send_ldap_ber (op=<value optimized out>, ber=0x418f7670) at result.c:225 #2 0x0000000000430d00 in slap_send_search_entry (op=0x285ba50, rs=0x41a78ca0) at result.c:1260 #3 0x000000000049df00 in hdb_search (op=0x285ba50, rs=0x41a78ca0) at search.c:961 #4 0x0000000000421491 in fe_op_search (op=0x285ba50, rs=0x41a78ca0) at search.c:366 #5 0x0000000000421c8c in do_search (op=0x285ba50, rs=0x41a78ca0) at search.c:217 #6 0x000000000041efff in connection_operation (ctx=0x41a78df0, arg_v=<value optimized out>) at connection.c:1109 #7 0x000000000041fcdd in connection_read_thread (ctx=0x41a78df0, argv=<value optimized out>) at connection.c:1
Date: Tue, 29 Dec 2009 09:10:04 +0100 From: Helge Milde <helge@monsternett.no> To: Quanah Gibson-Mount <quanah@zimbra.com> Cc: openldap-its@openldap.org Subject: Re: (ITS#6322) slapd suddenly stops working, and starts using 100% CPU
Would be nice to get some feedback on this one.. Were using a core dump sufficient? If so, was the information helpful at all? If there's anything more I could do, please say so. -- Helge Milde, 69701808 www.monsternett.no
Date: Tue, 29 Dec 2009 11:00:25 -0800 From: Howard Chu <hyc@symas.com> To: helge@monsternett.no CC: openldap-its@openldap.org Subject: Re: (ITS#6322) slapd suddenly stops working, and starts using 100% CPU
helge@monsternett.no wrote: > Would be nice to get some feedback on this one.. > Were using a core dump sufficient? If so, was the information helpful at all? > If there's anything more I could do, please say so. > Sorry, your issue was closed since we hadn't received any feedback from you in a couple of months. The last trace you sent shows nothing unusual, just a bunch of threads all waiting to write on their respective connections. Have you tried the .21 release yet? -- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/
Date: Wed, 30 Dec 2009 09:04:12 +0100 From: Helge Milde <helge@monsternett.no> To: Howard Chu <hyc@symas.com> Cc: openldap-its@openldap.org Subject: Re: (ITS#6322) slapd suddenly stops working, and starts using 100% CPU
On Tue, Dec 29, 2009 at 11:00:25AM -0800, Howard Chu wrote: >Sorry, your issue was closed since we hadn't received any feedback >from you in a couple of months. > >The last trace you sent shows nothing unusual, just a bunch of >threads all waiting to write on their respective connections. If the core dump backtrace wasn't helpful, then please tell me so; I thought I had given you sufficient feedback, and that you were working on it. >Have you tried the .21 release yet? No.. I might do that next week. Is there anything else that I can do? I just enabled trace loglevel, perhaps that might give us a clue on what's going on. >-- > -- Howard Chu > CTO, Symas Corp. http://www.symas.com > Director, Highland Sun http://highlandsun.com/hyc/ > Chief Architect, OpenLDAP http://www.openldap.org/project/ -- Helge Milde, 69701808 www.monsternett.no
Date: Wed, 30 Dec 2009 14:08:00 +0100 From: Helge Milde <helge@monsternett.no> To: Howard Chu <hyc@symas.com> Cc: openldap-its@openldap.org Subject: Re: (ITS#6322) slapd suddenly stops working, and starts using 100% CPU
On Wed, Dec 30, 2009 at 09:04:12AM +0100, Helge Milde wrote: >I just enabled trace loglevel, perhaps that might give us a clue on what's going on. Slapd just stopped working again. Here's the (annotated) log for viewing and/or downloading: http://www.pastie.org/761198.txt Does this give you any clues? Tell me if I can do anything else to help. -- Helge Milde, 69701808 www.monsternett.no
Date: Tue, 05 Jan 2010 13:39:40 -0800 From: Quanah Gibson-Mount <quanah@zimbra.com> To: helge@monsternett.no, openldap-its@openldap.org Subject: Re: (ITS#6322) slapd suddenly stops working, and starts using 100% CPU
--On Wednesday, December 30, 2009 1:09 PM +0000 helge@monsternett.no wrote: > On Wed, Dec 30, 2009 at 09:04:12AM +0100, Helge Milde wrote: >> I just enabled trace loglevel, perhaps that might give us a clue on >> what's going on. > > Slapd just stopped working again. Here's the (annotated) log for viewing > and/or downloading: http://www.pastie.org/761198.txt > > Does this give you any clues? > > Tell me if I can do anything else to help. A fix for this has been committed to HEAD. You can apply it to the latest source (2.4.21) if you would like to try it out: <http://www.openldap.org/devel/cvsweb.cgi/servers/slapd/daemon.c.diff?hideattic=1&r1=text&tr1=1.442&r2=text&tr2=1.443&f=u> --Quanah -- Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
Date: Tue, 05 Jan 2010 13:46:23 -0800 From: Quanah Gibson-Mount <quanah@zimbra.com> To: helge@monsternett.no, openldap-its@openldap.org Subject: Re: (ITS#6322) slapd suddenly stops working, and starts using 100% CPU
--On Tuesday, January 05, 2010 1:39 PM -0800 Quanah Gibson-Mount <quanah@zimbra.com> wrote: > --On Wednesday, December 30, 2009 1:09 PM +0000 helge@monsternett.no > wrote: > >> On Wed, Dec 30, 2009 at 09:04:12AM +0100, Helge Milde wrote: >>> I just enabled trace loglevel, perhaps that might give us a clue on >>> what's going on. >> >> Slapd just stopped working again. Here's the (annotated) log for viewing >> and/or downloading: http://www.pastie.org/761198.txt >> >> Does this give you any clues? >> >> Tell me if I can do anything else to help. > > A fix for this has been committed to HEAD. You can apply it to the > latest source (2.4.21) if you would like to try it out: > > <http://www.openldap.org/devel/cvsweb.cgi/servers/slapd/daemon.c.diff?hid > eattic=1&r1=text&tr1=1.442&r2=text&tr2=1.443&f=u> Hm, correction, it fixes a minor shutdown issue, not your main issue. :/ --Quanah -- Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
Date: Mon, 15 Feb 2010 09:53:54 +0100 From: Helge Milde <helge@monsternett.no> To: openldap-its@openldap.org Subject: Re: (ITS#6322) slapd suddenly stops working, and starts using 100% CPU
Hello again. We randomly tried to switch to LDAP without SSL last Friday (we used ldaps://..:636/ before), and the server hasn't crashed since! This is a good enough workaround for us, but I hope this information can help with fixing this bug, or help others with the same problem. On Wed, Dec 30, 2009 at 02:08:00PM +0100, Helge Milde wrote: >On Wed, Dec 30, 2009 at 09:04:12AM +0100, Helge Milde wrote: >>I just enabled trace loglevel, perhaps that might give us a clue on what's going on. > >Slapd just stopped working again. Here's the (annotated) log for viewing and/or downloading: >http://www.pastie.org/761198.txt > >Does this give you any clues? > >Tell me if I can do anything else to help. > >-- >Helge Milde, 69701808 >www.monsternett.no -- Helge Milde, 69701808 www.monsternett.no
______________ © Copyright 2013, OpenLDAP Foundation, info@OpenLDAP.org