Full_Name: Hrvoje
Version: 2.4.30
OS: Centos 6.2 x86_64
URL: http://free-zg.t-com.hr/HrvojeHabjanic/hang2.log
Submission from: (NULL) (195.29.148.138)

Hi.

While testing openldap with some of my data, slapd regularly hangs. I did manage to "catch" it, but I need an expert's interpretation of the traces.

I'm using db-5.3.15 (latest), compiled with:

../dist/configure \
    --enable-shared --enable-static \
    --enable-tcl --with-tcl=/usr/lib64 \
    --enable-cxx --enable-sql \
    --enable-java \
    --enable-test \
    --with-tcl=/usr/lib64/tcl8.5 \
    --disable-rpath \
    --enable-debug \
    --prefix=/usr/local/db

and openldap-2.4.30, compiled with:

CFLAGS="-g -I/usr/local/db/include" CPPFLAGS="-g -I/usr/local/db/include" \
LDFLAGS="-L/usr/local/db/lib -Wl,-R/usr/local/db/lib" ./configure \
    --prefix=/usr/local/openldap \
    --enable-local \
    --enable-rlookups \
    --with-tls=no \
    --with-cyrus-sasl \
    --enable-wrappers \
    --enable-passwd \
    --enable-cleartext \
    --enable-crypt \
    --enable-spasswd \
    --disable-lmpasswd \
    --enable-modules \
    --disable-sql \
    --enable-slapd \
    --enable-bdb \
    --enable-hdb \
    --enable-ldap \
    --enable-meta \
    --enable-monitor \
    --enable-null \
    --enable-shell \
    --disable-ndb \
    --enable-passwd \
    --enable-sock \
    --disable-perl \
    --enable-relay \
    --disable-shared \
    --disable-dynamic \
    --enable-overlays=mod \
    --enable-mdb \
    --enable-debug=yes

Slapd is configured to use the slapd.d directory (db). Inside, two databases are configured, i.e. ou=p,dc=pero,dc=com and ou=d,dc=pero,dc=com, plus the monitor db. The first database uses 10Gb on disk and has around 10M unique DNs, while the second one uses around 3-4Gb and has a few million DNs.

The server has 16G of RAM and 2 x quad-core CPUs, 8 cores in total (and the disks are local).

I'm using python scripts to generate load on openldap. First I fill in the required data (10Gb), and then do some transaction processing (read/update/write).

The filling part goes without problems, but during transaction processing slapd regularly gets stuck. I'm only able to trigger this using more than one connection - simulating a couple of clients, and high load (1-2 req/sec). Complete traces from gdb when this happens are at http://free-zg.t-com.hr/HrvojeHabjanic/hang2.log .

So, am I doing something wrong or is openldap...?

H.
hrvoje.habjanic@zg.t-com.hr wrote:
> Filling part goes without problems, but on transaction processing, slapd
> regularly gets stuck. I'm only able to trigger this using more than one
> connection - simulating couple of clients, and high load (1-2 req/sec).
> Complete traces from gdb when this happens, are
> http://free-zg.t-com.hr/HrvojeHabjanic/hang2.log .
>
> So, am i doing something wrong or openldap is...?

Looks like your glibc malloc is deadlocked. A Centos bug, not an OpenLDAP bug.

In the trace, you could confirm this in gdb with:
    thread 13
    frame 3
    print *mutex

most likely the "owner" field of this mutex will be 1502, which corresponds to thread 17, which is waiting for a lock inside libc malloc/free.

You may be able to avoid this bug by using an alternate malloc library, such as Google tcmalloc.

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
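The check described above (read the mutex's owner field, then find the thread with that id) can also be run over a saved log. What follows is only a minimal sketch, assuming the log uses the `Thread N (Thread 0x... (LWP M)):` headers that gdb prints for `thread apply all bt` and that the `print *mutex` output contains an `__owner` field, as it does for glibc mutexes. The function names are hypothetical, and the sample text is made up for illustration (only the numbers 1502 and 17 come from the reply above); it is not copied from hang2.log.

```python
import re

# Per-thread header line in gdb's "thread apply all bt" output.
THREAD_HEADER = re.compile(
    r"^Thread (\d+) \(Thread 0x[0-9a-f]+ \(LWP (\d+)\)\):", re.MULTILINE
)
# Owner field inside a printed glibc pthread_mutex_t.
OWNER_FIELD = re.compile(r"__owner = (\d+)")


def lwp_to_thread(backtrace_text):
    """Map kernel thread id (LWP) to gdb thread number."""
    return {int(lwp): int(num) for num, lwp in THREAD_HEADER.findall(backtrace_text)}


def mutex_owner_thread(mutex_text, backtrace_text):
    """Return (owner LWP, gdb thread number or None) for a printed mutex."""
    match = OWNER_FIELD.search(mutex_text)
    if match is None:
        raise ValueError("no __owner field in mutex dump")
    owner = int(match.group(1))
    return owner, lwp_to_thread(backtrace_text).get(owner)


# Illustrative input only: abbreviated, with placeholder addresses and an
# invented LWP for thread 13. Not taken from the real trace.
SAMPLE_BT = """\
Thread 17 (Thread 0x7f0000001700 (LWP 1502)):
Thread 13 (Thread 0x7f0000002700 (LWP 1506)):
"""
SAMPLE_MUTEX = "$1 = {__data = {__lock = 2, __count = 0, __owner = 1502, __nusers = 1, __kind = 0}}"

print(mutex_owner_thread(SAMPLE_MUTEX, SAMPLE_BT))
```

If the owner's own backtrace shows it blocked inside malloc/free, that would match the diagnosis given in the reply; the script only does the id lookup and does not inspect the frames.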
On 03.04.2012 17:41, Howard Chu wrote:
>>
>> So, am i doing something wrong or openldap is...?
>
>
> Looks like your glibc malloc is deadlocked. A Centos bug, not an
> OpenLDAP bug.
>
> In the trace, you could confirm this in gdb with:
>     thread 13
>     frame 3
>     print *mutex
>
> most likely the "owner" field of this mutex will be 1502, which
> corresponds to thread 17, which is waiting for a lock inside libc
> malloc/free.
>
> You may be able to avoid this bug by using an alternate malloc
> library, such as Google tcmalloc.
>

Hi.

Thanks for the inside info ... :-) And sorry that I was unable to provide more info - the core dump alone is 16 GB!

Also, a small side note - when this "hang" happens, it only affects existing connections - I'm attacking it with two processes, each with 4 connections. New searches using ldapsearch work fine ...

And a correction for a typo - for "high load" I wrote (1-2 req/sec) - it should actually read 1-2k req/sec ...

What is interesting is that this "problem" goes back to db-4.7 and openldap-2.4.23 (provided with centos) ...

I'll try an alternate malloc and report back ...

H.
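The load scripts themselves are not shown in this report. As a rough sketch of the setup described (several connections issuing a read/update/write mix, with a hang showing up only on the existing connections), the harness below runs one worker thread per simulated connection and reports which workers never came back. Everything here is an assumption made for illustration: `run_load`, `do_request` and the parameter names are hypothetical, and `do_request` stands in for whatever LDAP call a real script would make.

```python
import random
import threading
import time
from collections import Counter


def run_load(do_request, connections=4, duration=1.0, grace=1.0,
             ops=("read", "update", "write"), seed=0):
    """Call do_request(conn_id, op) from `connections` threads for `duration` seconds.

    Returns (completed requests per second, ids of workers still blocked
    `grace` seconds after the deadline).
    """
    done = Counter()
    lock = threading.Lock()
    deadline = time.monotonic() + duration

    def worker(conn_id):
        rng = random.Random(seed + conn_id)  # reproducible operation mix per worker
        while time.monotonic() < deadline:
            do_request(conn_id, rng.choice(ops))
            with lock:
                done[conn_id] += 1

    # Daemon threads, so a worker stuck in do_request cannot keep the process alive.
    threads = [threading.Thread(target=worker, args=(i,), daemon=True)
               for i in range(connections)]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=max(0.0, deadline + grace - time.monotonic()))

    stalled = [i for i, t in enumerate(threads) if t.is_alive()]
    with lock:
        total = sum(done.values())
    return total / duration, stalled
```

With a real client library, `do_request` would issue the search or modify on a connection kept per worker; a non-empty `stalled` list is then the moment to attach gdb and collect backtraces.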
On 03.04.2012 17:41, Howard Chu wrote:
>
> You may be able to avoid this bug by using an alternate malloc
> library, such as Google tcmalloc.
>

Hi.

I did try - using tcmalloc. And this time, I got SIGSEGV. The odd thing is that this happened in "pthread_mutex_lock", which is in libpthread.so ...?

Another bug in the centos libs? I would appreciate it if you could take a look.

Thx.

H.

p.s. url -> http://free-zg.t-com.hr/HrvojeHabjanic/hang3.log
On 04.04.2012 17:45, Hrvoje Habjanić wrote:
> On 03.04.2012 17:41, Howard Chu wrote:
>> You may be able to avoid this bug by using an alternate malloc
>> library, such as Google tcmalloc.
>>
> Hi.
>
> I did try - using tcmalloc. And this time, i got SIGSEGV. Odd thing is
> that this happened in "pthread_mutex_lock" which is in libpthread.so ...?
>
> Another bug in centos libs? I would appreciate if you could take a look.
>
> Thx.
>
> H.
>
> p.s. url -> http://free-zg.t-com.hr/HrvojeHabjanic/hang3.log

Hi.

And, one more SIGSEGV ... Should I open a new ITS?

H.

p.s. http://free-zg.t-com.hr/HrvojeHabjanic/openldap/ssegv.log
On 08.04.2012 13:25, Hrvoje Habjanić wrote:
> On 04.04.2012 17:45, Hrvoje Habjanić wrote:
>> On 03.04.2012 17:41, Howard Chu wrote:
>>> You may be able to avoid this bug by using an alternate malloc
>>> library, such as Google tcmalloc.
>>>
>> Hi.
>>
>> I did try - using tcmalloc. And this time, i got SIGSEGV. Odd thing is
>> that this happened in "pthread_mutex_lock" which is in libpthread.so ...?
>>
>> Another bug in centos libs? I would appreciate if you could take a look.
>>
>> Thx.
>>
>> H.
>>
>> p.s. url -> http://free-zg.t-com.hr/HrvojeHabjanic/hang3.log
> Hi.

Two more "hang"s, both in sched_yield(). This is with the replacement malloc (tcmalloc, minimal).

http://free-zg.t-com.hr/HrvojeHabjanic/openldap/hang4.log
http://free-zg.t-com.hr/HrvojeHabjanic/openldap/hang5.log

H.
On 09.04.2012 20:09, Hrvoje Habjanić wrote:
>
> Hi.
>
> Two more "hang"s, both in sched_yield(). This is with replacement malloc
> (tcmalloc, minimal).
>
> http://free-zg.t-com.hr/HrvojeHabjanic/openldap/hang4.log
> http://free-zg.t-com.hr/HrvojeHabjanic/openldap/hang5.log
>
> H.

Hi.

The attached patch solves my problem with the "hang" in sched_yield.

In general, I do think that there is a lot of unnecessary locking and waiting there (in cache management) ... And simplifying things there would solve a lot of problems ... Probably. :-)

Of course, I'm not sure how this change will influence the rest of the code, but it does work for me (tm).

H.

p.s. Also available at http://free-zg.t-com.hr/HrvojeHabjanic/openldap/ol.diff
changed notes
changed state Open to Test
moved from Incoming to Software Bugs

changed notes
changed state Test to Release

changed notes
changed state Release to Closed

applied to master
applied to RE24