7222 – Slapd hangs on high load

Summary: Slapd hangs on high load

Status:	VERIFIED FIXED

Alias:	None

Product:	OpenLDAP
Classification:	Unclassified
Component:	slapd
Version:	2.4.30
Hardware:	All All

Importance:	--- normal
Target Milestone:	---
Assignee:	OpenLDAP project

Reported:	2012-04-03 14:27 UTC by hrvoje.habjanic@zg.t-com.hr
Modified:	2014-08-01 21:04 UTC
CC List:	0 users

Attachments
ol.diff (962 bytes, patch) 2012-04-11 08:29 UTC, hrvoje.habjanic@zg.t-com.hr

Description hrvoje.habjanic@zg.t-com.hr 2012-04-03 14:27:43 UTC

Full_Name: Hrvoje
Version: 2.4.30
OS: Centos 6.2 x86_64
URL: http://free-zg.t-com.hr/HrvojeHabjanic/hang2.log
Submission from: (NULL) (195.29.148.138)



Hi.

While testing OpenLDAP with some of my data, slapd regularly hangs. I did manage
to "catch" it, but I need an expert's interpretation of the traces.

I'm using db-5.3.15 (latest), compiled with:

../dist/configure \
                --enable-shared --enable-static \
                --enable-tcl --with-tcl=/usr/lib64 \
                --enable-cxx --enable-sql \
                --enable-java \
                --enable-test \
                --with-tcl=/usr/lib64/tcl8.5 \
                --disable-rpath \
                --enable-debug \
                --prefix=/usr/local/db

and openldap-2.4.30, compiled with:

CFLAGS="-g -I/usr/local/db/include" CPPFLAGS="-g -I/usr/local/db/include" \
LDFLAGS="-L/usr/local/db/lib -Wl,-R/usr/local/db/lib" ./configure \
 --prefix=/usr/local/openldap \
 --enable-local \
 --enable-rlookups \
 --with-tls=no \
 --with-cyrus-sasl \
 --enable-wrappers \
 --enable-passwd \
 --enable-cleartext \
 --enable-crypt \
 --enable-spasswd \
 --disable-lmpasswd \
 --enable-modules \
 --disable-sql \
 --enable-slapd \
 --enable-bdb \
 --enable-hdb \
 --enable-ldap \
 --enable-meta \
 --enable-monitor \
 --enable-null \
 --enable-shell \
 --disable-ndb \
 --enable-passwd \
 --enable-sock \
 --disable-perl \
 --enable-relay \
 --disable-shared \
 --disable-dynamic \
 --enable-overlays=mod \
 --enable-mdb \
 --enable-debug=yes

Slapd is configured to use a slapd.d directory (db). Inside, two databases are
configured, i.e. ou=p,dc=pero,dc=com and ou=d,dc=pero,dc=com, plus the monitor
db. The first database uses 10 GB on disk and has around 10M unique DNs, while
the second one uses around 3-4 GB and has a few million DNs.

The server has 16 GB of RAM and 2 quad-core CPUs - 8 cores in total (and the
disks are local).

I'm using Python scripts to generate load on OpenLDAP. First I fill in the
required data (10 GB), and then do some transaction processing
(read/update/write).

The filling part goes without problems, but during transaction processing, slapd
regularly gets stuck. I'm only able to trigger this using more than one
connection - simulating a couple of clients, and high load (1-2 req/sec).
Complete traces from gdb when this happens are at
http://free-zg.t-com.hr/HrvojeHabjanic/hang2.log .
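
(Editorial sketch, not the reporter's script, which is not shown in this issue.) The setup described above, several client connections issuing requests concurrently, can be approximated with a small standard-library harness. The names `run_load` and `fake_operation` are hypothetical; a real test would replace `fake_operation` with one LDAP read/update/write on a per-thread connection. Workers that have not finished by the timeout are reported as stalled, which is only a crude hint that the server may have stopped answering.

```python
import threading
import time
from collections import Counter


def run_load(operation, connections=8, requests_per_connection=100,
             stall_timeout=5.0):
    """Call operation(conn_id, seq) from several worker threads at once.

    Returns (completed, stalled): the number of requests each simulated
    connection finished, and the ids of workers still running after
    stall_timeout seconds.
    """
    completed = Counter()
    lock = threading.Lock()

    def worker(conn_id):
        for seq in range(requests_per_connection):
            operation(conn_id, seq)  # placeholder for one LDAP request
            with lock:
                completed[conn_id] += 1

    # Daemon threads so a hung worker cannot keep the process alive.
    threads = [threading.Thread(target=worker, args=(i,), daemon=True)
               for i in range(connections)]
    for t in threads:
        t.start()

    # All workers share one overall deadline.
    deadline = time.monotonic() + stall_timeout
    stalled = []
    for i, t in enumerate(threads):
        t.join(max(0.0, deadline - time.monotonic()))
        if t.is_alive():
            stalled.append(i)
    with lock:
        return dict(completed), stalled


def fake_operation(conn_id, seq):
    """Hypothetical stand-in for a real LDAP request."""
    time.sleep(0.001)


if __name__ == "__main__":
    counts, stalled = run_load(fake_operation, connections=8,
                               requests_per_connection=50)
    print("completed per connection:", counts)
    print("stalled workers:", stalled)
```

Since the hang reportedly needs more than one connection, `connections` is the parameter to vary when trying to reproduce it.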

So, am I doing something wrong, or is OpenLDAP...?

H.

Comment 1 Howard Chu 2012-04-03 15:41:35 UTC

hrvoje.habjanic@zg.t-com.hr wrote:
> So, am i doing something wrong or openldap is...?

Looks like your glibc malloc is deadlocked. A CentOS bug, not an OpenLDAP bug.

In the trace, you could confirm this in gdb with:
	thread 13
	frame 3
	print *mutex

most likely the "owner" field of this mutex will be 1502, which corresponds to 
thread 17, which is waiting for a lock inside libc malloc/free.
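
(Editorial sketch.) On Linux with glibc, the owner stored in a locked mutex is the kernel thread id (LWP) of the holder. In gdb it is usually visible as `mutex->__data.__owner`, and gdb prints the same LWP in each thread header of a `thread apply all bt` dump. Matching owner 1502 to thread 17 can therefore be done mechanically. The parser below is a minimal illustration of that mapping. The sample text is invented in the shape of gdb output and is not taken from hang2.log; only the pairing of LWP 1502 with thread 17 comes from the comment above. All function names and frame contents are hypothetical.

```python
import re

# "Thread 17 (Thread 0x7f... (LWP 1502)):"
THREAD_HEADER = re.compile(
    r"^Thread (\d+) \(Thread 0x[0-9a-f]+ \(LWP (\d+)\)\):")
# "#0  0x... in func (args) ..." or "#0  func (args) at file.c:1"
FRAME = re.compile(r"^#(\d+)\s+(?:0x[0-9a-f]+ in )?([^\s(]+) \(")


def parse_backtraces(text):
    """Return {gdb_thread_number: {"lwp": int, "frames": [function names]}}."""
    threads = {}
    current = None
    for line in text.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            current = int(header.group(1))
            threads[current] = {"lwp": int(header.group(2)), "frames": []}
            continue
        frame = FRAME.match(line)
        if frame and current is not None:
            threads[current]["frames"].append(frame.group(2))
    return threads


def thread_for_lwp(threads, lwp):
    """Map a mutex owner (an LWP id) to the gdb thread number, or None."""
    for number, info in threads.items():
        if info["lwp"] == lwp:
            return number
    return None


def lock_waiters(threads):
    """gdb thread numbers whose innermost frame is a low-level lock wait."""
    return sorted(n for n, info in threads.items()
                  if info["frames"]
                  and info["frames"][0].startswith("__lll_lock_wait"))


# Invented sample; addresses, frames, and LWP 1498 are placeholders.
SAMPLE = """\
Thread 17 (Thread 0x7f0000001700 (LWP 1502)):
#0  0x0000000000000001 in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x0000000000000002 in free () from /lib64/libc.so.6

Thread 13 (Thread 0x7f0000002700 (LWP 1498)):
#0  0x0000000000000003 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000000000000004 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x0000000000000005 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000000006 in some_locking_function (mutex=0x1234560) at example.c:42
"""

if __name__ == "__main__":
    threads = parse_backtraces(SAMPLE)
    print("owner 1502 is gdb thread", thread_for_lwp(threads, 1502))
    print("threads waiting on a lock:", lock_waiters(threads))
```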

You may be able to avoid this bug by using an alternate malloc library, such 
as Google tcmalloc.
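
(Editorial sketch.) One common way to try an alternate allocator without relinking is to preload it through the dynamic loader's `LD_PRELOAD` variable. The thread does not say whether the reporter preloaded tcmalloc or linked against it, so the helper below is only an illustration. `preload_env` is a hypothetical name, and the library and slapd paths in the comments are assumptions that depend on the installation.

```python
import os


def preload_env(library_path, base_env=None):
    """Return a copy of the environment with library_path first in LD_PRELOAD.

    The caller's mapping is not modified. Entries already present in
    LD_PRELOAD are kept after the new one, separated by a colon.
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    env["LD_PRELOAD"] = ":".join(p for p in (library_path, existing) if p)
    return env


if __name__ == "__main__":
    # Assumed path; check where the tcmalloc package installs its library.
    env = preload_env("/usr/lib64/libtcmalloc_minimal.so.4")
    print(env["LD_PRELOAD"])
    # A launcher would then pass env to the server process, for example:
    #   subprocess.Popen(["/usr/local/openldap/libexec/slapd"], env=env)
    # (the slapd path is assumed from the --prefix used in this report).
```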

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Comment 2 hrvoje.habjanic@zg.t-com.hr 2012-04-04 07:52:21 UTC

On 03.04.2012 17:41, Howard Chu wrote:
>>
>> So, am i doing something wrong or openldap is...?
>
>
> Looks like your glibc malloc is deadlocked. A Centos bug, not an
> OpenLDAP bug.
>
> In the trace, you could confirm this in gdb with:
>     thread 13
>     frame 3
>     print *mutex
>
> most likely the "owner" field of this mutex will be 1502, which
> corresponds to thread 17, which is waiting for a lock inside libc
> malloc/free.
>
> You may be able to avoid this bug by using an alternate malloc
> library, such as Google tcmalloc.
>

Hi.

Thanks for the inside info ... :-)

And sorry that I was unable to provide more info - the core dump alone is
16 GB! Also, a small side note - when this "hang" happens, it only affects
existing connections - I'm attacking it with two processes, each with 4
connections. New searches using ldapsearch work fine ...

And a correction for a typo - for "high load" I wrote (1-2 req/sec) -
it should actually read 1-2k req/sec ...

What is interesting is that this "problem" goes back to
db-4.7 and openldap-2.4.23 (provided with CentOS) ...

I'll try an alternate malloc and report back ...

H.

Comment 3 hrvoje.habjanic@zg.t-com.hr 2012-04-04 15:45:30 UTC

On 03.04.2012 17:41, Howard Chu wrote:
>
> You may be able to avoid this bug by using an alternate malloc
> library, such as Google tcmalloc.
>

Hi.

I did try using tcmalloc. This time, I got a SIGSEGV. The odd thing is
that it happened in "pthread_mutex_lock", which is in libpthread.so ...?

Another bug in the CentOS libs? I would appreciate it if you could take a look.

Thx.

H.

p.s. url -> http://free-zg.t-com.hr/HrvojeHabjanic/hang3.log

Comment 4 hrvoje.habjanic@zg.t-com.hr 2012-04-08 11:25:39 UTC

On 04.04.2012 17:45, Hrvoje Habjanić wrote:
> On 03.04.2012 17:41, Howard Chu wrote:
>> You may be able to avoid this bug by using an alternate malloc
>> library, such as Google tcmalloc.
>>
> Hi.
>
> I did try - using tcmalloc. And this time, i got SIGSEGV. Odd thing is
> that this happened in "pthread_mutex_lock" which is in libpthread.so ...?
>
> Another bug in centos libs? I would appreciate if you could take a look.
>
> Thx.
>
> H.
>
> p.s. url -> http://free-zg.t-com.hr/HrvojeHabjanic/hang3.log

Hi.

And one more SIGSEGV ...

Should I open a new ITS?

H.

p.s. http://free-zg.t-com.hr/HrvojeHabjanic/openldap/ssegv.log

Comment 5 hrvoje.habjanic@zg.t-com.hr 2012-04-09 18:09:46 UTC

Hi.

Two more "hang"s, both in sched_yield(). This is with replacement malloc
(tcmalloc, minimal).

http://free-zg.t-com.hr/HrvojeHabjanic/openldap/hang4.log
http://free-zg.t-com.hr/HrvojeHabjanic/openldap/hang5.log

H.

Comment 6 hrvoje.habjanic@zg.t-com.hr 2012-04-11 08:29:57 UTC

Hi.

The attached patch solves my problem with the "hang" in sched_yield.

In general, I do think that there is a lot of unnecessary locking and
waiting there (in cache management) ... And simplifying things there would
solve a lot of problems ... Probably. :-)

Of course, I'm not sure how this change will influence the rest of the
code, but it does work for me (tm).

H.

p.s. Also available at
http://free-zg.t-com.hr/HrvojeHabjanic/openldap/ol.diff

Comment 7 Howard Chu 2012-05-31 17:49:33 UTC

changed notes
changed state Open to Test
moved from Incoming to Software Bugs

Comment 8 Quanah Gibson-Mount 2012-05-31 18:10:49 UTC

changed notes
changed state Test to Release

Comment 9 Quanah Gibson-Mount 2012-08-17 01:36:22 UTC

changed notes
changed state Release to Closed

Comment 10 OpenLDAP project 2014-08-01 21:04:42 UTC

applied to master
applied to RE24