Full_Name: Tracy Stenvik
Version: 2.4.16 (+db-4.7.25pl4)
OS: Red Hat Enterprise Linux Server release 5.3 (Tikanga)
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (140.142.111.41)

In evaluating 2.4.16 (to upgrade our 2.3.39-based directory service) we have encountered some serious issues. A simulated production-level load that 2.3.39 (with slurpd running) handles easily causes 2.4.16 (and 2.4.15) to crash within 5 minutes to 24 hours. Without replication active, 2.4.16 survived 24 hours under load; as soon as replication was activated, however, the server crashed within 5 minutes. We've tried several syncrepl variants (as described below), but none appear stable.

In analyzing the various crash dumps, most of the trouble appears to lie in the caching code. It has a nasty habit of returning nulls when nulls are not expected.

Our simulated load consists of an arbitrary mix of adds/modifies/searches to various objects in the directory. This load is defined as "N" instances of the "Stressor" test, as discussed on the "GDS Stress Test Findings" page at:

http://wiki.cac.washington.edu/display/infra/GDS+Stress+Test+Findings

Following, you will find a series of crash dumps, along with a few comments. Also included are sections of the various slapd.confs we were testing with. Please contact me if you require additional information.
delta syncrepl crash
--------------------

(gdb) where
#0  0xb7b7abab in strncmp () from /lib/libc.so.6
#1  0x08143ea4 in bdb_rdn_cmp (v_e1=0xf069be8, v_e2=0xf73cb48) at cache.c:301
#2  0x081d00ee in avl_insert (root=0x8622088, data=0xf069be8, fcmp=0x8143e55 <bdb_rdn_cmp>, fdup=0x81d10a0 <avl_dup_error>) at avl.c:95
#3  0x08144692 in hdb_cache_find_parent (op=0xf95ea78, txn=0xe7def98, id=343637, res=0x14b3ade0) at cache.c:593
#4  0x08144dc9 in hdb_cache_find_id (op=0xf95ea78, tid=0xe7def98, id=343637, eip=0x14b3ade0, flag=0, lock=0x14b3adc4) at cache.c:905
#5  0x0810b6a0 in hdb_search (op=0xf95ea78, rs=0x14bfc0e8) at search.c:705
#6  0x080e9b02 in overlay_op_walk (op=0xf95ea78, rs=0x14bfc0e8, which=op_search, oi=0x85aeed8, on=0x0) at backover.c:669
#7  0x080e9cb7 in over_op_func (op=0xf95ea78, rs=0x14bfc0e8, which=op_search) at backover.c:721
#8  0x080e9d3b in over_op_search (op=0xf95ea78, rs=0x14bfc0e8) at backover.c:743
#9  0x0806f0b3 in fe_op_search (op=0xf95ea78, rs=0x14bfc0e8) at search.c:366
#10 0x0806ea53 in do_search (op=0xf95ea78, rs=0x14bfc0e8) at search.c:217
#11 0x0806b959 in connection_operation (ctx=0x14bfc1d0, arg_v=0xf95ea78) at connection.c:1097
#12 0x0806be33 in connection_read_thread (ctx=0x14bfc1d0, argv=0x23) at connection.c:1223
#13 0x081d74a1 in ldap_int_thread_pool_wrapper (xpool=0x8561178) at tpool.c:663
#14 0xb7e4d49b in start_thread () from /lib/libpthread.so.0
#15 0xb7bdc42e in clone () from /lib/libc.so.6

(gdb) p *(EntryInfo *)v_e1
$2 = {bei_parent = 0x8622058, bei_id = 343637, bei_lockpad = 0, bei_state = 132, bei_finders = 0,
  bei_nrdn = {bv_len = 45, bv_val = 0xf7af1f8 "serialNumber=40442595174b5ebf53167252b5cbcc34"},
  bei_rdn = {bv_len = 45, bv_val = 0xf36d700 "serialNumber=40442595174B5EBF53167252B5CBCC34"},
  bei_modrdns = 0, bei_ckids = 0, bei_dkids = 0, bei_e = 0x0, bei_kids = 0x0,
  bei_kids_mutex = {__data = {__lock = 1, __count = 0, __owner = 32550, __kind = 0, __nusers = 1, {__spins = 0, __list = {__next = 0x0}}},
    __size = "\001\000\000\000\000\000\000\000&\177\000\000\000\000\000\000\001\000\000\000\000\000\000", __align = 1},
  bei_lrunext = 0x8978ad0, bei_lruprev = 0x0}

(gdb) p *(EntryInfo *)v_e2
$3 = {bei_parent = 0x0, bei_id = 410257, bei_lockpad = 0, bei_state = 0, bei_finders = 0,
  bei_nrdn = {bv_len = 45, bv_val = 0x0}, bei_rdn = {bv_len = 45, bv_val = 0x0},
  bei_modrdns = 0, bei_ckids = 0, bei_dkids = 0, bei_e = 0x0, bei_kids = 0x0,
  bei_kids_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __kind = 0, __nusers = 0, {__spins = 0, __list = {__next = 0x0}}},
    __size = '\0' <repeats 23 times>, __align = 0},
  bei_lrunext = 0x8978ad0, bei_lruprev = 0x0}

In bdb_rdn_cmp() you're comparing e1->bei_nrdn.bv_val to e2->bei_nrdn.bv_val, but the latter is null even though e2->bei_nrdn.bv_len is not zero.

standard syncrepl crash
-----------------------

(gdb) where
#0  0xb7d9f300 in pthread_mutex_lock () from /lib/libpthread.so.0
#1  0x081d8645 in ldap_pvt_thread_mutex_lock (mutex=0x34) at thr_posix.c:296
#2  0x08145bc0 in bdb_cache_delete_internal (cache=0xa25cc04, e=0x147e6d38, decr=0) at cache.c:1372
#3  0x08144b08 in hdb_cache_lru_purge (bdb=0xa25cbc8) at cache.c:776
#4  0x081452d1 in hdb_cache_find_id (op=0x11923ca8, tid=0x1316aa48, id=280869, eip=0x2bffcbfc, flag=0, lock=0x2bffcc54) at cache.c:1053
#5  0x0814b2f7 in hdb_dn2entry (op=0x11923ca8, tid=0x1316aa48, dn=0xa25e3bc, e=0x2bffcc64, matched=0, lock=0x2bffcc54) at dn2entry.c:67
#6  0x08151845 in hdb_entry_get (op=0x11923ca8, ndn=0xa25e3bc, oc=0xa2549f0, at=0xa23ebb0, rw=0, ent=0x2bffceac) at id2entry.c:354
#7  0x080e936d in overlay_entry_get_ov (op=0x11923ca8, dn=0xa25e3bc, oc=0xa2549f0, ad=0xa23ebb0, rw=0, e=0x2bffceac, on=0x0) at backover.c:364
#8  0x080e941f in over_entry_get_rw (op=0x11923ca8, dn=0xa25e3bc, oc=0xa2549f0, ad=0xa23ebb0, rw=0, e=0x2bffceac) at backover.c:396
#9  0x0807cfc3 in be_entry_get_rw (op=0x11923ca8, ndn=0xa25e3bc, oc=0xa2549f0, at=0xa23ebb0, rw=0, e=0x2bffceac) at backend.c:1346
#10 0x0807d1a2 in fe_acl_group (op=0x11923ca8, target=0x0, gr_ndn=0xa25e3bc, op_ndn=0x11923d44, group_oc=0xa2549f0, group_at=0xa23ebb0) at backend.c:1402
#11 0x0807d8c3 in backend_group (op=0x11923ca8, target=0x0, gr_ndn=0xa25e3bc, op_ndn=0x11923d44, group_oc=0xa2549f0, group_at=0xa23ebb0) at backend.c:1597
#12 0x080d32e2 in limits_get (op=0x11923ca8, limit=0x11923cdc) at limits.c:117
#13 0x080d588d in limits_check (op=0x11923ca8, rs=0x2bffe0e8) at limits.c:1106
#14 0x0806f095 in fe_op_search (op=0x11923ca8, rs=0x2bffe0e8) at search.c:364
#15 0x0806ea53 in do_search (op=0x11923ca8, rs=0x2bffe0e8) at search.c:217
#16 0x0806b959 in connection_operation (ctx=0x2bffe1d0, arg_v=0x11923ca8) at connection.c:1097
#17 0x0806be33 in connection_read_thread (ctx=0x2bffe1d0, argv=0x1d) at connection.c:1223
#18 0x081d74a1 in ldap_int_thread_pool_wrapper (xpool=0xa211178) at tpool.c:663
#19 0xb7d9d49b in start_thread () from /lib/libpthread.so.0
#20 0xb7b2c42e in clone () from /lib/libc.so.6

(gdb) p *e->bei_parent
Cannot access memory at address 0x0

In bdb_cache_delete_internal(), e->bei_parent is null.
2-way multimaster syncrepl crash
--------------------------------

(gdb) where
#0  0x08149c1a in hdb_delete (op=0xc01caa8, rs=0x2ecfc0e8) at delete.c:217
#1  0x080e9b02 in overlay_op_walk (op=0xc01caa8, rs=0x2ecfc0e8, which=op_delete, oi=0x88b4550, on=0x0) at backover.c:669
#2  0x080e9cb7 in over_op_func (op=0xc01caa8, rs=0x2ecfc0e8, which=op_delete) at backover.c:721
#3  0x080e9de5 in over_op_delete (op=0xc01caa8, rs=0x2ecfc0e8) at backover.c:773
#4  0x0808ae88 in fe_op_delete (op=0xc01caa8, rs=0x2ecfc0e8) at delete.c:174
#5  0x0808abdb in do_delete (op=0xc01caa8, rs=0x2ecfc0e8) at delete.c:95
#6  0x0806b959 in connection_operation (ctx=0x2ecfc1d0, arg_v=0xc01caa8) at connection.c:1097
#7  0x0806be33 in connection_read_thread (ctx=0x2ecfc1d0, argv=0x22) at connection.c:1223
#8  0x081d74a1 in ldap_int_thread_pool_wrapper (xpool=0x8867178) at tpool.c:663
#9  0xb7e4749b in start_thread () from /lib/libpthread.so.0
#10 0xb7bd642e in clone () from /lib/libc.so.6

(gdb) p eip
$1 = (EntryInfo *) 0x0

In hdb_delete(), eip is null.
2-way multimaster syncrepl crash
--------------------------------

(gdb) where
#0  0x080956db in slap_acl_get (a=0x8b9ba80, count=0x37f5d824, op=0x8c5d948, e=0x37f5dad0, desc=0x8b9b9f0, val=0x0, matches=0x37f5d1c4, state=0x37f5d97c) at acl.c:599
#1  0x080949b9 in slap_access_allowed (op=0x8c5d948, e=0x37f5dad0, desc=0x8b9b9f0, val=0x0, access=ACL_WDEL, state=0x37f5d97c, maskp=0x37f5d8e8) at acl.c:236
#2  0x08094cd5 in fe_access_allowed (op=0x8c5d948, e=0x37f5dad0, desc=0x8b9b9f0, val=0x0, access=ACL_WDEL, state=0x37f5d97c, maskp=0x37f5d8e8) at acl.c:360
#3  0x0809503c in access_allowed_mask (op=0x8c5d948, e=0x37f5dad0, desc=0x8b9b9f0, val=0x0, access=ACL_WDEL, state=0x37f5d97c, maskp=0x0) at acl.c:462
#4  0x08098b8f in acl_check_modlist (op=0x8c5d948, e=0x37f5dad0, mlist=0x34afa7c8) at acl.c:1986
#5  0x0810373a in hdb_modify_internal (op=0x8c5d948, tid=0x961aa10, modlist=0x34afa7c8, e=0x37f5dad0, text=0x37f5f0fc, textbuf=0x37f5db18 "", textlen=256) at modify.c:88
#6  0x08105292 in hdb_modify (op=0x8c5d948, rs=0x37f5f0e8) at modify.c:590
#7  0x080e9b02 in overlay_op_walk (op=0x8c5d948, rs=0x37f5f0e8, which=op_modify, oi=0x8bbb550, on=0x0) at backover.c:669
#8  0x080e9cb7 in over_op_func (op=0x8c5d948, rs=0x37f5f0e8, which=op_modify) at backover.c:721
#9  0x080e9d7f in over_op_modify (op=0x8c5d948, rs=0x37f5f0e8) at backover.c:755
#10 0x08088a89 in fe_op_modify (op=0x8c5d948, rs=0x37f5f0e8) at modify.c:301
#11 0x080884c8 in do_modify (op=0x8c5d948, rs=0x37f5f0e8) at modify.c:175
#12 0x0806b959 in connection_operation (ctx=0x37f5f1d0, arg_v=0x8c5d948) at connection.c:1097
#13 0x0806be33 in connection_read_thread (ctx=0x37f5f1d0, argv=0x22) at connection.c:1223
#14 0x081d74a1 in ldap_int_thread_pool_wrapper (xpool=0x8b6e178) at tpool.c:663
#15 0xb7e6749b in start_thread () from /lib/libpthread.so.0
#16 0xb7bf642e in clone () from /lib/libc.so.6

(gdb) p e->e_nname
$3 = {bv_len = 77, bv_val = 0x0}

In slap_acl_get(), e->e_nname.bv_val (e->e_ndn) is null.
delta syncrepl slapd.conf
-------------------------

serverID 1 ldap://rufus01.u.washington.edu:290
serverID 2 ldap://rufus04.u.washington.edu:290
gentlehup on
idletimeout 600
sizelimit 20000
sortvals member

# ===== Replication db =====
database hdb
suffix "cn=log"
directory /data/au20aa_log
rootdn "cn=Manager,dc=washington,dc=edu"
#index objectClass eq
#index entryUUID,entryCSN eq
index default eq
index entryCSN,objectClass,reqEnd,reqResult,reqStart
overlay syncprov
syncprov-reloadhint true
syncprov-nopresent true

# ===== Primary db =====
database hdb
suffix "dc=washington,dc=edu"
rootdn "cn=Manager,dc=washington,dc=edu"
include /usr/local/openldap-2.4.16/etc/openldap/private/slapd-pw.conf
directory /data/au20aa
index entryCSN eq
index entryUUID eq
checkpoint 10000 15
cachesize 90000
idlcachesize 180000
overlay syncprov
#syncprov-sessionlog 100
syncprov-checkpoint 1000 60
# accesslog overlay definitions for primary database
overlay accesslog
logdb cn=log
logops writes
logsuccess true
# scan the accesslog DB every day, and purge entries older than 7 days
logpurge 07+00:00 01+00:00

standard syncrepl slapd.conf
----------------------------

serverID 1 ldap://rufus01.u.washington.edu:290
serverID 2 ldap://rufus04.u.washington.edu:290
gentlehup on
idletimeout 600
sizelimit 20000
sortvals member

# ===== Primary db =====
database hdb
suffix "dc=washington,dc=edu"
rootdn "cn=Manager,dc=washington,dc=edu"
include /usr/local/openldap-2.4.16/etc/openldap/private/slapd-pw.conf
directory /data/au20aa
index entryCSN eq
index entryUUID eq
checkpoint 10000 15
cachesize 90000
idlcachesize 180000
overlay syncprov
#syncprov-sessionlog 100
syncprov-checkpoint 1000 60

2-way multimaster syncrepl slapd.conf
-------------------------------------

serverID 1 ldap://rufus01.u.washington.edu:290
serverID 2 ldap://rufus04.u.washington.edu:290
gentlehup on
idletimeout 600
sizelimit 20000
sortvals member

# ===== Primary db =====
database hdb
suffix "dc=washington,dc=edu"
rootdn "cn=Manager,dc=washington,dc=edu"
include /usr/local/openldap-2.4.16/etc/openldap/private/slapd-pw.conf
directory /data/au20aa
index entryCSN eq
index entryUUID eq
checkpoint 10000 15
cachesize 90000
idlcachesize 180000
overlay syncprov
#syncprov-sessionlog 100
syncprov-checkpoint 1000 60

###
### ----------- Replication
###
syncrepl rid=101
  provider=ldap://rufus01.u.washington.edu:290
  searchbase="dc=washington,dc=edu"
  bindmethod=sasl
  saslmech=external
  starttls=yes
  tls_cert=/usr/local/openldap-2.4.16/etc/openldap/rufus04.crt
  tls_key=/usr/local/openldap-2.4.16/etc/openldap/private/rufus04.key
  filter="(objectClass=*)"
  attrs="*,+"
  schemachecking=off
  scope=sub
  type=refreshAndPersist
  retry="60 +"
syncrepl rid=102
  provider=ldap://rufus04.u.washington.edu:290
  searchbase="dc=washington,dc=edu"
  bindmethod=sasl
  saslmech=external
  starttls=yes
  tls_cert=/usr/local/openldap-2.4.16/etc/openldap/rufus04.crt
  tls_key=/usr/local/openldap-2.4.16/etc/openldap/private/rufus04.key
  filter="(objectClass=*)"
  attrs="*,+"
  schemachecking=off
  scope=sub
  type=refreshAndPersist
  retry="60 +"
mirrormode yes
--On Wednesday, July 08, 2009 4:44 PM +0000 imf@u.washington.edu wrote:

> Full_Name: Tracy Stenvik
> Version: 2.4.16 (+db-4.7.25pl4)
> OS: Red Hat Enterprise Linux Server release 5.3 (Tikanga)
> URL: ftp://ftp.openldap.org/incoming/
> Submission from: (NULL) (140.142.111.41)
>
> In evaluating 2.4.16 (to upgrade our 2.3.39-based directory service) we
> have encountered some serious issues. A simulated production-level load
> that 2.3.39 (with slurpd running) handles easily causes 2.4.16 (and
> 2.4.15) to crash within 5 minutes to 24 hours.

Please re-test with current RE24 from CVS, as numerous fixes have been made and 2.4.17 is imminent.

Thanks!

--Quanah

--
Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc
--------------------
Zimbra :: the leader in open source messaging and collaboration
The changes made in 2.4.17 seem to have fixed the crashes in the caching module. Thanks for that.

We are still able to crash 2.4.17, however. It only happens after a heavy load has been placed on the producer for >24 hours continuously. Unfortunately, we've not been able to get good tracebacks. They all look like this:

(gdb) where
#0  0x00869410 in __kernel_vsyscall ()
#1  0x00390d80 in raise () from /lib/libc.so.6
#2  0x00392691 in abort () from /lib/libc.so.6
#3  0x0038a1fb in __assert_fail () from /lib/libc.so.6
#4  0x0808d532 in malloc ()
#5  0x0822c93f in ?? ()
#6  0x0822c933 in ?? ()
#7  0x00000039 in ?? ()
#8  0x0822c908 in ?? ()
#9  0x00000000 in ?? ()

The producer slowly grows its memory footprint. I can't tell if it's from just normal operations or memory leaks; I suspect it's a little of both. The end result, as you can see from the core above, is that there's likely some corrupted (or unfreed) memory somewhere. Sorry I can't nail it down further.

The load profile that we placed on the server is documented in my prior report. See above.

---
Tracy Stenvik
University Computing Services 354843.
University of Washington
email: imf@u.washington.edu
voice: (206) 685-3344
imf@u.washington.edu wrote:
> The changes made to 2.4.17 seem to have fixed the crashes in the caching
> module. Thanks for that.
>
> We still are able to crash 2.4.17, however. It only happens after a heavy
> load is placed on the producer for >24 hours continuous. Unfortunately,
> we've not been able to get good tracebacks. They all look like this,
>
> (gdb) where
> #0  0x00869410 in __kernel_vsyscall ()
> #1  0x00390d80 in raise () from /lib/libc.so.6
> #2  0x00392691 in abort () from /lib/libc.so.6
> #3  0x0038a1fb in __assert_fail () from /lib/libc.so.6
> #4  0x0808d532 in malloc ()
> #5  0x0822c93f in ?? ()
> #6  0x0822c933 in ?? ()
> #7  0x00000039 in ?? ()
> #8  0x0822c908 in ?? ()
> #9  0x00000000 in ?? ()
>
> The producer slowly grows its memory footprint. I can't tell if it's from
> just normal operations or memory leaks. I suspect it's a little of both.
> The end result, as you can see from the core above, is that there's likely
> some corrupted (or unfreed) memory somewhere. Sorry I can't nail it down
> further.

There's not enough information here.

An assert is always accompanied by an error message on stderr; we need to see the actual error message.

There are no symbols in the stack trace pertaining to slapd itself. You seem to be running a stripped binary. Please provide a trace using a non-stripped binary that was compiled with -g (debug symbols enabled).

> The load profile that we placed on the server is documented in my prior
> report. See above.

--
-- Howard Chu
  CTO, Symas Corp.               http://www.symas.com
  Director, Highland Sun         http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP      http://www.openldap.org/project/
changed notes changed state Open to Suspended
Sorry about the stripped trace. I forgot that the install procedure always strips the binaries...

Okay, with our stress profile it takes ~36 hours to fail. I always start with a clean db rebuild before each run. Each failure produces the same traceback:

(gdb) where
#0  0x00b97410 in __kernel_vsyscall ()
#1  0x00471d80 in raise () from /lib/libc.so.6
#2  0x00473691 in abort () from /lib/libc.so.6
#3  0x0046b1fb in __assert_fail () from /lib/libc.so.6
#4  0x0808d532 in ch_malloc (size=4436335) at ch_malloc.c:57
#5  0x08079ad2 in entry_encode (e=0x3a3dac0, bv=0x3a3d9b0) at entry.c:742
#6  0x0815240e in bdb_id2entry_put (be=0x3a3dca0, tid=0xbc6f7378, e=0x3a3dac0, flag=0) at id2entry.c:54
#7  0x08152508 in hdb_id2entry_update (be=0x3a3dca0, tid=0xbc6f7378, e=0x3a3dac0) at id2entry.c:90
#8  0x08106374 in hdb_modify (op=0xdabbc28, rs=0x3a3f0e4) at modify.c:611
#9  0x080ea38e in overlay_op_walk (op=0xdabbc28, rs=0x3a3f0e4, which=op_modify, oi=0x8be2788, on=0x0) at backover.c:669
#10 0x080ea543 in over_op_func (op=0xdabbc28, rs=0x3a3f0e4, which=op_modify) at backover.c:721
#11 0x080ea60b in over_op_modify (op=0xdabbc28, rs=0x3a3f0e4) at backover.c:755
#12 0x08089151 in fe_op_modify (op=0xdabbc28, rs=0x3a3f0e4) at modify.c:301
#13 0x08088b90 in do_modify (op=0xdabbc28, rs=0x3a3f0e4) at modify.c:175
#14 0x0806be8f in connection_operation (ctx=0x3a3f1d0, arg_v=0xdabbc28) at connection.c:1115
#15 0x0806c3cf in connection_read_thread (ctx=0x3a3f1d0, argv=0x1a) at connection.c:1251
#16 0x081d8fa9 in ldap_int_thread_pool_wrapper (xpool=0x8b941b0) at tpool.c:685
#17 0x0043749b in start_thread () from /lib/libpthread.so.0
#18 0x0051a42e in clone () from /lib/libc.so.6

---
Tracy Stenvik
University Computing Services 354843.
University of Washington
email: imf@u.washington.edu
voice: (206) 685-3344

On Thu, 23 Jul 2009, Howard Chu wrote:

> imf@u.washington.edu wrote:
>> The changes made to 2.4.17 seem to have fixed the crashes in the caching
>> module. Thanks for that.
>>
>> We still are able to crash 2.4.17, however. It only happens after a heavy
>> load is placed on the producer for >24 hours continuous. Unfortunately,
>> we've not been able to get good tracebacks.
>
> There's not enough information here.
>
> An assert is always accompanied by an error message on stderr; we need to
> see the actual error message.
>
> There are no symbols in the stack trace pertaining to slapd itself. You
> seem to be running a stripped binary. Please provide a trace using a
> non-stripped binary that was compiled with -g (debug symbols enabled).
> Sorry about the stripped trace. I forgot that the install procedure
> always strips the binaries...
>
> Okay, with our stress profile it takes ~36 hours to fail. I always
> start with a clean db rebuild before each run. Each failure produces
> the same traceback:
>
> (gdb) where
> #0  0x00b97410 in __kernel_vsyscall ()
> #1  0x00471d80 in raise () from /lib/libc.so.6
> #2  0x00473691 in abort () from /lib/libc.so.6
> #3  0x0046b1fb in __assert_fail () from /lib/libc.so.6
> #4  0x0808d532 in ch_malloc (size=4436335) at ch_malloc.c:57

^^^ this really looks like memory exhaustion while trying to malloc a large chunk (>4MB). Can you tell, by printing e->e_name, whether it's correct that the server was modifying a large entry?

p.
On Tue, 18 Aug 2009, masarati@aero.polimi.it wrote:

>> (gdb) where
>> #0  0x00b97410 in __kernel_vsyscall ()
>> #1  0x00471d80 in raise () from /lib/libc.so.6
>> #2  0x00473691 in abort () from /lib/libc.so.6
>> #3  0x0046b1fb in __assert_fail () from /lib/libc.so.6
>> #4  0x0808d532 in ch_malloc (size=4436335) at ch_malloc.c:57
>
> ^^^ this really looks like memory exhaustion while trying to malloc a
> large chunk (>4MB). Can you tell, by printing e->e_name, whether it's
> correct that the server was modifying a large entry?

int entry_encode(Entry *e, struct berval *bv)
{
        ber_len_t len, dnlen, ndnlen, i;
        int nattrs, nvals;
        Attribute *a;
        unsigned char *ptr;

        Debug( LDAP_DEBUG_TRACE, "=> entry_encode(0x%08lx): %s\n",
                (long) e->e_id, e->e_dn, 0 );

        dnlen = e->e_name.bv_len;
        ndnlen = e->e_nname.bv_len;

        entry_partsize( e, &len, &nattrs, &nvals, 1 );

        bv->bv_len = len;
        bv->bv_val = ch_malloc(len);

(gdb) p *e
$8 = {e_id = 343637, e_name = {bv_len = 0, bv_val = 0x822169c ""}, e_nname = {bv_len = 0, bv_val = 0x822169c ""},
  e_attrs = 0x208d270c, e_ocflags = 256, e_bv = {bv_len = 6598587, bv_val = 0x1ddb5008 "\021"}, e_private = 0x9ad5ce0}
(gdb) p *bv
$2 = {bv_len = 4436293, bv_val = 0x1 <Address 0x1 out of bounds>}
(gdb) p nattrs
$6 = 16
(gdb) p nvals
$7 = 270294

Well, according to e->e_name, e_dn (e_name.bv_val) is empty, so this is likely bogus. We do have some large member entries (>100K) in our test profile; however, none add up to anywhere near >4M.

Tracy
What does this look like with top and/or in dmesg over the run time? Is it a simple out-of-memory? Definitely a bit of a gross method, but what does 'ls -lh core' show for size?

(If so, is it warranted given your load, or is there a leak, etc. etc. ... and of course make sure you're up to date on the surrounding packages; OpenLDAP isn't the only thing that can leak.)
On Mon, 24 Aug 2009, Aaron Richton wrote:

> What does this look like with top and/or in dmesg over the run time? Is
> it a simple out-of-memory? Definitely a bit of a gross method, but what
> does 'ls -lh core' show for size?

I haven't been able to monitor it constantly over its lifespan, as it takes anywhere from 24 to 36 hours of our load before failure, but what I have observed is that slapd continually grows (slowly) over the course of the test until it can't malloc any more. Core sizes run around 640 MB. The largest '__db' element is 1.9 GB.

So, slapd behaves as if there's a slow memory leak somewhere, unless the delta syncrepl mechanism (on the producer) has an insatiable memory appetite. Our test suite is heavy on updates, including mods to large-member groups (with 120K+ members). The consumer slapd never dies, and more or less keeps up with the change load from the producer.

> (If so, is it warranted given your load or is there a leak, etc. etc. ...
> and of course make sure you're up to date on the surrounding packages;
> OpenLDAP isn't the only thing that can leak.)

openldap-2.4.17
db-4.7.25 (patch level 4)
openssl 0.9.8e
Red Hat Enterprise Linux Server release 5.4

Tracy
We are trying out the 2.4.18 release with a syncrepl Multi-Master setup and are experiencing similar problems as mentioned above.

It will shut down with a segfault after heavy processing/updating of about 3000-17000 entries. We can reproduce this in about 10-20 minutes of heavy load. However, in this short execution time we can't find any signs of memory leakage.

Unfortunately I can't get any good tracebacks from gdb. I've tried compiling OpenLDAP with 'export CPPFLAGS="-D_GNU_SOURCE -g"', and also with --enable-debug on ./configure. Any pointers?

This is the traceback I get, but I guess it isn't very helpful without the complete debugging information:

(gdb) where
#0  0x00007fc7e24fb8eb in syncprov_op_mod (op=0x7fc7aedb37d0, rs=<value optimized out>) at syncprov.c:1970
#1  0x000000000049531a in overlay_op_walk ()
#2  0x0000000000495e7a in ?? ()
#3  0x000000000044a2e2 in fe_op_modify ()
#4  0x000000000044ac67 in do_modify ()
#5  0x00000000004321bf in ?? ()
#6  0x0000000000432e6c in ?? ()
#7  0x00000000004dd960 in ?? ()
#8  0x00007fc7e2e9b3ea in start_thread () from /lib/libpthread.so.0
#9  0x00007fc7e29f3cbd in clone () from /lib/libc.so.6
#10 0x0000000000000000 in ?? ()
(gdb)

Could these problems be related? And is there something we can do to help resolve it?
Regards,
Joacim Breiler

Configuration:

dn: olcDatabase={1}hdb,cn=config
objectClass: olcDatabaseConfig
objectClass: olcHdbConfig
olcDatabase: {1}hdb
olcDbDirectory: /srv/ldap
olcSuffix: dc=hgo,dc=se
olcAccess: {0}to attrs=userPassword,shadowLastChange by dn="cn=admin,dc=hgo,dc=se" write by anonymous auth by self write by * none
olcLastMod: TRUE
olcRootDN: cn=admin,dc=hgo,dc=se
olcRootPW: xxxxxxxx
olcSizeLimit: -1
olcSyncrepl: {0}rid=004 provider=ldap://diablo binddn="cn=admin,dc=hgo,dc=se" bindmethod=simple credentials="xxxxxxx" searchbase="dc=hgo,dc=se" type=refreshAndPersist retry="5 5 300 5" timeout=1
olcSyncrepl: {1}rid=003 provider=ldap://mephisto binddn="cn=admin,dc=hgo,dc=se" bindmethod=simple credentials="xxxxxxx" searchbase="dc=hgo,dc=se" type=refreshAndPersist retry="5 5 300 5" timeout=1
olcMirrorMode: TRUE
olcDbCheckpoint: 2048 5
olcDbIDLcacheSize: 90000
olcDbIndex: objectClass,entryCSN,entryUUID,cn eq
olcDbCacheSize: 10000
olcDbConfig: {0}set_cachesize 1 0 0
olcDbConfig: {1}set_lk_max_objects 32000
olcDbConfig: {2}set_lk_max_locks 32000
olcDbConfig: {3}set_lk_max_lockers 1500
olcDbConfig: {4}set_flags DB_LOG_AUTOREMOVE
olcDbConfig: {5}set_lg_bsize 104857600
olcDbConfig: {6}set_lg_dir /srv/ldap
--On Thursday, September 17, 2009 1:21 PM +0000 joacim.breiler@hgo.se wrote:

> We are trying out the 2.4.18 release with a syncrepl Multi-Master setup
> and are experiencing similar problems as mentioned above.
>
> It will shut down with a segfault after heavy processing/updating of
> about 3000-17000 entries. We can reproduce this in about 10-20 minutes
> of heavy load. However, in this short execution time we can't find any
> signs of memory leakage.
>
> Unfortunately I can't get any good tracebacks from gdb. I've tried
> compiling openldap with:
> export CPPFLAGS="-D_GNU_SOURCE -g". And also --enable-debug with
> ./configure. Any pointers?
>
> This is the traceback I get, but I guess it isn't very helpful without
> the complete debugging information:

When you ran "make install" did you put STRIP=""? Like

make install STRIP=""

otherwise, the slapd binary has all its debugging information stripped out.

--Quanah

--
Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc
--------------------
Zimbra :: the leader in open source messaging and collaboration
We build with:

% CPPFLAGS="-I/usr/local/krb5/include -I/usr/local/include/sasl" \
  CFLAGS="-g" ./configure ...
% make

The CFLAGS environment variable makes sure the '-g' is injected into all generated makefiles. You can add -D_GNU_SOURCE there as well.

We experienced the problems you described with <2.4.17; 2.4.17 fixed many of the segfault causes. We didn't repeat our stress test with multimaster and 2.4.17+, though, since we determined that deltasync was good enough for our purposes. It may very well be that 2.4.17+ would crash in multimaster mode with our stress test. The traceback you provided looks very similar to what we were seeing.

If I have time I could run a test with 2.4.18 and multimaster and see what the results are...

---
Tracy Stenvik
University Computing Services 354843.
University of Washington
email: imf@u.washington.edu
voice: (206) 685-3344

On Thu, 17 Sep 2009, Joacim Breiler wrote:

> We are trying out the 2.4.18 release with a syncrepl Multi-Master setup
> and are experiencing similar problems as mentioned above.
>
> It will shut down with a segfault after heavy processing/updating of
> about 3000-17000 entries. We can reproduce this in about 10-20 minutes
> of heavy load. However, in this short execution time we can't find any
> signs of memory leakage.
>
> Could these problems be related? And is there something we can do to
> help resolve it?
Hello,

I have the same problem with two Ubuntu 9.04 systems running the current
build from HEAD. I created 20000 entries at server1. Then I created another
20000 entries at server1 while reading the old and new entries from server1
and server2.

Here is the backtrace:

slapd: pthread_mutex_lock.c:289: __pthread_mutex_lock: Assertion `(-(e)) != 3 || !robust' failed.

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x923f9b90 (LWP 17860)]
0xb7fe0430 in __kernel_vsyscall ()
(gdb) bt
#0  0xb7fe0430 in __kernel_vsyscall ()
#1  0xb7b756d0 in raise () from /lib/tls/i686/cmov/libc.so.6
#2  0xb7b77098 in abort () from /lib/tls/i686/cmov/libc.so.6
#3  0xb7b6e5ce in __assert_fail () from /lib/tls/i686/cmov/libc.so.6
#4  0xb7e5df29 in pthread_mutex_lock () from /lib/tls/i686/cmov/libpthread.so.0
#5  0x0813fccb in ldap_pvt_thread_mutex_lock (mutex=0x8bc64dc) at thr_posix.c:296
#6  0xb7998e8e in syncprov_op_mod (op=0x8bc3690, rs=0x923f911c) at syncprov.c:1965
#7  0x08105bfd in overlay_op_walk (op=0x8bc3690, rs=0x923f911c, which=op_modify, oi=0x87e9d18, on=0x87eeef0) at backover.c:659
#8  0x08105e9a in over_op_func (op=0x8bc3690, rs=0x923f911c, which=op_modify) at backover.c:721
#9  0x08105fc7 in over_op_modify (op=0x8bc3690, rs=0x923f911c) at backover.c:760
#10 0x080a117e in fe_op_modify (op=0x8bc3690, rs=0x923f911c) at modify.c:301
#11 0x080a0a6d in do_modify (op=0x8bc3690, rs=0x923f911c) at modify.c:175
#12 0x08082a43 in connection_operation (ctx=0x923f9210, arg_v=0x8bc3690) at connection.c:1123
#13 0x08082f8b in connection_read_thread (ctx=0x923f9210, argv=0x12) at connection.c:1259
#14 0x0813eb0e in ldap_int_thread_pool_wrapper (xpool=0x87af8e8) at tpool.c:685
#15 0xb7e5c4ff in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
#16 0xb7c2e49e in clone () from /lib/tls/i686/cmov/libc.so.6

Please let me know if I can provide additional information for you. (But
please tell me how to do so; my gdb knowledge is fairly poor.)

Regards,
Frank Offermanns
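[Editor's note] Since the poster above asks how to collect better information
with gdb, here is a minimal sketch of pulling a full backtrace from every
thread of a core dump in batch mode. The slapd path and core-file name are
assumptions for illustration, not taken from the thread.

```shell
# Hedged sketch: batch-mode gdb against a slapd core dump.
# /usr/local/libexec/slapd and core.17860 are example paths only.
GDB_CMDS='set pagination off
thread apply all bt full'
# The real invocation is commented out because it needs an actual core file:
#   gdb /usr/local/libexec/slapd core.17860 -batch \
#       -ex "set pagination off" \
#       -ex "thread apply all bt full" > slapd-backtrace.txt
# Print the two key gdb commands this recipe relies on:
printf '%s\n' "$GDB_CMDS"
```

"thread apply all bt full" is the important part: it dumps local variables
for every frame of every thread, which is what developers usually ask for.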
Thanks for your fast replies! The 'make install STRIP=""' did the trick.

I ran the server for about a week with no segfaults, then I decided that I
couldn't wait any longer. So I connected to the server using an LDAP browser,
and after about ten minutes it segfaulted. So to reproduce I need one
connection that is reading/updating/creating entries, and one connection that
does an occasional read; otherwise the server will keep on running.

The traceback from gdb gives:

0x00007f96c3fc00ad in syncprov_op_mod (op=0x7f9688000ac0, rs=0x44688c80) at syncprov.c:1970
1970            for ( m2 = mt->mt_mods; m2->mi_next != mi;
(gdb) where
#0  0x00007f96c3fc00ad in syncprov_op_mod (op=0x7f9688000ac0, rs=0x44688c80) at syncprov.c:1970
#1  0x00000000004c438a in overlay_op_walk (op=0x7f9688000ac0, rs=0x44688c80, which=op_modify, oi=0x12293c0, on=0x12295a0) at backover.c:659
#2  0x00000000004c4667 in over_op_func (op=0x7f9688000ac0, rs=0x44688c80, which=op_modify) at backover.c:721
#3  0x00000000004c47ac in over_op_modify (op=0x7f9688000ac0, rs=0x44688c80) at backover.c:760
#4  0x000000000045bb21 in fe_op_modify (op=0x7f9688000ac0, rs=0x44688c80) at modify.c:301
#5  0x000000000045b430 in do_modify (op=0x7f9688000ac0, rs=0x44688c80) at modify.c:175
#6  0x000000000043c3ec in connection_operation (ctx=0x44688de0, arg_v=0x7f9688000ac0) at connection.c:1123
#7  0x000000000043c975 in connection_read_thread (ctx=0x44688de0, argv=0x17) at connection.c:1259
#8  0x0000000000525193 in ldap_int_thread_pool_wrapper (xpool=0x1190290) at tpool.c:685
#9  0x00007f96c49613ea in start_thread () from /lib/libpthread.so.0
#10 0x00007f96c44b9cbd in clone () from /lib/libc.so.6
#11 0x0000000000000000 in ?? ()

/J
joacim.breiler@hgo.se wrote:
> Thanks for your fast replies! The 'make install STRIP=""' did the trick.
>
> I ran the server for about a week with no segfaults, then I decided that
> I couldn't wait for it any longer. So I connected to the server using an
> LDAP browser, and then after about ten minutes it segfaulted. So to
> reproduce I need one connection that is reading/updating/creating
> entries. And one connection that does an occasional read, otherwise the
> server will keep on running.
>
> The traceback from gdb gives:

This trace looks like the problem fixed in ITS#6335. The fix is in CVS HEAD
and RE24 (2.4.20 pre-release).

> 0x00007f96c3fc00ad in syncprov_op_mod (op=0x7f9688000ac0, rs=0x44688c80)
> at syncprov.c:1970
> 1970            for ( m2 = mt->mt_mods; m2->mi_next != mi;
> (gdb) where
> #0  0x00007f96c3fc00ad in syncprov_op_mod (op=0x7f9688000ac0, rs=0x44688c80) at syncprov.c:1970
> #1  0x00000000004c438a in overlay_op_walk (op=0x7f9688000ac0, rs=0x44688c80, which=op_modify, oi=0x12293c0, on=0x12295a0) at backover.c:659
> #2  0x00000000004c4667 in over_op_func (op=0x7f9688000ac0, rs=0x44688c80, which=op_modify) at backover.c:721
> #3  0x00000000004c47ac in over_op_modify (op=0x7f9688000ac0, rs=0x44688c80) at backover.c:760
> #4  0x000000000045bb21 in fe_op_modify (op=0x7f9688000ac0, rs=0x44688c80) at modify.c:301
> #5  0x000000000045b430 in do_modify (op=0x7f9688000ac0, rs=0x44688c80) at modify.c:175
> #6  0x000000000043c3ec in connection_operation (ctx=0x44688de0, arg_v=0x7f9688000ac0) at connection.c:1123
> #7  0x000000000043c975 in connection_read_thread (ctx=0x44688de0, argv=0x17) at connection.c:1259
> #8  0x0000000000525193 in ldap_int_thread_pool_wrapper (xpool=0x1190290) at tpool.c:685
> #9  0x00007f96c49613ea in start_thread () from /lib/libpthread.so.0
> #10 0x00007f96c44b9cbd in clone () from /lib/libc.so.6
> #11 0x0000000000000000 in ?? ()
>
> /J

--
-- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/
changed notes changed state Suspended to Feedback
changed notes changed state Feedback to Closed
Same as ITS#6335