[Date Prev][Date Next] [Chronological] [Thread] [Top]

(ITS#6200) slapd crashes under load w/ syncrepl



Full_Name: Tracy Stenvik
Version: 2.4.16 (+db-4.7.25pl4)
OS: Red Hat Enterprise Linux Server release 5.3 (Tikanga)
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (140.142.111.41)


In evaluating 2.4.16 (to upgrade our 2.3.39-based directory service) we have
encountered some serious issues.  A similated production-level load that
2.3.39 (with slurpd running) handles easily causes 2.4.16 (and 2.4.15) to
crash within 5 minutes to 24 hours.

Without replication active, 2.4.16 survived 24 hours under load.  As soon as
replication was activated however, the server crashed within 5 minutes.
We've tried several syncrepl variants (as described below) but none appear
stable.

In analyzing the various crash dumps, most of the trouble appears to lie in
the caching code.  It has a nasty habit of returing nulls when nulls are not
expected.

Our simulated load consists of an arbitrary mix of adds/modifies/searches to 
various objects in the directory.  This load is defined as "N" instances of 
the "Stressor" test, as discussed on the "GDS Stress Test Findings" page at,

  http://wiki.cac.washington.edu/display/infra/GDS+Stress+Test+Findings

Following, you will find a series of crash dumps, along with a few comments.
Also included are sections the various slapd.confs we were testing with.

Please contact me if you require additional information.



delta syncrepl crash
--------------------
(gdb) where
#0  0xb7b7abab in strncmp () from /lib/libc.so.6
#1  0x08143ea4 in bdb_rdn_cmp (v_e1=0xf069be8, v_e2=0xf73cb48) at cache.c:301
#2  0x081d00ee in avl_insert (root=0x8622088, data=0xf069be8, fcmp=0x8143e55
<bdb_rdn_cmp>, fdup=0x81d10a0 <avl_dup_error>) at avl.c:95
#3  0x08144692 in hdb_cache_find_parent (op=0xf95ea78, txn=0xe7def98, id=343637,
res=0x14b3ade0) at cache.c:593
#4  0x08144dc9 in hdb_cache_find_id (op=0xf95ea78, tid=0xe7def98, id=343637,
eip=0x14b3ade0, flag=0, lock=0x14b3adc4) at cache.c:905
#5  0x0810b6a0 in hdb_search (op=0xf95ea78, rs=0x14bfc0e8) at search.c:705
#6  0x080e9b02 in overlay_op_walk (op=0xf95ea78, rs=0x14bfc0e8, which=op_search,
oi=0x85aeed8, on=0x0) at backover.c:669
#7  0x080e9cb7 in over_op_func (op=0xf95ea78, rs=0x14bfc0e8, which=op_search) at
backover.c:721
#8  0x080e9d3b in over_op_search (op=0xf95ea78, rs=0x14bfc0e8) at
backover.c:743
#9  0x0806f0b3 in fe_op_search (op=0xf95ea78, rs=0x14bfc0e8) at search.c:366
#10 0x0806ea53 in do_search (op=0xf95ea78, rs=0x14bfc0e8) at search.c:217
#11 0x0806b959 in connection_operation (ctx=0x14bfc1d0, arg_v=0xf95ea78) at
connection.c:1097
#12 0x0806be33 in connection_read_thread (ctx=0x14bfc1d0, argv=0x23) at
connection.c:1223
#13 0x081d74a1 in ldap_int_thread_pool_wrapper (xpool=0x8561178) at tpool.c:663
#14 0xb7e4d49b in start_thread () from /lib/libpthread.so.0
#15 0xb7bdc42e in clone () from /lib/libc.so.6
(gdb) p *(EntryInfo *)v_e1
$2 = {bei_parent = 0x8622058, bei_id = 343637, bei_lockpad = 0, bei_state = 132,
bei_finders = 0, bei_nrdn = {bv_len = 45, 
    bv_val = 0xf7af1f8 "serialNumber=40442595174b5ebf53167252b5cbcc34"}, bei_rdn
= {bv_len = 45, 
    bv_val = 0xf36d700 "serialNumber=40442595174B5EBF53167252B5CBCC34"},
bei_modrdns = 0, bei_ckids = 0, bei_dkids = 0, bei_e = 0x0, bei_kids = 0x0, 
  bei_kids_mutex = {__data = {__lock = 1, __count = 0, __owner = 32550, __kind =
0, __nusers = 1, {__spins = 0, __list = {__next = 0x0}}}, 
    __size = "\001\000\000\000\000\000\000\000&\177\000\000\000\000\000\000\001\000\000\000\000\000\000",
__align = 1}, bei_lrunext = 0x8978ad0, 
  bei_lruprev = 0x0}
(gdb) p *(EntryInfo *)v_e2
$3 = {bei_parent = 0x0, bei_id = 410257, bei_lockpad = 0, bei_state = 0,
bei_finders = 0, bei_nrdn = {bv_len = 45, bv_val = 0x0}, bei_rdn = {
    bv_len = 45, bv_val = 0x0}, bei_modrdns = 0, bei_ckids = 0, bei_dkids = 0,
bei_e = 0x0, bei_kids = 0x0, bei_kids_mutex = {__data = {__lock = 0, 
      __count = 0, __owner = 0, __kind = 0, __nusers = 0, {__spins = 0, __list =
{__next = 0x0}}}, __size = '\0' <repeats 23 times>, __align = 0}, 
  bei_lrunext = 0x8978ad0, bei_lruprev = 0x0}

In bdb_rdn_cmp() you're comparing e1->bei_nrdn.bv_val to e2->bei_nrdn.bv_val
but the latter is null even though e2->bei_nrdn.bv_len is not zero.


standard syncrepl crash
-----------------------
(gdb) where
#0  0xb7d9f300 in pthread_mutex_lock () from /lib/libpthread.so.0
#1  0x081d8645 in ldap_pvt_thread_mutex_lock (mutex=0x34) at thr_posix.c:296
#2  0x08145bc0 in bdb_cache_delete_internal (cache=0xa25cc04, e=0x147e6d38,
decr=0) at cache.c:1372
#3  0x08144b08 in hdb_cache_lru_purge (bdb=0xa25cbc8) at cache.c:776
#4  0x081452d1 in hdb_cache_find_id (op=0x11923ca8, tid=0x1316aa48, id=280869,
eip=0x2bffcbfc, flag=0, lock=0x2bffcc54) at cache.c:1053
#5  0x0814b2f7 in hdb_dn2entry (op=0x11923ca8, tid=0x1316aa48, dn=0xa25e3bc,
e=0x2bffcc64, matched=0, lock=0x2bffcc54) at dn2entry.c:67
#6  0x08151845 in hdb_entry_get (op=0x11923ca8, ndn=0xa25e3bc, oc=0xa2549f0,
at=0xa23ebb0, rw=0, ent=0x2bffceac) at id2entry.c:354
#7  0x080e936d in overlay_entry_get_ov (op=0x11923ca8, dn=0xa25e3bc,
oc=0xa2549f0, ad=0xa23ebb0, rw=0, e=0x2bffceac, on=0x0) at backover.c:364
#8  0x080e941f in over_entry_get_rw (op=0x11923ca8, dn=0xa25e3bc, oc=0xa2549f0,
ad=0xa23ebb0, rw=0, e=0x2bffceac) at backover.c:396
#9  0x0807cfc3 in be_entry_get_rw (op=0x11923ca8, ndn=0xa25e3bc, oc=0xa2549f0,
at=0xa23ebb0, rw=0, e=0x2bffceac) at backend.c:1346
#10 0x0807d1a2 in fe_acl_group (op=0x11923ca8, target=0x0, gr_ndn=0xa25e3bc,
op_ndn=0x11923d44, group_oc=0xa2549f0, group_at=0xa23ebb0) at backend.c:1402
#11 0x0807d8c3 in backend_group (op=0x11923ca8, target=0x0, gr_ndn=0xa25e3bc,
op_ndn=0x11923d44, group_oc=0xa2549f0, group_at=0xa23ebb0) at backend.c:1597
#12 0x080d32e2 in limits_get (op=0x11923ca8, limit=0x11923cdc) at limits.c:117
#13 0x080d588d in limits_check (op=0x11923ca8, rs=0x2bffe0e8) at limits.c:1106
#14 0x0806f095 in fe_op_search (op=0x11923ca8, rs=0x2bffe0e8) at search.c:364
#15 0x0806ea53 in do_search (op=0x11923ca8, rs=0x2bffe0e8) at search.c:217
#16 0x0806b959 in connection_operation (ctx=0x2bffe1d0, arg_v=0x11923ca8) at
connection.c:1097
#17 0x0806be33 in connection_read_thread (ctx=0x2bffe1d0, argv=0x1d) at
connection.c:1223
#18 0x081d74a1 in ldap_int_thread_pool_wrapper (xpool=0xa211178) at tpool.c:663
#19 0xb7d9d49b in start_thread () from /lib/libpthread.so.0
#20 0xb7b2c42e in clone () from /lib/libc.so.6
(gdb) p *e->bei_parent
Cannot access memory at address 0x0

In bdb_cache_delete_internal(), e->bei_parent is null.


2-way multimaster syncrepl crash
--------------------------------
(gdb) where
#0  0x08149c1a in hdb_delete (op=0xc01caa8, rs=0x2ecfc0e8) at delete.c:217
#1  0x080e9b02 in overlay_op_walk (op=0xc01caa8, rs=0x2ecfc0e8, which=op_delete,
oi=0x88b4550, on=0x0) at backover.c:669
#2  0x080e9cb7 in over_op_func (op=0xc01caa8, rs=0x2ecfc0e8, which=op_delete) at
backover.c:721
#3  0x080e9de5 in over_op_delete (op=0xc01caa8, rs=0x2ecfc0e8) at
backover.c:773
#4  0x0808ae88 in fe_op_delete (op=0xc01caa8, rs=0x2ecfc0e8) at delete.c:174
#5  0x0808abdb in do_delete (op=0xc01caa8, rs=0x2ecfc0e8) at delete.c:95
#6  0x0806b959 in connection_operation (ctx=0x2ecfc1d0, arg_v=0xc01caa8) at
connection.c:1097
#7  0x0806be33 in connection_read_thread (ctx=0x2ecfc1d0, argv=0x22) at
connection.c:1223
#8  0x081d74a1 in ldap_int_thread_pool_wrapper (xpool=0x8867178) at tpool.c:663
#9  0xb7e4749b in start_thread () from /lib/libpthread.so.0
#10 0xb7bd642e in clone () from /lib/libc.so.6
(gdb) p eip
$1 = (EntryInfo *) 0x0

In hdb_delete(), eip is null.


2-way multimaster syncrepl crash
--------------------------------
(gdb) where
#0  0x080956db in slap_acl_get (a=0x8b9ba80, count=0x37f5d824, op=0x8c5d948,
e=0x37f5dad0, desc=0x8b9b9f0, val=0x0, matches=0x37f5d1c4, state=0x37f5d97c) at
acl.c:599
#1  0x080949b9 in slap_access_allowed (op=0x8c5d948, e=0x37f5dad0,
desc=0x8b9b9f0, val=0x0, access=ACL_WDEL, state=0x37f5d97c, maskp=0x37f5d8e8) at
acl.c:236
#2  0x08094cd5 in fe_access_allowed (op=0x8c5d948, e=0x37f5dad0, desc=0x8b9b9f0,
val=0x0, access=ACL_WDEL, state=0x37f5d97c, maskp=0x37f5d8e8) at acl.c:360
#3  0x0809503c in access_allowed_mask (op=0x8c5d948, e=0x37f5dad0,
desc=0x8b9b9f0, val=0x0, access=ACL_WDEL, state=0x37f5d97c, maskp=0x0) at
acl.c:462
#4  0x08098b8f in acl_check_modlist (op=0x8c5d948, e=0x37f5dad0,
mlist=0x34afa7c8) at acl.c:1986
#5  0x0810373a in hdb_modify_internal (op=0x8c5d948, tid=0x961aa10,
modlist=0x34afa7c8, e=0x37f5dad0, text=0x37f5f0fc, textbuf=0x37f5db18 "",
textlen=256) at modify.c:88
#6  0x08105292 in hdb_modify (op=0x8c5d948, rs=0x37f5f0e8) at modify.c:590
#7  0x080e9b02 in overlay_op_walk (op=0x8c5d948, rs=0x37f5f0e8, which=op_modify,
oi=0x8bbb550, on=0x0) at backover.c:669
#8  0x080e9cb7 in over_op_func (op=0x8c5d948, rs=0x37f5f0e8, which=op_modify) at
backover.c:721
#9  0x080e9d7f in over_op_modify (op=0x8c5d948, rs=0x37f5f0e8) at
backover.c:755
#10 0x08088a89 in fe_op_modify (op=0x8c5d948, rs=0x37f5f0e8) at modify.c:301
#11 0x080884c8 in do_modify (op=0x8c5d948, rs=0x37f5f0e8) at modify.c:175
#12 0x0806b959 in connection_operation (ctx=0x37f5f1d0, arg_v=0x8c5d948) at
connection.c:1097
#13 0x0806be33 in connection_read_thread (ctx=0x37f5f1d0, argv=0x22) at
connection.c:1223
#14 0x081d74a1 in ldap_int_thread_pool_wrapper (xpool=0x8b6e178) at tpool.c:663
#15 0xb7e6749b in start_thread () from /lib/libpthread.so.0
#16 0xb7bf642e in clone () from /lib/libc.so.6
(gdb) p  e->e_nname
$3 = {bv_len = 77, bv_val = 0x0}

In slap_acl_get(), e->e_nname.bv_val (e->e_ndn) is null.



delta syncrepl slapd.conf
-------------------------
serverID 1 ldap://rufus01.u.washington.edu:290
serverID 2 ldap://rufus04.u.washington.edu:290

gentlehup       on
idletimeout     600
sizelimit       20000
sortvals        member

# ===== Replication db ====
database        hdb
suffix          "cn=log"
directory       /data/au20aa_log
rootdn          "cn=Manager,dc=washington,dc=edu"
#index           objectClass     eq
#index           entryUUID,entryCSN      eq
index           default eq
index           entryCSN,objectClass,reqEnd,reqResult,reqStart

overlay syncprov
syncprov-reloadhint true
syncprov-nopresent true

# ===== Primary db ===
database        hdb
suffix          "dc=washington,dc=edu"
rootdn          "cn=Manager,dc=washington,dc=edu"
include         /usr/local/openldap-2.4.16/etc/openldap/private/slapd-pw.conf
directory       /data/au20aa
index           entryCSN eq
index           entryUUID eq

checkpoint      10000 15
cachesize       90000
idlcachesize    180000

overlay syncprov
#syncprov-sessionlog 100
syncprov-checkpoint 1000 60

# accesslog overlay definitions for primary database
overlay accesslog
logdb cn=log
logops writes
logsuccess true

# scan the accesslog DB every day, and purge entries older than 7 days
logpurge 07+00:00 01+00:00


standard syncrepl slapd.conf
----------------------------
serverID 1 ldap://rufus01.u.washington.edu:290
serverID 2 ldap://rufus04.u.washington.edu:290

gentlehup       on
idletimeout     600
sizelimit       20000
sortvals        member

# ===== Primary db =====
database        hdb
suffix          "dc=washington,dc=edu"
rootdn          "cn=Manager,dc=washington,dc=edu"
include         /usr/local/openldap-2.4.16/etc/openldap/private/slapd-pw.conf
directory       /data/au20aa
index           entryCSN eq
index           entryUUID eq

checkpoint      10000 15
cachesize       90000
idlcachesize    180000

overlay syncprov
#syncprov-sessionlog 100
syncprov-checkpoint 1000 60



2-way multimaster syncrepl slapd.conf
-------------------------------------
serverID 1 ldap://rufus01.u.washington.edu:290
serverID 2 ldap://rufus04.u.washington.edu:290

gentlehup       on
idletimeout     600
sizelimit       20000
sortvals        member

# ===== Primary db =====
database        hdb
suffix          "dc=washington,dc=edu"
rootdn          "cn=Manager,dc=washington,dc=edu"
include         /usr/local/openldap-2.4.16/etc/openldap/private/slapd-pw.conf
directory       /data/au20aa
index           entryCSN eq
index           entryUUID eq

checkpoint      10000 15
cachesize       90000
idlcachesize    180000

overlay syncprov
#syncprov-sessionlog 100
syncprov-checkpoint 1000 60

###
### ----------- Replication
###

syncrepl        rid=101
                provider=ldap://rufus01.u.washington.edu:290
                searchbase="dc=washington,dc=edu"
                bindmethod=sasl
                saslmech=external
                starttls=yes
                tls_cert=/usr/local/openldap-2.4.16/etc/openldap/rufus04.crt
                tls_key=/usr/local/openldap-2.4.16/etc/openldap/private/rufus04.key
                filter="(objectClass=*)"
                attrs="*,+"
                schemachecking=off
                scope=sub
                type=refreshAndPersist
                retry="60 +"

syncrepl        rid=102
                provider=ldap://rufus04.u.washington.edu:290
                searchbase="dc=washington,dc=edu"
                bindmethod=sasl
                saslmech=external
                starttls=yes
                tls_cert=/usr/local/openldap-2.4.16/etc/openldap/rufus04.crt
                tls_key=/usr/local/openldap-2.4.16/etc/openldap/private/rufus04.key
                filter="(objectClass=*)"
                attrs="*,+"
                schemachecking=off
                scope=sub
                type=refreshAndPersist
                retry="60 +"

mirrormode      yes