Full_Name: Tracy Stenvik
Version: 2.4.16 (+db-4.7.25pl4)
OS: Red Hat Enterprise Linux Server release 5.3 (Tikanga)
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (140.142.111.41)

In evaluating 2.4.16 (to upgrade our 2.3.39-based directory service) we have encountered some serious issues. A simulated production-level load that 2.3.39 (with slurpd running) handles easily causes 2.4.16 (and 2.4.15) to crash within 5 minutes to 24 hours. Without replication active, 2.4.16 survived 24 hours under load; as soon as replication was activated, however, the server crashed within 5 minutes. We've tried several syncrepl variants (as described below), but none appear stable.

In analyzing the various crash dumps, most of the trouble appears to lie in the caching code. It has a nasty habit of returning nulls when nulls are not expected.

Our simulated load consists of an arbitrary mix of adds/modifies/searches to various objects in the directory. This load is defined as "N" instances of the "Stressor" test, as discussed on the "GDS Stress Test Findings" page at:

http://wiki.cac.washington.edu/display/infra/GDS+Stress+Test+Findings

Following, you will find a series of crash dumps, along with a few comments. Also included are sections of the various slapd.confs we were testing with. Please contact me if you require additional information.
delta syncrepl crash
--------------------

(gdb) where
#0  0xb7b7abab in strncmp () from /lib/libc.so.6
#1  0x08143ea4 in bdb_rdn_cmp (v_e1=0xf069be8, v_e2=0xf73cb48) at cache.c:301
#2  0x081d00ee in avl_insert (root=0x8622088, data=0xf069be8, fcmp=0x8143e55 <bdb_rdn_cmp>, fdup=0x81d10a0 <avl_dup_error>) at avl.c:95
#3  0x08144692 in hdb_cache_find_parent (op=0xf95ea78, txn=0xe7def98, id=343637, res=0x14b3ade0) at cache.c:593
#4  0x08144dc9 in hdb_cache_find_id (op=0xf95ea78, tid=0xe7def98, id=343637, eip=0x14b3ade0, flag=0, lock=0x14b3adc4) at cache.c:905
#5  0x0810b6a0 in hdb_search (op=0xf95ea78, rs=0x14bfc0e8) at search.c:705
#6  0x080e9b02 in overlay_op_walk (op=0xf95ea78, rs=0x14bfc0e8, which=op_search, oi=0x85aeed8, on=0x0) at backover.c:669
#7  0x080e9cb7 in over_op_func (op=0xf95ea78, rs=0x14bfc0e8, which=op_search) at backover.c:721
#8  0x080e9d3b in over_op_search (op=0xf95ea78, rs=0x14bfc0e8) at backover.c:743
#9  0x0806f0b3 in fe_op_search (op=0xf95ea78, rs=0x14bfc0e8) at search.c:366
#10 0x0806ea53 in do_search (op=0xf95ea78, rs=0x14bfc0e8) at search.c:217
#11 0x0806b959 in connection_operation (ctx=0x14bfc1d0, arg_v=0xf95ea78) at connection.c:1097
#12 0x0806be33 in connection_read_thread (ctx=0x14bfc1d0, argv=0x23) at connection.c:1223
#13 0x081d74a1 in ldap_int_thread_pool_wrapper (xpool=0x8561178) at tpool.c:663
#14 0xb7e4d49b in start_thread () from /lib/libpthread.so.0
#15 0xb7bdc42e in clone () from /lib/libc.so.6

(gdb) p *(EntryInfo *)v_e1
$2 = {bei_parent = 0x8622058, bei_id = 343637, bei_lockpad = 0, bei_state = 132, bei_finders = 0,
  bei_nrdn = {bv_len = 45, bv_val = 0xf7af1f8 "serialNumber=40442595174b5ebf53167252b5cbcc34"},
  bei_rdn = {bv_len = 45, bv_val = 0xf36d700 "serialNumber=40442595174B5EBF53167252B5CBCC34"},
  bei_modrdns = 0, bei_ckids = 0, bei_dkids = 0, bei_e = 0x0, bei_kids = 0x0,
  bei_kids_mutex = {__data = {__lock = 1, __count = 0, __owner = 32550, __kind = 0, __nusers = 1, {__spins = 0, __list = {__next = 0x0}}},
    __size = "\001\000\000\000\000\000\000\000&\177\000\000\000\000\000\000\001\000\000\000\000\000\000", __align = 1},
  bei_lrunext = 0x8978ad0, bei_lruprev = 0x0}

(gdb) p *(EntryInfo *)v_e2
$3 = {bei_parent = 0x0, bei_id = 410257, bei_lockpad = 0, bei_state = 0, bei_finders = 0,
  bei_nrdn = {bv_len = 45, bv_val = 0x0}, bei_rdn = {bv_len = 45, bv_val = 0x0},
  bei_modrdns = 0, bei_ckids = 0, bei_dkids = 0, bei_e = 0x0, bei_kids = 0x0,
  bei_kids_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __kind = 0, __nusers = 0, {__spins = 0, __list = {__next = 0x0}}},
    __size = '\0' <repeats 23 times>, __align = 0},
  bei_lrunext = 0x8978ad0, bei_lruprev = 0x0}

In bdb_rdn_cmp() you're comparing e1->bei_nrdn.bv_val to e2->bei_nrdn.bv_val, but the latter is null even though e2->bei_nrdn.bv_len is not zero.

standard syncrepl crash
-----------------------

(gdb) where
#0  0xb7d9f300 in pthread_mutex_lock () from /lib/libpthread.so.0
#1  0x081d8645 in ldap_pvt_thread_mutex_lock (mutex=0x34) at thr_posix.c:296
#2  0x08145bc0 in bdb_cache_delete_internal (cache=0xa25cc04, e=0x147e6d38, decr=0) at cache.c:1372
#3  0x08144b08 in hdb_cache_lru_purge (bdb=0xa25cbc8) at cache.c:776
#4  0x081452d1 in hdb_cache_find_id (op=0x11923ca8, tid=0x1316aa48, id=280869, eip=0x2bffcbfc, flag=0, lock=0x2bffcc54) at cache.c:1053
#5  0x0814b2f7 in hdb_dn2entry (op=0x11923ca8, tid=0x1316aa48, dn=0xa25e3bc, e=0x2bffcc64, matched=0, lock=0x2bffcc54) at dn2entry.c:67
#6  0x08151845 in hdb_entry_get (op=0x11923ca8, ndn=0xa25e3bc, oc=0xa2549f0, at=0xa23ebb0, rw=0, ent=0x2bffceac) at id2entry.c:354
#7  0x080e936d in overlay_entry_get_ov (op=0x11923ca8, dn=0xa25e3bc, oc=0xa2549f0, ad=0xa23ebb0, rw=0, e=0x2bffceac, on=0x0) at backover.c:364
#8  0x080e941f in over_entry_get_rw (op=0x11923ca8, dn=0xa25e3bc, oc=0xa2549f0, ad=0xa23ebb0, rw=0, e=0x2bffceac) at backover.c:396
#9  0x0807cfc3 in be_entry_get_rw (op=0x11923ca8, ndn=0xa25e3bc, oc=0xa2549f0, at=0xa23ebb0, rw=0, e=0x2bffceac) at backend.c:1346
#10 0x0807d1a2 in fe_acl_group (op=0x11923ca8, target=0x0, gr_ndn=0xa25e3bc, op_ndn=0x11923d44, group_oc=0xa2549f0, group_at=0xa23ebb0) at backend.c:1402
#11 0x0807d8c3 in backend_group (op=0x11923ca8, target=0x0, gr_ndn=0xa25e3bc, op_ndn=0x11923d44, group_oc=0xa2549f0, group_at=0xa23ebb0) at backend.c:1597
#12 0x080d32e2 in limits_get (op=0x11923ca8, limit=0x11923cdc) at limits.c:117
#13 0x080d588d in limits_check (op=0x11923ca8, rs=0x2bffe0e8) at limits.c:1106
#14 0x0806f095 in fe_op_search (op=0x11923ca8, rs=0x2bffe0e8) at search.c:364
#15 0x0806ea53 in do_search (op=0x11923ca8, rs=0x2bffe0e8) at search.c:217
#16 0x0806b959 in connection_operation (ctx=0x2bffe1d0, arg_v=0x11923ca8) at connection.c:1097
#17 0x0806be33 in connection_read_thread (ctx=0x2bffe1d0, argv=0x1d) at connection.c:1223
#18 0x081d74a1 in ldap_int_thread_pool_wrapper (xpool=0xa211178) at tpool.c:663
#19 0xb7d9d49b in start_thread () from /lib/libpthread.so.0
#20 0xb7b2c42e in clone () from /lib/libc.so.6

(gdb) p *e->bei_parent
Cannot access memory at address 0x0

In bdb_cache_delete_internal(), e->bei_parent is null.
2-way multimaster syncrepl crash
--------------------------------

(gdb) where
#0  0x08149c1a in hdb_delete (op=0xc01caa8, rs=0x2ecfc0e8) at delete.c:217
#1  0x080e9b02 in overlay_op_walk (op=0xc01caa8, rs=0x2ecfc0e8, which=op_delete, oi=0x88b4550, on=0x0) at backover.c:669
#2  0x080e9cb7 in over_op_func (op=0xc01caa8, rs=0x2ecfc0e8, which=op_delete) at backover.c:721
#3  0x080e9de5 in over_op_delete (op=0xc01caa8, rs=0x2ecfc0e8) at backover.c:773
#4  0x0808ae88 in fe_op_delete (op=0xc01caa8, rs=0x2ecfc0e8) at delete.c:174
#5  0x0808abdb in do_delete (op=0xc01caa8, rs=0x2ecfc0e8) at delete.c:95
#6  0x0806b959 in connection_operation (ctx=0x2ecfc1d0, arg_v=0xc01caa8) at connection.c:1097
#7  0x0806be33 in connection_read_thread (ctx=0x2ecfc1d0, argv=0x22) at connection.c:1223
#8  0x081d74a1 in ldap_int_thread_pool_wrapper (xpool=0x8867178) at tpool.c:663
#9  0xb7e4749b in start_thread () from /lib/libpthread.so.0
#10 0xb7bd642e in clone () from /lib/libc.so.6

(gdb) p eip
$1 = (EntryInfo *) 0x0

In hdb_delete(), eip is null.
2-way multimaster syncrepl crash
--------------------------------

(gdb) where
#0  0x080956db in slap_acl_get (a=0x8b9ba80, count=0x37f5d824, op=0x8c5d948, e=0x37f5dad0, desc=0x8b9b9f0, val=0x0, matches=0x37f5d1c4, state=0x37f5d97c) at acl.c:599
#1  0x080949b9 in slap_access_allowed (op=0x8c5d948, e=0x37f5dad0, desc=0x8b9b9f0, val=0x0, access=ACL_WDEL, state=0x37f5d97c, maskp=0x37f5d8e8) at acl.c:236
#2  0x08094cd5 in fe_access_allowed (op=0x8c5d948, e=0x37f5dad0, desc=0x8b9b9f0, val=0x0, access=ACL_WDEL, state=0x37f5d97c, maskp=0x37f5d8e8) at acl.c:360
#3  0x0809503c in access_allowed_mask (op=0x8c5d948, e=0x37f5dad0, desc=0x8b9b9f0, val=0x0, access=ACL_WDEL, state=0x37f5d97c, maskp=0x0) at acl.c:462
#4  0x08098b8f in acl_check_modlist (op=0x8c5d948, e=0x37f5dad0, mlist=0x34afa7c8) at acl.c:1986
#5  0x0810373a in hdb_modify_internal (op=0x8c5d948, tid=0x961aa10, modlist=0x34afa7c8, e=0x37f5dad0, text=0x37f5f0fc, textbuf=0x37f5db18 "", textlen=256) at modify.c:88
#6  0x08105292 in hdb_modify (op=0x8c5d948, rs=0x37f5f0e8) at modify.c:590
#7  0x080e9b02 in overlay_op_walk (op=0x8c5d948, rs=0x37f5f0e8, which=op_modify, oi=0x8bbb550, on=0x0) at backover.c:669
#8  0x080e9cb7 in over_op_func (op=0x8c5d948, rs=0x37f5f0e8, which=op_modify) at backover.c:721
#9  0x080e9d7f in over_op_modify (op=0x8c5d948, rs=0x37f5f0e8) at backover.c:755
#10 0x08088a89 in fe_op_modify (op=0x8c5d948, rs=0x37f5f0e8) at modify.c:301
#11 0x080884c8 in do_modify (op=0x8c5d948, rs=0x37f5f0e8) at modify.c:175
#12 0x0806b959 in connection_operation (ctx=0x37f5f1d0, arg_v=0x8c5d948) at connection.c:1097
#13 0x0806be33 in connection_read_thread (ctx=0x37f5f1d0, argv=0x22) at connection.c:1223
#14 0x081d74a1 in ldap_int_thread_pool_wrapper (xpool=0x8b6e178) at tpool.c:663
#15 0xb7e6749b in start_thread () from /lib/libpthread.so.0
#16 0xb7bf642e in clone () from /lib/libc.so.6

(gdb) p e->e_nname
$3 = {bv_len = 77, bv_val = 0x0}

In slap_acl_get(), e->e_nname.bv_val (e->e_ndn) is null.
delta syncrepl slapd.conf
-------------------------

serverID 1 ldap://rufus01.u.washington.edu:290
serverID 2 ldap://rufus04.u.washington.edu:290
gentlehup on
idletimeout 600
sizelimit 20000
sortvals member

# ===== Replication db =====
database hdb
suffix "cn=log"
directory /data/au20aa_log
rootdn "cn=Manager,dc=washington,dc=edu"
#index objectClass eq
#index entryUUID,entryCSN eq
index default eq
index entryCSN,objectClass,reqEnd,reqResult,reqStart
overlay syncprov
syncprov-reloadhint true
syncprov-nopresent true

# ===== Primary db =====
database hdb
suffix "dc=washington,dc=edu"
rootdn "cn=Manager,dc=washington,dc=edu"
include /usr/local/openldap-2.4.16/etc/openldap/private/slapd-pw.conf
directory /data/au20aa
index entryCSN eq
index entryUUID eq
checkpoint 10000 15
cachesize 90000
idlcachesize 180000
overlay syncprov
#syncprov-sessionlog 100
syncprov-checkpoint 1000 60
# accesslog overlay definitions for primary database
overlay accesslog
logdb cn=log
logops writes
logsuccess true
# scan the accesslog DB every day, and purge entries older than 7 days
logpurge 07+00:00 01+00:00

standard syncrepl slapd.conf
----------------------------

serverID 1 ldap://rufus01.u.washington.edu:290
serverID 2 ldap://rufus04.u.washington.edu:290
gentlehup on
idletimeout 600
sizelimit 20000
sortvals member

# ===== Primary db =====
database hdb
suffix "dc=washington,dc=edu"
rootdn "cn=Manager,dc=washington,dc=edu"
include /usr/local/openldap-2.4.16/etc/openldap/private/slapd-pw.conf
directory /data/au20aa
index entryCSN eq
index entryUUID eq
checkpoint 10000 15
cachesize 90000
idlcachesize 180000
overlay syncprov
#syncprov-sessionlog 100
syncprov-checkpoint 1000 60

2-way multimaster syncrepl slapd.conf
-------------------------------------

serverID 1 ldap://rufus01.u.washington.edu:290
serverID 2 ldap://rufus04.u.washington.edu:290
gentlehup on
idletimeout 600
sizelimit 20000
sortvals member

# ===== Primary db =====
database hdb
suffix "dc=washington,dc=edu"
rootdn "cn=Manager,dc=washington,dc=edu"
include /usr/local/openldap-2.4.16/etc/openldap/private/slapd-pw.conf
directory /data/au20aa
index entryCSN eq
index entryUUID eq
checkpoint 10000 15
cachesize 90000
idlcachesize 180000
overlay syncprov
#syncprov-sessionlog 100
syncprov-checkpoint 1000 60

###
### ----------- Replication
###
syncrepl rid=101
  provider=ldap://rufus01.u.washington.edu:290
  searchbase="dc=washington,dc=edu"
  bindmethod=sasl
  saslmech=external
  starttls=yes
  tls_cert=/usr/local/openldap-2.4.16/etc/openldap/rufus04.crt
  tls_key=/usr/local/openldap-2.4.16/etc/openldap/private/rufus04.key
  filter="(objectClass=*)"
  attrs="*,+"
  schemachecking=off
  scope=sub
  type=refreshAndPersist
  retry="60 +"
syncrepl rid=102
  provider=ldap://rufus04.u.washington.edu:290
  searchbase="dc=washington,dc=edu"
  bindmethod=sasl
  saslmech=external
  starttls=yes
  tls_cert=/usr/local/openldap-2.4.16/etc/openldap/rufus04.crt
  tls_key=/usr/local/openldap-2.4.16/etc/openldap/private/rufus04.key
  filter="(objectClass=*)"
  attrs="*,+"
  schemachecking=off
  scope=sub
  type=refreshAndPersist
  retry="60 +"
mirrormode yes
--On Wednesday, July 08, 2009 4:44 PM +0000 imf@u.washington.edu wrote:

> Full_Name: Tracy Stenvik
> Version: 2.4.16 (+db-4.7.25pl4)
> OS: Red Hat Enterprise Linux Server release 5.3 (Tikanga)
> URL: ftp://ftp.openldap.org/incoming/
> Submission from: (NULL) (140.142.111.41)
>
> In evaluating 2.4.16 (to upgrade our 2.3.39-based directory service) we
> have encountered some serious issues. A simulated production-level load
> that 2.3.39 (with slurpd running) handles easily causes 2.4.16 (and
> 2.4.15) to crash within 5 minutes to 24 hours.

Please re-test with current RE24 from CVS, as numerous fixes have been made and 2.4.17 is imminent.

Thanks!

--Quanah

--
Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc
--------------------
Zimbra :: the leader in open source messaging and collaboration
The changes made in 2.4.17 seem to have fixed the crashes in the caching module. Thanks for that.

We are still able to crash 2.4.17, however. It only happens after a heavy load has been placed on the producer for >24 hours continuously. Unfortunately, we've not been able to get good tracebacks. They all look like this:

(gdb) where
#0  0x00869410 in __kernel_vsyscall ()
#1  0x00390d80 in raise () from /lib/libc.so.6
#2  0x00392691 in abort () from /lib/libc.so.6
#3  0x0038a1fb in __assert_fail () from /lib/libc.so.6
#4  0x0808d532 in malloc ()
#5  0x0822c93f in ?? ()
#6  0x0822c933 in ?? ()
#7  0x00000039 in ?? ()
#8  0x0822c908 in ?? ()
#9  0x00000000 in ?? ()

The producer slowly grows its memory footprint. I can't tell if it's from just normal operations or memory leaks; I suspect it's a little of both. The end result, as you can see from the core above, is that there's likely some corrupted (or unfreed) memory somewhere. Sorry I can't nail it down further.

The load profile that we placed on the server is documented in my prior report. See above.

---
Tracy Stenvik
University Computing Services 354843.
University of Washington
email: imf@u.washington.edu
voice: (206) 685-3344
imf@u.washington.edu wrote:
> The changes made to 2.4.17 seem to have fixed the crashes in the caching
> module. Thanks for that.
>
> We still are able to crash 2.4.17, however. It only happens after a heavy
> load is placed on the producer for >24 hours continuous. Unfortunately,
> we've not been able to get good tracebacks. They all look like this,
>
> (gdb) where
> #0  0x00869410 in __kernel_vsyscall ()
> #1  0x00390d80 in raise () from /lib/libc.so.6
> #2  0x00392691 in abort () from /lib/libc.so.6
> #3  0x0038a1fb in __assert_fail () from /lib/libc.so.6
> #4  0x0808d532 in malloc ()
> #5  0x0822c93f in ?? ()
> #6  0x0822c933 in ?? ()
> #7  0x00000039 in ?? ()
> #8  0x0822c908 in ?? ()
> #9  0x00000000 in ?? ()
>
> The producer slowly grows its memory footprint. I can't tell if it's from
> just normal operations or memory leaks. I suspect it's a little of both.
> The end result, as you can see from the core above, is that there's likely
> some corrupted (or unfreed) memory somewhere. Sorry I can't nail it down
> further.

There's not enough information here.

An assert is always accompanied by an error message on stderr; we need to see the actual error message.

There are no symbols in the stack trace pertaining to slapd itself. You seem to be running a stripped binary. Please provide a trace using a non-stripped binary that was compiled with -g (debug symbols enabled).

> The load profile that we placed on the server is documented in my prior
> report. See above.

--
-- Howard Chu
  CTO, Symas Corp.               http://www.symas.com
  Director, Highland Sun         http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP      http://www.openldap.org/project/
changed notes changed state Open to Suspended
Sorry about the stripped trace. I forgot that the install procedure always strips the binaries...

Okay, with our stress profile it takes ~36 hours to fail. I always start with a clean db rebuild before each run. Each failure produces the same traceback:

(gdb) where
#0  0x00b97410 in __kernel_vsyscall ()
#1  0x00471d80 in raise () from /lib/libc.so.6
#2  0x00473691 in abort () from /lib/libc.so.6
#3  0x0046b1fb in __assert_fail () from /lib/libc.so.6
#4  0x0808d532 in ch_malloc (size=4436335) at ch_malloc.c:57
#5  0x08079ad2 in entry_encode (e=0x3a3dac0, bv=0x3a3d9b0) at entry.c:742
#6  0x0815240e in bdb_id2entry_put (be=0x3a3dca0, tid=0xbc6f7378, e=0x3a3dac0, flag=0) at id2entry.c:54
#7  0x08152508 in hdb_id2entry_update (be=0x3a3dca0, tid=0xbc6f7378, e=0x3a3dac0) at id2entry.c:90
#8  0x08106374 in hdb_modify (op=0xdabbc28, rs=0x3a3f0e4) at modify.c:611
#9  0x080ea38e in overlay_op_walk (op=0xdabbc28, rs=0x3a3f0e4, which=op_modify, oi=0x8be2788, on=0x0) at backover.c:669
#10 0x080ea543 in over_op_func (op=0xdabbc28, rs=0x3a3f0e4, which=op_modify) at backover.c:721
#11 0x080ea60b in over_op_modify (op=0xdabbc28, rs=0x3a3f0e4) at backover.c:755
#12 0x08089151 in fe_op_modify (op=0xdabbc28, rs=0x3a3f0e4) at modify.c:301
#13 0x08088b90 in do_modify (op=0xdabbc28, rs=0x3a3f0e4) at modify.c:175
#14 0x0806be8f in connection_operation (ctx=0x3a3f1d0, arg_v=0xdabbc28) at connection.c:1115
#15 0x0806c3cf in connection_read_thread (ctx=0x3a3f1d0, argv=0x1a) at connection.c:1251
#16 0x081d8fa9 in ldap_int_thread_pool_wrapper (xpool=0x8b941b0) at tpool.c:685
#17 0x0043749b in start_thread () from /lib/libpthread.so.0
#18 0x0051a42e in clone () from /lib/libc.so.6

---
Tracy Stenvik
University Computing Services 354843.
University of Washington
email: imf@u.washington.edu
voice: (206) 685-3344

On Thu, 23 Jul 2009, Howard Chu wrote:

> imf@u.washington.edu wrote:
>> The changes made to 2.4.17 seem to have fixed the crashes in the caching
>> module. Thanks for that.
>>
>> We still are able to crash 2.4.17, however. It only happens after a heavy
>> load is placed on the producer for >24 hours continuous. Unfortunately,
>> we've not been able to get good tracebacks.
>
> There's not enough information here.
>
> An assert is always accompanied by an error message on stderr; we need to
> see the actual error message.
>
> There are no symbols in the stack trace pertaining to slapd itself. You
> seem to be running a stripped binary. Please provide a trace using a
> non-stripped binary that was compiled with -g (debug symbols enabled).
> Sorry about the stripped trace. I forgot that the install procedure
> always strips the binaries...
>
> Okay, with our stress profile it takes ~36 hours to fail. I always
> start with a clean db rebuild before each run. Each failure produces
> the same traceback:
>
> (gdb) where
> #0  0x00b97410 in __kernel_vsyscall ()
> #1  0x00471d80 in raise () from /lib/libc.so.6
> #2  0x00473691 in abort () from /lib/libc.so.6
> #3  0x0046b1fb in __assert_fail () from /lib/libc.so.6
> #4  0x0808d532 in ch_malloc (size=4436335) at ch_malloc.c:57

^^^ this really looks like memory exhaustion while trying to malloc a large chunk (>4MB). Can you tell, by printing e->e_name, whether it's correct that the server was modifying a large entry?

p.
On Tue, 18 Aug 2009, masarati@aero.polimi.it wrote:

>> (gdb) where
>> #0  0x00b97410 in __kernel_vsyscall ()
>> #1  0x00471d80 in raise () from /lib/libc.so.6
>> #2  0x00473691 in abort () from /lib/libc.so.6
>> #3  0x0046b1fb in __assert_fail () from /lib/libc.so.6
>> #4  0x0808d532 in ch_malloc (size=4436335) at ch_malloc.c:57
>
> ^^^ this really looks like memory exhaustion while trying to malloc a
> large chunk (>4MB). Can you tell, by printing e->e_name, whether it's
> correct that the server was modifying a large entry?

int entry_encode(Entry *e, struct berval *bv)
{
        ber_len_t len, dnlen, ndnlen, i;
        int nattrs, nvals;
        Attribute *a;
        unsigned char *ptr;

        Debug( LDAP_DEBUG_TRACE, "=> entry_encode(0x%08lx): %s\n",
                (long) e->e_id, e->e_dn, 0 );

        dnlen = e->e_name.bv_len;
        ndnlen = e->e_nname.bv_len;

        entry_partsize( e, &len, &nattrs, &nvals, 1 );

        bv->bv_len = len;
        bv->bv_val = ch_malloc(len);

(gdb) p *e
$8 = {e_id = 343637, e_name = {bv_len = 0, bv_val = 0x822169c ""}, e_nname = {bv_len = 0, bv_val = 0x822169c ""},
  e_attrs = 0x208d270c, e_ocflags = 256, e_bv = {bv_len = 6598587, bv_val = 0x1ddb5008 "\021"}, e_private = 0x9ad5ce0}
(gdb) p *bv
$2 = {bv_len = 4436293, bv_val = 0x1 <Address 0x1 out of bounds>}
(gdb) p nattrs
$6 = 16
(gdb) p nvals
$7 = 270294

Well, according to e->e_name, e_dn (e_name.bv_val) is empty, so this is likely bogus. We do have some large member entries (>100K) in our test profile; however, none add up to anywhere near >4M.

Tracy
What does this look like with top and/or in dmesg over the run time? Is it a simple out-of-memory? Definitely a bit of a gross method, but what does 'ls -lh core' show for size?

(If so, is it warranted given your load, or is there a leak, etc. etc. ... and of course make sure you're up to date on the surrounding packages; OpenLDAP isn't the only thing that can leak.)
On Mon, 24 Aug 2009, Aaron Richton wrote:

> What does this look like with top and/or in dmesg over the run time? Is
> it a simple out-of-memory? Definitely a bit of a gross method, but what
> does 'ls -lh core' show for size?

I haven't been able to monitor it constantly over its lifespan, as it takes anywhere from 24 to 36 hours of our load before failure, but what I have observed is that slapd continually grows (slowly) over the course of the test until it can't malloc any more. Core sizes run around 640 MB. The largest '__db' element is 1.9 GB.

So, slapd behaves as if there's a slow memory leak somewhere, unless the delta syncrepl mechanism (on the producer) has an insatiable memory appetite. Our test suite is heavy on updates, including mods to large-member groups (with 120K+ members). The consumer slapd never dies, and more or less keeps up with the change load from the producer.

> (If so, is it warranted given your load or is there a leak, etc. etc. ...
> and of course make sure you're up to date on the surrounding packages;
> OpenLDAP isn't the only thing that can leak.)

openldap-2.4.17
db-4.7.25 (patch level 4)
openssl 0.9.8e
Red Hat Enterprise Linux Server release 5.4

Tracy
We are trying out the 2.4.18 release with a syncrepl Multi-Master setup and are experiencing similar problems as mentioned above.

It will shut down with a segfault after heavy processing/updating of about 3000-17000 entries. We can reproduce this in about 10-20 minutes of heavy load. However, in this short execution time we can't find any signs of memory leakage.

Unfortunately I can't get any good tracebacks from gdb. I've tried compiling OpenLDAP with 'export CPPFLAGS="-D_GNU_SOURCE -g"', and also with --enable-debug on ./configure. Any pointers?

This is the traceback I get, but I guess it isn't very helpful without the complete debugging information:

(gdb) where
#0  0x00007fc7e24fb8eb in syncprov_op_mod (op=0x7fc7aedb37d0, rs=<value optimized out>) at syncprov.c:1970
#1  0x000000000049531a in overlay_op_walk ()
#2  0x0000000000495e7a in ?? ()
#3  0x000000000044a2e2 in fe_op_modify ()
#4  0x000000000044ac67 in do_modify ()
#5  0x00000000004321bf in ?? ()
#6  0x0000000000432e6c in ?? ()
#7  0x00000000004dd960 in ?? ()
#8  0x00007fc7e2e9b3ea in start_thread () from /lib/libpthread.so.0
#9  0x00007fc7e29f3cbd in clone () from /lib/libc.so.6
#10 0x0000000000000000 in ?? ()
(gdb)

Could these problems be related? And is there something we can do to help resolve it?
Regards,
Joacim Breiler

Configuration:

dn: olcDatabase={1}hdb,cn=config
objectClass: olcDatabaseConfig
objectClass: olcHdbConfig
olcDatabase: {1}hdb
olcDbDirectory: /srv/ldap
olcSuffix: dc=hgo,dc=se
olcAccess: {0}to attrs=userPassword,shadowLastChange by dn="cn=admin,dc=hgo,dc=se" write by anonymous auth by self write by * none
olcLastMod: TRUE
olcRootDN: cn=admin,dc=hgo,dc=se
olcRootPW: xxxxxxxx
olcSizeLimit: -1
olcSyncrepl: {0}rid=004 provider=ldap://diablo binddn="cn=admin,dc=hgo,dc=se" bindmethod=simple credentials="xxxxxxx" searchbase="dc=hgo,dc=se" type=refreshAndPersist retry="5 5 300 5" timeout=1
olcSyncrepl: {1}rid=003 provider=ldap://mephisto binddn="cn=admin,dc=hgo,dc=se" bindmethod=simple credentials="xxxxxxx" searchbase="dc=hgo,dc=se" type=refreshAndPersist retry="5 5 300 5" timeout=1
olcMirrorMode: TRUE
olcDbCheckpoint: 2048 5
olcDbIDLcacheSize: 90000
olcDbIndex: objectClass,entryCSN,entryUUID,cn eq
olcDbCacheSize: 10000
olcDbConfig: {0}set_cachesize 1 0 0
olcDbConfig: {1}set_lk_max_objects 32000
olcDbConfig: {2}set_lk_max_locks 32000
olcDbConfig: {3}set_lk_max_lockers 1500
olcDbConfig: {4}set_flags DB_LOG_AUTOREMOVE
olcDbConfig: {5}set_lg_bsize 104857600
olcDbConfig: {6}set_lg_dir /srv/ldap
--On Thursday, September 17, 2009 1:21 PM +0000 joacim.breiler@hgo.se wrote:

> We are trying out the 2.4.18 release with a syncrepl Multi-Master setup
> and are experiencing similar problems as mentioned above.
>
> It will shut down with a segfault after heavy processing/updating of
> about 3000-17000 entries. We can reproduce this in about 10-20 minutes
> of heavy load. However, in this short execution time we can't find any
> signs of memory leakage.
>
> Unfortunately I can't get any good tracebacks from gdb. I've tried
> compiling openldap with:
> export CPPFLAGS="-D_GNU_SOURCE -g". And also --enable-debug with
> ./configure. Any pointers?
>
> This is the traceback I get, but I guess it isn't very helpful without
> the complete debugging information:

When you ran "make install" did you put STRIP=""? Like

make install STRIP=""

otherwise, the slapd binary has all its debugging information stripped out.

--Quanah

--
Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc
--------------------
Zimbra :: the leader in open source messaging and collaboration
We build with:

% CPPFLAGS="-I/usr/local/krb5/include -I/usr/local/include/sasl" \
  CFLAGS="-g" ./configure ...
% make

The CFLAGS environment variable makes sure the '-g' is injected into all generated makefiles. You can add -D_GNU_SOURCE there as well.

We experienced the problems you described with <2.4.17; 2.4.17 fixed many of the segfault causes. We didn't repeat our stress test with multimaster and 2.4.17+, though, since we determined that deltasync was good enough for our purposes. It may very well be that 2.4.17+ would crash in multimaster mode with our stress test. The traceback you provided looks very similar to what we were seeing.

If I have time I could run a test with 2.4.18 and multimaster and see what the results are...

---
Tracy Stenvik
University Computing Services 354843.
University of Washington
email: imf@u.washington.edu
voice: (206) 685-3344

On Thu, 17 Sep 2009, Joacim Breiler wrote:

> We are trying out the 2.4.18 release with a syncrepl Multi-Master setup
> and are experiencing similar problems as mentioned above.
>
> It will shut down with a segfault after heavy processing/updating of
> about 3000-17000 entries. We can reproduce this in about 10-20 minutes
> of heavy load. However, in this short execution time we can't find any
> signs of memory leakage.
>
> Could these problems be related? And is there something we can do to
> help resolve it?
Hello,

I have the same problem with two Ubuntu 9.04 systems running the current
build from HEAD. I created 20000 entries at server1. Then I created another
20000 entries at server1 while reading the old and new entries from server1
and server2.

Here is the backtrace:

slapd: pthread_mutex_lock.c:289: __pthread_mutex_lock: Assertion `(-(e)) != 3 || !robust' failed.

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x923f9b90 (LWP 17860)]
0xb7fe0430 in __kernel_vsyscall ()
(gdb) bt
#0  0xb7fe0430 in __kernel_vsyscall ()
#1  0xb7b756d0 in raise () from /lib/tls/i686/cmov/libc.so.6
#2  0xb7b77098 in abort () from /lib/tls/i686/cmov/libc.so.6
#3  0xb7b6e5ce in __assert_fail () from /lib/tls/i686/cmov/libc.so.6
#4  0xb7e5df29 in pthread_mutex_lock () from /lib/tls/i686/cmov/libpthread.so.0
#5  0x0813fccb in ldap_pvt_thread_mutex_lock (mutex=0x8bc64dc) at thr_posix.c:296
#6  0xb7998e8e in syncprov_op_mod (op=0x8bc3690, rs=0x923f911c) at syncprov.c:1965
#7  0x08105bfd in overlay_op_walk (op=0x8bc3690, rs=0x923f911c, which=op_modify, oi=0x87e9d18, on=0x87eeef0) at backover.c:659
#8  0x08105e9a in over_op_func (op=0x8bc3690, rs=0x923f911c, which=op_modify) at backover.c:721
#9  0x08105fc7 in over_op_modify (op=0x8bc3690, rs=0x923f911c) at backover.c:760
#10 0x080a117e in fe_op_modify (op=0x8bc3690, rs=0x923f911c) at modify.c:301
#11 0x080a0a6d in do_modify (op=0x8bc3690, rs=0x923f911c) at modify.c:175
#12 0x08082a43 in connection_operation (ctx=0x923f9210, arg_v=0x8bc3690) at connection.c:1123
#13 0x08082f8b in connection_read_thread (ctx=0x923f9210, argv=0x12) at connection.c:1259
#14 0x0813eb0e in ldap_int_thread_pool_wrapper (xpool=0x87af8e8) at tpool.c:685
#15 0xb7e5c4ff in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
#16 0xb7c2e49e in clone () from /lib/tls/i686/cmov/libc.so.6

Please let me know if I can provide additional information for you. (But
please tell me how to do so; my gdb knowledge is fairly poor.)

Regards,
Frank Offermanns
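[Editor's note] Since the poster above asks how to collect better information
with gdb, here is a minimal sketch of pulling a full backtrace from every
thread of a core dump in batch mode. The slapd path and core-file name are
assumptions for illustration, not taken from the thread.

```shell
# Hedged sketch: batch-mode gdb against a slapd core dump.
# /usr/local/libexec/slapd and core.17860 are example paths only.
GDB_CMDS='set pagination off
thread apply all bt full'
# The real invocation is commented out because it needs an actual core file:
#   gdb /usr/local/libexec/slapd core.17860 -batch \
#       -ex "set pagination off" \
#       -ex "thread apply all bt full" > slapd-backtrace.txt
# Print the two key gdb commands this recipe relies on:
printf '%s\n' "$GDB_CMDS"
```

"thread apply all bt full" is the important part: it dumps local variables
for every frame of every thread, which is what developers usually ask for.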
Thanks for your fast replies! The 'make install STRIP=""' did the trick.

I ran the server for about a week with no segfaults, then I decided that I
couldn't wait any longer. So I connected to the server using an LDAP browser,
and after about ten minutes it segfaulted. So to reproduce I need one
connection that is reading/updating/creating entries, and one connection that
does an occasional read; otherwise the server will keep on running.

The traceback from gdb gives:

0x00007f96c3fc00ad in syncprov_op_mod (op=0x7f9688000ac0, rs=0x44688c80) at syncprov.c:1970
1970            for ( m2 = mt->mt_mods; m2->mi_next != mi;
(gdb) where
#0  0x00007f96c3fc00ad in syncprov_op_mod (op=0x7f9688000ac0, rs=0x44688c80) at syncprov.c:1970
#1  0x00000000004c438a in overlay_op_walk (op=0x7f9688000ac0, rs=0x44688c80, which=op_modify, oi=0x12293c0, on=0x12295a0) at backover.c:659
#2  0x00000000004c4667 in over_op_func (op=0x7f9688000ac0, rs=0x44688c80, which=op_modify) at backover.c:721
#3  0x00000000004c47ac in over_op_modify (op=0x7f9688000ac0, rs=0x44688c80) at backover.c:760
#4  0x000000000045bb21 in fe_op_modify (op=0x7f9688000ac0, rs=0x44688c80) at modify.c:301
#5  0x000000000045b430 in do_modify (op=0x7f9688000ac0, rs=0x44688c80) at modify.c:175
#6  0x000000000043c3ec in connection_operation (ctx=0x44688de0, arg_v=0x7f9688000ac0) at connection.c:1123
#7  0x000000000043c975 in connection_read_thread (ctx=0x44688de0, argv=0x17) at connection.c:1259
#8  0x0000000000525193 in ldap_int_thread_pool_wrapper (xpool=0x1190290) at tpool.c:685
#9  0x00007f96c49613ea in start_thread () from /lib/libpthread.so.0
#10 0x00007f96c44b9cbd in clone () from /lib/libc.so.6
#11 0x0000000000000000 in ?? ()

/J
joacim.breiler@hgo.se wrote:
> Thanks for your fast replies! The 'make install STRIP=""' did the trick.
>
> I ran the server for about a week with no segfaults, then I decided that
> I couldn't wait for it any longer. So I connected to the server using an
> LDAP browser, and then after about ten minutes it segfaulted. So to
> reproduce I need one connection that is reading/updating/creating
> entries. And one connection that does an occasional read, otherwise the
> server will keep on running.
>
> The traceback from gdb gives:

This trace looks like the problem fixed in ITS#6335. The fix is in CVS HEAD
and RE24 (2.4.20 pre-release).

> 0x00007f96c3fc00ad in syncprov_op_mod (op=0x7f9688000ac0, rs=0x44688c80)
> at syncprov.c:1970
> 1970            for ( m2 = mt->mt_mods; m2->mi_next != mi;
> (gdb) where
> #0  0x00007f96c3fc00ad in syncprov_op_mod (op=0x7f9688000ac0, rs=0x44688c80) at syncprov.c:1970
> #1  0x00000000004c438a in overlay_op_walk (op=0x7f9688000ac0, rs=0x44688c80, which=op_modify, oi=0x12293c0, on=0x12295a0) at backover.c:659
> #2  0x00000000004c4667 in over_op_func (op=0x7f9688000ac0, rs=0x44688c80, which=op_modify) at backover.c:721
> #3  0x00000000004c47ac in over_op_modify (op=0x7f9688000ac0, rs=0x44688c80) at backover.c:760
> #4  0x000000000045bb21 in fe_op_modify (op=0x7f9688000ac0, rs=0x44688c80) at modify.c:301
> #5  0x000000000045b430 in do_modify (op=0x7f9688000ac0, rs=0x44688c80) at modify.c:175
> #6  0x000000000043c3ec in connection_operation (ctx=0x44688de0, arg_v=0x7f9688000ac0) at connection.c:1123
> #7  0x000000000043c975 in connection_read_thread (ctx=0x44688de0, argv=0x17) at connection.c:1259
> #8  0x0000000000525193 in ldap_int_thread_pool_wrapper (xpool=0x1190290) at tpool.c:685
> #9  0x00007f96c49613ea in start_thread () from /lib/libpthread.so.0
> #10 0x00007f96c44b9cbd in clone () from /lib/libc.so.6
> #11 0x0000000000000000 in ?? ()
>
> /J

--
-- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/
changed notes changed state Suspended to Feedback
changed notes changed state Feedback to Closed
Same as ITS#6335