Full_Name: Leonid Yuriev
Version: 2.4.40
OS: RHEL7
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (31.130.36.33)

So far there are only a couple of backtraces. They come from a stress test of replication in the presence of sync conflicts after a "split brain" case. There were no network problems (loopback connections only).

** Signal 11 (Segmentation fault), address is 0x797a from 0x50e8bd
(0) /opt/openldap.devel/libexec/slapd() [0x50e8bd]: syncprov_op_abandon
    /home/ly/Projects/openldap.git/servers/slapd/overlays/syncprov.c:1134
(1) /opt/openldap.devel/libexec/slapd() [0x48b31a]: overlay_op_walk
    /home/ly/Projects/openldap.git/servers/slapd/backover.c:662
(2) /opt/openldap.devel/libexec/slapd() [0x48b4c1]: over_op_func
    /home/ly/Projects/openldap.git/servers/slapd/backover.c:724
(3) /opt/openldap.devel/libexec/slapd() [0x4429a7]: fe_op_abandon
    /home/ly/Projects/openldap.git/servers/slapd/abandon.c:134 (discriminator 2)
(4) /opt/openldap.devel/libexec/slapd() [0x42283c]: connection_abandon
    /home/ly/Projects/openldap.git/servers/slapd/connection.c:740 (discriminator 3)
(5) /opt/openldap.devel/libexec/slapd() [0x424509]: connection_closing
    /home/ly/Projects/openldap.git/servers/slapd/connection.c:829
(6) /opt/openldap.devel/libexec/slapd() [0x4250ef]: connection_read
    /home/ly/Projects/openldap.git/servers/slapd/connection.c:1477
    connection_read_thread
    /home/ly/Projects/openldap.git/servers/slapd/connection.c:1284
(7) /opt/openldap.devel/lib/libldap_r-2.4.so.2(+0x10cf2) [0x7f8c6de89cf2]: ??

*** Signal 11 (Segmentation fault), address is 0x7f7a1e1c5000 from 0x7f7a1eebfb54
(0) /lib/x86_64-linux-gnu/libc.so.6(+0x98b54) [0x7f7a1eebfb54]: ????:0
(1) /opt/openldap.devel/libexec/slapd() [0x4afa7d]: mdb_search
    /home/ly/Projects/openldap.git/servers/slapd/back-mdb/search.c:987 (discriminator 3)
(2) /opt/openldap.devel/libexec/slapd() [0x48b356]: 543edda2 overlay_op_walk
    /home/ly/Projects/openldap.git/servers/slapd/backover.c:674
(3) /opt/openldap.devel/libexec/slapd() [0x48b4c1]: over_op_func
    /home/ly/Projects/openldap.git/servers/slapd/backover.c:724
(4) /opt/openldap.devel/libexec/slapd() [0x427201]: fe_op_search
    /home/ly/Projects/openldap.git/servers/slapd/search.c:402
(5) /opt/openldap.devel/libexec/slapd() [0x426c0c]: do_search
    /home/ly/Projects/openldap.git/servers/slapd/search.c:247
(6) /opt/openld.d.devel/libexec/slapd() [0x424b54]: connection_operation
    /home/ly/Projects/openldap.git/servers/slapd/connection.c:1158
(7) /opt/openldap.devel/libexec/slapd() [0x42526c]: connection_read_thread
    /home/ly/Projects/openldap.git/servers/slapd/connection.c:1291
(8) /opt/openldap.devel/lib/libldap_r-2.4.so.2(+0x10cf2) [0x7f7a1f629cf2]: ??
One more case:

*** Signal 11 (Segmentation fault), address is 0xb from 0x442f07
(0) /opt/openldap.devel/libexec/slapd() [0x442f07]: test_filter
    /home/ly/Projects/openldap.git/servers/slapd/filterentry.c:69
(1) /opt/openldap.devel/libexec/slapd() [0x514721]: syncprov_matchops
    /home/ly/Projects/openldap.git/servers/slapd/overlays/syncprov.c:1316
(2) /opt/openldap.devel/libexec/slapd() [0x514b83]: syncprov_op_mod
    /home/ly/Projects/openldap.git/servers/slapd/overlays/syncprov.c:2145
(3) /opt/openldap.devel/libexec/slapd() [0x48b31a]: overlay_op_walk
    /home/ly/Projects/openldap.git/servers/slapd/backover.c:662
(4) /opt/openldap.devel/libexec/slapd() [0x48b4c1]: over_op_func
    /home/ly/Projects/openldap.git/servers/slapd/backover.c:724
(5) /opt/openldap.devel/libexec/slapd() [0x4811a6]: syncrepl_entry
    /home/ly/Projects/openldap.git/servers/slapd/syncrepl.c:3177
    do_syncrep2
    /home/ly/Projects/openldap.git/servers/slapd/syncrepl.c:1024
(6) /opt/openldap.devel/libexec/slapd() [0x4844b2]: do_syncrepl
    /home/ly/Projects/openldap.git/servers/slapd/syncrepl.c:1539
(7) /opt/openldap.devel/lib/libldap_r-2.4.so.2(+0x10cf2) [0x7f398891acf2]: ??
leo@yuriev.ru wrote:
> Full_Name: Leonid Yuriev
> Version: 2.4.40
> OS: RHEL7
> URL: ftp://ftp.openldap.org/incoming/
> Submission from: (NULL) (31.130.36.33)
>
>
> Currently there is a couple of backtraces only.
> This is the result of a stress test of replication in the presence of
> sync-conflicts after a "split brain" case.
> No any network troubles (just a loopback connections).

Looks like it's accessing freed memory, and it's also related to Abandon processing. You might be running into ITS#7967 as well. If you can reproduce this, try running with a malloc debugger.

>
> ** Signal 11 (Segmentation fault), address is 0x797a from 0x50e8bd
> (0) /opt/openldap.devel/libexec/slapd() [0x50e8bd]: syncprov_op_abandon
> /home/ly/Projects/openldap.git/servers/slapd/overlays/syncprov.c:1134
> (1) /opt/openldap.devel/libexec/slapd() [0x48b31a]: overlay_op_walk
> /home/ly/Projects/openldap.git/servers/slapd/backover.c:662
> (2) /opt/openldap.devel/libexec/slapd() [0x48b4c1]: over_op_func
> /home/ly/Projects/openldap.git/servers/slapd/backover.c:724
> (3) /opt/openldap.devel/libexec/slapd() [0x4429a7]: fe_op_abandon
> /home/ly/Projects/openldap.git/servers/slapd/abandon.c:134 (discriminator 2)
> (4) /opt/openldap.devel/libexec/slapd() [0x42283c]: connection_abandon
> /home/ly/Projects/openldap.git/servers/slapd/connection.c:740 (discriminator 3)
> (5) /opt/openldap.devel/libexec/slapd() [0x424509]: connection_closing
> /home/ly/Projects/openldap.git/servers/slapd/connection.c:829
> (6) /opt/openldap.devel/libexec/slapd() [0x4250ef]: connection_read
> /home/ly/Projects/openldap.git/servers/slapd/connection.c:1477
> connection_read_thread
> /home/ly/Projects/openldap.git/servers/slapd/connection.c:1284
> (7) /opt/openldap.devel/lib/libldap_r-2.4.so.2(+0x10cf2) [0x7f8c6de89cf2]: ??
>
> *** Signal 11 (Segmentation fault), address is 0x7f7a1e1c5000 from
> 0x7f7a1eebfb54
> (0) /lib/x86_64-linux-gnu/libc.so.6(+0x98b54) [0x7f7a1eebfb54]: ????:0
> (1) /opt/openldap.devel/libexec/slapd() [0x4afa7d]: mdb_search
> /home/ly/Projects/openldap.git/servers/slapd/back-mdb/search.c:987
> (discriminator 3)
> (2) /opt/openldap.devel/libexec/slapd() [0x48b356]: 543edda2 overlay_op_walk
> /home/ly/Projects/openldap.git/servers/slapd/backover.c:674
> (3) /opt/openldap.devel/libexec/slapd() [0x48b4c1]: over_op_func
> /home/ly/Projects/openldap.git/servers/slapd/backover.c:724
> (4) /opt/openldap.devel/libexec/slapd() [0x427201]: fe_op_search
> /home/ly/Projects/openldap.git/servers/slapd/search.c:402
> (5) /opt/openldap.devel/libexec/slapd() [0x426c0c]: do_search
> /home/ly/Projects/openldap.git/servers/slapd/search.c:247
> (6) /opt/openld.d.devel/libexec/slapd() [0x424b54]: connection_operation
> /home/ly/Projects/openldap.git/servers/slapd/connection.c:1158
> (7) /opt/openldap.devel/libexec/slapd() [0x42526c]: connection_read_thread
> /home/ly/Projects/openldap.git/servers/slapd/connection.c:1291
> (8) /opt/openldap.devel/lib/libldap_r-2.4.so.2(+0x10cf2) [0x7f7a1f629cf2]: ??
>
>

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
SIGSEGV once again. I think the problem is not in the code shown here but in the connection cancel/abandon code. It looks like a race condition with asynchronous connection dropping, but I have not yet reviewed enough of the code.

/servers/slapd/overlays/syncprov.c
@@ -1307,21 +1307,21 @@ syncprov_matchops( Operation *op, opcookie *opc, int saveit )
            op2.o_hdr = &oh;
            op2.o_extra = op->o_extra;
            op2.o_callback = NULL;
            if (ss->s_flags & PS_FIX_FILTER) {
                /* Skip the AND/GE clause that we stuck on in front. We
                   would lose deletes/mods that happen during the refresh
                   phase otherwise (ITS#6555) */
                op2.ors_filter = ss->s_op->ors_filter->f_and->f_next;
            }
            ldap_pvt_thread_mutex_unlock( &ss->s_mutex );
            rc = test_filter( &op2, e, op2.ors_filter );
        }

        Debug( LDAP_DEBUG_NONE, "syncprov_matchops: sid %03x fscope %d rc %d\n",
            ss->s_sid, fc.fscope, rc );

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f9762ffe700 (LWP 29507)]
test_filter (op=0x7f9762ffc210, e=0x7f96f19c37d8, f=0x20) at filterentry.c:69
69          if ( f->f_choice & SLAPD_FILTER_UNDEFINED ) {

(0) /opt/openldap.devel/libexec/slapd() [0x4430b7]: test_filter
    /home/ly/Projects/openldap.git/servers/slapd/filterentry.c:69
(1) /opt/openldap.devel/libexec/slapd() [0x515081]: syncprov_matchops
    /home/ly/Projects/openldap.git/servers/slapd/overlays/syncprov.c:1317
(2) /opt/openldap.devel/libexec/slapd() [0x515f43]: syncprov_op_response
    /home/ly/Projects/openldap.git/servers/slapd/overlays/syncprov.c:1941
(3) /opt/openldap.devel/libexec/slapd() [0x434163]: slap_response_play
    /home/ly/Projects/openldap.git/servers/slapd/result.c:509
(4) /opt/openldap.devel/libexec/slapd() [0x4346ca]: send_ldap_response
    /home/ly/Projects/openldap.git/servers/slapd/result.c:584
(5) /opt/openldap.devel/libexec/slapd() [0x435062]: slap_send_ldap_result
    /home/ly/Projects/openldap.git/servers/slapd/result.c:861
(6) /opt/openldap.devel/libexec/slapd() [0x4cb2e9]: mdb_add
    /home/ly/Projects/openldap.git/servers/slapd/back-mdb/add.c:434
(7) /opt/openldap.devel/libexec/slapd() [0x48b506]: overlay_op_walk
    /home/ly/Projects/openldap.git/servers/slapd/backover.c:674
(8) /opt/openldap.devel/libexec/slapd() [0x48b671]: over_op_func
    /home/ly/Projects/openldap.git/servers/slapd/backover.c:724
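To illustrate the kind of race suspected above (a filter pointer taken under s_mutex, then used after the unlock while an abandon may free it), here is a minimal sketch of one common defence: pinning the object with a use count while it is read outside the lock. This is not the slapd code and not the eventual fix. The struct psearch type and the psearch_* functions are hypothetical stand-ins, and an int stands in for the filter tree.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a persistent-search record. */
struct psearch {
    pthread_mutex_t mutex;
    int inuse;      /* readers currently using 'filter' outside the lock */
    int abandoned;  /* set once the owning operation has been abandoned */
    int *filter;    /* stand-in for the filter tree */
};

struct psearch *psearch_new(int filter_value)
{
    struct psearch *ps = calloc(1, sizeof(*ps));
    assert(ps != NULL);
    pthread_mutex_init(&ps->mutex, NULL);
    ps->filter = malloc(sizeof(*ps->filter));
    assert(ps->filter != NULL);
    *ps->filter = filter_value;
    return ps;
}

/* Reader: take a pin under the lock. Returns 0 if already abandoned. */
int psearch_pin(struct psearch *ps)
{
    pthread_mutex_lock(&ps->mutex);
    int ok = !ps->abandoned;
    if (ok)
        ps->inuse++;
    pthread_mutex_unlock(&ps->mutex);
    return ok;
}

/* Reader: drop the pin. The last user of an abandoned record frees the
 * filter. Returns 1 if the filter was freed by this call. */
int psearch_unpin(struct psearch *ps)
{
    pthread_mutex_lock(&ps->mutex);
    int last = (--ps->inuse == 0 && ps->abandoned);
    if (last) {
        free(ps->filter);
        ps->filter = NULL;
    }
    pthread_mutex_unlock(&ps->mutex);
    return last;
}

/* Abandon: mark the record, and free the filter only if nobody holds a
 * pin. Returns 1 if the filter was freed immediately. */
int psearch_abandon(struct psearch *ps)
{
    pthread_mutex_lock(&ps->mutex);
    ps->abandoned = 1;
    int idle = (ps->inuse == 0);
    if (idle) {
        free(ps->filter);
        ps->filter = NULL;
    }
    pthread_mutex_unlock(&ps->mutex);
    return idle;
}
```

With this pattern a reader calls psearch_pin() before touching the filter and psearch_unpin() afterwards, so an abandon that arrives in between defers the free instead of leaving a dangling pointer such as the f=0x20 seen in test_filter.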
I cherry-picked the fix for ITS#7967 and the others from master: no change in behavior, still a reproducible SIGSEGV.
leo@yuriev.ru wrote:
> I am cherry-picked the fix of ITS#7967 and other from master - no
> changes in behavior, just a stable sigsegv.

OK. We've had abandon cleanup issues in the area of code you highlighted before; just need a simple method to reproduce the error.

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
Current OPENLDAP_REL_ENG_2_4 (6b26910 Silence compiler warning...) with current mdb.master merged in (9a72292 ITS#7961,#7987 Re-fix txn init).
- Cluster of 4 nodes, all on a single machine.
- Loopback network only, no failures of any kind.
- Multi-master by configuration, but all writes go to the first node only.

A testcase will be available shortly (config + script).

Core was generated by `/opt/openldap.devel/libexec/slapd -l LOCAL5 -d 0 -s 0 -4 -h ldap://10.4.0.1:1114'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007ff68564d07b in ?? () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x00007ff68564d07b in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x000000000047fb3c in do_syncrep2 (op=0x7ff62f7fd740, si=0x1c74190) at syncrepl.c:934
#2 0x00000000004838c3 in do_syncrepl (ctx=<optimised out>, arg=0x1c746d0) at syncrepl.c:1539
#3 0x00000000004250a8 in connection_read_thread (ctx=0x7ff62f7fdbd0, argv=0x35) at connection.c:1293
#4 0x00007ff685ceecf2 in ldap_int_thread_pool_wrapper (xpool=0x1c27090) at tpool.c:688
#5 0x00007ff6858b90a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#6 0x00007ff6855e684d in clone () from /lib/x86_64-linux-gnu/libc.so.6

> if ( !BER_BVISNULL( &syncCookie.octet_str ) )
> {
>     slap_parse_sync_cookie( &syncCookie, NULL );
>     if ( syncCookie.ctxcsn ) {
>         int i, sid = slap_parse_csn_sid( syncCookie.ctxcsn );
>         check_syncprov( op, si );
>         for ( i =0; i<si->si_cookieState->cs_num; i++ ) {
>             /* new SID */
>             if ( sid < si->si_cookieState->cs_sids[i] )
>                 break;
>             if ( si->si_cookieState->cs_sids[i] == sid ) {
syncrepl.c:934
>                 if ( ber_bvcmp( syncCookie.ctxcsn, &si->si_cookieState->cs_vals[i] ) <= 0 ) {
>                     bdn.bv_val[bdn.bv_len] = '\0';
>                     Debug( LDAP_DEBUG_SYNC, "do_syncrep2: %s CSN too old, ignoring %s (%s)\n",
>                         si->si_ridtxt, syncCookie.ctxcsn->bv_val, bdn.bv_val );
>                     ldap_controls_free( rctrls );
>                     rc = 0;
>                     si->si_too_old = 1;
>                     goto done;
>                 }
>                 si->si_too_old = 0;
>                 break;
>             }
>         }
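For readers following the quoted loop, the "CSN too old" check can be sketched in standalone C as below. This is a simplified illustration, not the slapd implementation: csn_sid(), csn_too_old() and struct cookie_state are hypothetical names, plain C strings replace struct berval, and strcmp() replaces ber_bvcmp() on the assumption that all CSNs are well formed and of equal length. It relies on the CSN layout timestamp#count#sid#mod, where the SID is a hexadecimal field, and it deliberately ignores locking.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return the SID (third '#'-separated field, hex)
 * of a CSN string, or -1 if the string is malformed. */
int csn_sid(const char *csn)
{
    assert(csn != NULL);
    const char *p = strchr(csn, '#');   /* end of the timestamp */
    if (p == NULL)
        return -1;
    p = strchr(p + 1, '#');             /* end of the change count */
    if (p == NULL)
        return -1;
    char *end;
    long sid = strtol(p + 1, &end, 16);
    if (end == p + 1 || *end != '#' || sid < 0 || sid > 0xfff)
        return -1;
    return (int)sid;
}

/* Hypothetical stand-in for the cookie state: SIDs sorted ascending,
 * with one recorded CSN per SID. */
struct cookie_state {
    int num;
    const int *sids;
    const char *const *vals;
};

/* Returns 1 if 'csn' is not newer than the CSN recorded for its SID
 * ("too old"), 0 if it is newer or the SID is unknown, -1 if malformed. */
int csn_too_old(const struct cookie_state *cs, const char *csn)
{
    int sid = csn_sid(csn);
    if (sid < 0)
        return -1;
    for (int i = 0; i < cs->num; i++) {
        if (sid < cs->sids[i])
            break;                      /* new SID: sorted list has no entry */
        if (cs->sids[i] == sid)
            return strcmp(csn, cs->vals[i]) <= 0;
    }
    return 0;
}
```

Because the CSN timestamp is a fixed-width string, a byte-wise comparison orders CSNs chronologically, which is why the quoted code can decide "too old" with a single comparison per SID.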
A simple testcase is attached. All activity (add/delete/read) goes through the first node of the 4-node cluster. Unfortunately it may take a long time to reproduce the bug (core dump), anywhere from 10 minutes to 2-3 hours.

Leonid.
Program terminated with signal 11, Segmentation fault. #0 0x000000000047fb64 in do_syncrep2 (op=0x7f8c494e7740, si=0x19967e0) at syncrepl.c:892 892 bdn.bv_val[bdn.bv_len] = '\0'; (gdb) bt #0 0x000000000047fb64 in do_syncrep2 (op=0x7f8c494e7740, si=0x19967e0) at syncrepl.c:892 #1 0x0000000000483903 in do_syncrepl (ctx=<optimised out>, arg=0x1996410) at syncrepl.c:1551 #2 0x00000000004250e8 in connection_input (cri=<optimised out>, conn=<optimised out>) at connection.c:1732 #3 connection_read (cri=<optimised out>, s=<optimised out>) at connection.c:1460 #4 connection_read_thread (ctx=0x7f8c494e7bd0, argv=0x21) at connection.c:1284 #5 0x00007f8c6bb5bd22 in ldap_int_thread_pool_wrapper (xpool=0x194a090) at tpool.c:688 #6 0x00007f8c6b7260a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 #7 0x00007f8c6b45384d in clone () from /lib/x86_64-linux-gnu/libc.so.6 (gdb) info locals syncUUID = {{bv_len = 16, bv_val = 0x7f8c40106ad7 "\023Z,P\f!\020\064\223e\177\031\357Ζ:"}, {bv_len = 0, bv_val = 0x0}} cookie = {bv_len = 60, bv_val = 0x7f8c40106ae9 "rid=001,sid=001,csn=20141129143806.208485Z#000000#001#000000"} rctrls = 0x7f8c40104ad0 bdn = {bv_len = 34, bv_val = 0x7f8c40105209 "cn=tablet,uid=1756,dc=ngdr,dc=ldap"} si_tag = 1 syncstate = 1 retdata = 0x19eb878 retoid = 0x0 syncUUIDs = 0x0 len = 60 berbuf = { buffer = "\002\000\001", '\000' <repeats 29 times>, "\320j\020@\214\177\000\000%k\020@\214\177\000\000%k\020@\214\177", '\000' <repeats 34 times>, "i\315rk\214\177\000\000\000\000\000\000\000\000\000\000xQ\224k\214\177\000\000\340b\231\001", '\000' <repeats 36 times>, "\a\000\000\000\000\000\000\000\020c\231\001\000\000\000\000\a\000\000\000\000\000\000\000@b\231\001\000\000\000\000\006\000\000\000\000\000\000\000\300b\231\001\000\000\000\000"..., ialign = 65538, lalign = 65538, falign = 9.18382988e-41, dalign = 3.2380074297143616e-319, palign = 0x10002 <Address 0x10002 out of bounds>} msg = 0x7f8c40102be0 syncCookie = {ctxcsn = 0x7f8c401038f0, sids = 
0x7f8c40105890, numcsns = 1, rid = 1, octet_str = {bv_len = 60, bv_val = 0x7f8c40104e10 "rid=001,sid=001,csn=20141129143806.208485Z#000000#001#000000"}, sid = 1, sc_next = {stqe_next = 0x0}} syncCookie_req = {ctxcsn = 0x7f8c40105e80, sids = 0x7f8c40105020, numcsns = 1, rid = 1, octet_str = {bv_len = 60, bv_val = 0x7f8c40105760 "rid=001,sid=001,csn=20141129143806.208232Z#000000#001#000000"}, sid = 1, sc_next = {stqe_next = 0x0}} rc = 100 err = 0 modlist = 0x7f8c40106a70 m = 32652 tout = {tv_sec = 0, tv_usec = 0} refreshDeletes = 0 empty = "empty" (gdb)
Program terminated with signal 11, Segmentation fault. #0 0x00007ffda90d607b in ?? () from /lib/x86_64-linux-gnu/libc.so.6 (gdb) bt #0 0x00007ffda90d607b in ?? () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x000000000047fb7c in do_syncrep2 (op=0x7ffd68cf5740, si=0x1681d00) at syncrepl.c:893 #2 0x0000000000483903 in do_syncrepl (ctx=<optimised out>, arg=0x1682090) at syncrepl.c:1551 #3 0x00000000004250e8 in connection_input (cri=<optimised out>, conn=<optimised out>) at connection.c:1732 #4 connection_read (cri=<optimised out>, s=<optimised out>) at connection.c:1460 #5 connection_read_thread (ctx=0x7ffd68cf5bd0, argv=0x25) at connection.c:1284 #6 0x00007ffda9777d22 in ldap_int_thread_pool_wrapper (xpool=0x1635090) at tpool.c:688 #7 0x00007ffda93420a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 #8 0x00007ffda906f84d in clone () from /lib/x86_64-linux-gnu/libc.so.6 (gdb) frame 1 #1 0x000000000047fb7c in do_syncrep2 (op=0x7ffd68cf5740, si=0x1681d00) at syncrepl.c:893 893 Debug( LDAP_DEBUG_ANY, "do_syncrep2: %s malformed message (%s)\n", (gdb) info locals syncUUID = {{bv_len = 16, bv_val = 0x7ffd543a98b7 "\177]\201N\f'\020\064\235\255Չ\v\247\331", <incomplete sequence \351>}, {bv_len = 128849018894, bv_val = 0x7ffd68cf4ec4 "\375\177"}} cookie = {bv_len = 60, bv_val = 0x7ffd543a98c9 "rid=002,sid=002,csn=20141129152404.971057Z#000000#001#000000"} rctrls = 0x7ffd543f7130 bdn = {bv_len = 33, bv_val = 0x7ffd543706f7 "cn=modem,uid=4711,dc=ngdr,dc=ldap"} si_tag = 1 syncstate = 3 retdata = 0x7ffd68cf57c0 retoid = 0xb <Address 0xb out of bounds> syncUUIDs = 0x7ffd68cf57c0 len = 60 berbuf = { buffer = "\002\000\001", '\000' <repeats 29 times>, "\260\230:T\375\177\000\000\005\231:T\375\177\000\000\005\231:T\375\177", '\000' <repeats 34 times>, "i\215\064\251\375\177\000\000\000\000\000\000\000\000\000\000x\021V\251\375\177\000\000\340\022h\001", '\000' <repeats 36 times>, "\060S\317h\375\177\000\000Q\020", '\000' <repeats 14 times>, "X", '\000' <repeats 15 times>, 
"\300U\317h\375\177\000\000"..., ialign = 65538, lalign = 65538, falign = 9.18382988e-41, dalign = 3.2380074297143616e-319, palign = 0x10002 <Address 0x10002 out of bounds>} msg = 0x7ffd543f3c50 syncCookie = {ctxcsn = 0x7ffd543f3f10, sids = 0x7ffd542eb890, numcsns = 1, rid = 2, octet_str = {bv_len = 60, bv_val = 0x7ffd5440f6d0 "rid=002,sid=002,csn=20141129152404.971057Z#000000#001#000000"}, sid = 2, sc_next = {stqe_next = 0x0}} syncCookie_req = {ctxcsn = 0x7ffd5440f380, sids = 0x7ffd542ebae0, numcsns = 5, rid = 2, octet_str = {bv_len = 224, bv_val = 0x7ffd543714e0 "rid=002,sid=004,csn=20141129152404.970764Z#000000#001#000000;20141129151341.491595Z#000000#002#000000;20141129151341.507685Z#000000#003#000000;20141129151341.523508Z#000000#004#000000;20141129151341.5"...}, sid = 4, sc_next = {stqe_next = 0x0}} rc = 100 err = 0 modlist = 0x0 m = 32765 tout = {tv_sec = 0, tv_usec = 0} refreshDeletes = 0 empty = "empty" (gdb) p *si $1 = {si_next = 0x1682200, si_be = 0x1680940, si_wbe = 0x1680940, si_re = 0x1682090, si_rid = 2, si_ridtxt = "rid=002", si_bindconf = {sb_uri = {bv_len = 22, bv_val = 0x16816b0 "ldap://10.2.0.1:11113/"}, sb_version = 3, sb_tls = 0, sb_method = 128, sb_timeout_api = 10, sb_timeout_net = 0, sb_binddn = {bv_len = 19, bv_val = 0x1681690 "uid=replica,dc=ldap"}, sb_cred = {bv_len = 3, bv_val = 0x1682110 "xyz"}, sb_saslmech = {bv_len = 0, bv_val = 0x0}, sb_secprops = 0x0, sb_realm = {bv_len = 0, bv_val = 0x0}, sb_authcId = {bv_len = 0, bv_val = 0x0}, sb_authzId = {bv_len = 0, bv_val = 0x0}, sb_keepalive = {sk_idle = 1, sk_probes = 1, sk_interval = 1}, sb_tls_ctx = 0x0, sb_tls_cert = 0x0, sb_tls_key = 0x0, sb_tls_cacert = 0x0, sb_tls_cacertdir = 0x0, sb_tls_reqcert = 0x0, sb_tls_cipher_suite = 0x0, sb_tls_protocol_min = 0x0, sb_tls_crlcheck = 0x0, sb_tls_do_init = 0}, si_base = {bv_len = 15, bv_val = 0x1682030 "dc=ngdr,dc=ldap"}, si_logbase = { bv_len = 0, bv_val = 0x0}, si_filterstr = {bv_len = 15, bv_val = 0x1681200 "(objectclass=*)"}, si_filter = 
0x1682010, si_logfilterstr = {bv_len = 0, bv_val = 0x0}, si_contextdn = {bv_len = 7, bv_val = 0x16811e0 "dc=ldap"}, si_scope = 2, si_attrsonly = 0, si_anfile = 0x0, si_anlist = 0x1682130, si_exanlist = 0x1681fc0, si_attrs = 0x169f0f0, si_exattrs = 0x0, si_allattrs = 1, si_allopattrs = 1, si_schemachecking = 0, si_type = 3, si_ctype = 3, si_interval = 60, si_retryinterval = 0x1681ff0, si_retrynum_init = 0x1682070, si_retrynum = 0x1682050, si_syncCookie = { ctxcsn = 0x7ffd7c1d87d0, sids = 0x7ffd54411150, numcsns = 5, rid = 2, octet_str = {bv_len = 224, bv_val = 0x7ffd543f57f0 "rid=002,sid=004,csn=20141129152404.970905Z#000000#001#000000;20141129151341.491595Z#000000#002#000000;20141129151341.507685Z#000000#003#000000;20141129151341.523508Z#000000#004#000000;20141129151341.5"...}, sid = 4, sc_next = {stqe_next = 0x0}}, si_cookieState = 0x1681c70, si_cookieAge = 1539004, si_manageDSAit = 0, si_slimit = 0, si_tlimit = 0, si_refreshDelete = 0, si_refreshPresent = 1, si_refreshDone = 1, si_syncdata = 0, si_logstate = 0, si_got = 269715, si_strict_refresh = 0, si_too_old = 0, si_msgid = 2, si_presentlist = 0x7ffd7c58e050, si_ld = 0x7ffd7c718210, si_conn = 0x7ffda9a365d0, si_nonpresentlist = {lh_first = 0x0}, si_rewrite = 0x0, si_suffixm = {bv_len = 0, bv_val = 0x0}, si_mutex = {__data = {__lock = 1, __count = 0, __owner = 13107, __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\001\000\000\000\000\000\000\000\063\063\000\000\001", '\000' <repeats 26 times>, __align = 1}} (gdb) p si->si_ridtxt $2 = "rid=002" (gdb) p (void*)si->si_ridtxt $3 = (void *) 0x1681d24
Here is some of Valgrind's DRD output. The races it reports around connect/disconnect and the overlay walk seem sufficient to cause a crash.
Partially fixed.
The patch is against current OPENLDAP_REL_ENG_2_4, but also applies to master.

Leonid.

--

The attached files is derived from OpenLDAP Software. All of the modifications to OpenLDAP Software represented in the following patch(es) were developed by Peter-Service LLC, Moscow, Russia. Peter-Service LLC has not assigned rights and/or interest in this work to any party. I, Leonid Yuriev am authorized by Peter-Service LLC, my employer, to release this work under the following terms.

Peter-Service LLC hereby places the following modifications to OpenLDAP Software (and only these modifications) into the public domain. Hence, these modifications may be freely used and/or redistributed for any purpose with or without attribution and/or other notice.
Leonid Yuriev wrote:
> Partially fixed.
> Patch is for current OPENLDAP_REL_ENG_2_4, but applicable for master.

Thanks, the patch makes sense. But if only partial, what else is still crashing?

>
> Leonid.
>
> --
>
> The attached files is derived from OpenLDAP Software. All of the
> modifications to OpenLDAP Software represented in the following
> patch(es) were developed by Peter-Service LLC, Moscow, Russia.
> Peter-Service LLC has not assigned rights and/or interest in this
> work to any party. I, Leonid Yuriev am authorized by Peter-Service
> LLC, my employer, to release this work under the following terms.
>
> Peter-Service LLC hereby places the following modifications to
> OpenLDAP Software (and only these modifications) into the public
> domain. Hence, these modifications may be freely used and/or
> redistributed for any purpose with or without attribution and/or
> other notice.
>
>

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
2014-12-02 6:09 GMT+03:00 Howard Chu <hyc@symas.com>:
> Leonid Yuriev wrote:
>>
>> Partially fixed.
>> Patch is for current OPENLDAP_REL_ENG_2_4, but applicable for master.
>
>
> Thanks, the patch makes sense. But if only partial, what else is still
> crashing?

A reproducible crash within 5 seconds after the 4-node cluster recovers from the split-brain condition:

Program terminated with signal SIGSEGV, Segmentation fault.
#0 __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:116
116 ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: No such file or directory.
(gdb) bt
#0 __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:116
#1 0x00000000004aee11 in memcpy (__len=<optimised out>, __src=<optimised out>, __dest=<optimised out>) at /usr/include/x86_64-linux-gnu/bits/string3.h:51
#2 mdb_search (op=0x7f16b4ffa7c0, rs=0x7f16b4e697b0) at search.c:993
#3 0x000000000048a52a in overlay_op_walk (op=op@entry=0x7f16b4ffa7c0, rs=0x7f16b4ff9c40, which=op_search, oi=0xc31a30, on=<optimised out>) at backover.c:676
#4 0x000000000048a681 in over_op_func (op=0x7f16b4ffa7c0, rs=<optimised out>, which=<optimised out>) at backover.c:729
#5 0x000000000047de37 in syncrepl_del_nonpresent (op=0x7f16b4ffa7c0, si=0xc31490, uuids=<optimised out>, m=3, sc=<optimised out>, sc=<optimised out>) at syncrepl.c:3400
#6 0x0000000000481a0e in do_syncrep2 (op=0x7f16b4ffa7c0, si=0xc31490) at syncrepl.c:1346
#7 0x00000000004839e3 in do_syncrepl (ctx=<optimised out>, arg=0xc319d0) at syncrepl.c:1550
#8 0x00007f170283ecf2 in ldap_int_thread_pool_wrapper (xpool=0xbe4090) at tpool.c:688
> 2014-12-02 6:09 GMT+03:00 Howard Chu <hyc@symas.com>:
>> Leonid Yuriev wrote:
>>> Partially fixed.
>>> Patch is for current OPENLDAP_REL_ENG_2_4, but applicable for master.
>>
>> Thanks, the patch makes sense. But if only partial, what else is still
>> crashing?
> Stable crash in a 5 seconds after the 4x-cluster resumes from
> split-brain condition:
>
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0 __memcpy_sse2_unaligned () at
> ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:116
> 116 ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: No such
> file or directory.
> (gdb) bt
> #0 __memcpy_sse2_unaligned () at
> ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:116
> #1 0x00000000004aee11 in memcpy (__len=<optimised out>,
> __src=<optimised out>, __dest=<optimised out>) at
> /usr/include/x86_64-linux-gnu/bits/string3.h:51
> #2 mdb_search (op=0x7f16b4ffa7c0, rs=0x7f16b4e697b0) at search.c:993

This now seems to be fixed completely. The last SIGSEGV was caused by a "dreamcatcher" feature (ITS#7974). I have already fixed the "dreamcatcher", and we are currently re-testing it.

Leonid.
changed notes
changed state Open to Test
moved from Incoming to Software Bugs
I noticed that master has lost some of the changes from my original patch (the one attached to this ITS earlier).

Please see the attached diff. I think there is a race condition around si_cookieState inside the for loop.
Leonid Yuriev wrote:
> I saw that the master is lost some changes of the my original patch
> (that was early attached to ITS).
>
> Please see attached diff.
> I think it is a race condition around si_cookieState inside for-loop.

Prove it. The fields being accessed inside the loop are for the pending values, and the pending mutex has already been acquired outside the loop.

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
changed notes
changed state Test to Release
fixed in master
fixed in RE25
fixed in RE24
changed notes
changed state Release to Closed