Issue 7968 - SIGSEGV shortly after reconnection performed by syncrepl due to synchronization conflicts
Status: VERIFIED FIXED
Alias: None
Product: OpenLDAP
Classification: Unclassified
Component: slapd
Version: 2.4.40
Hardware: All All
Importance: --- normal
Target Milestone: ---
Assignee: OpenLDAP project
Reported: 2014-10-15 21:24 UTC by Leonid Yuriev
Modified: 2015-07-02 17:45 UTC

Attachments
valgrind-drd.log.gz (3.14 KB, application/gzip)
2014-11-29 19:04 UTC, Leonid Yuriev
its#7968-lost.patch (1.30 KB, patch)
2014-12-10 04:58 UTC, Leonid Yuriev
testcase-its7968-1.tar.gz (3.72 KB, application/gzip)
2014-11-29 14:41 UTC, Leonid Yuriev
slapd-syncrepl-locking.patch (3.39 KB, patch)
2014-12-01 23:13 UTC, Leonid Yuriev

Description Leonid Yuriev 2014-10-15 21:24:43 UTC
Full_Name: Leonid Yuriev
Version: 2.4.40
OS: RHEL7
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (31.130.36.33)


So far I only have a couple of backtraces.
They come from a stress test of replication in the presence of
sync conflicts after a "split brain" case.
There were no network problems (loopback connections only).

** Signal 11 (Segmentation fault), address is 0x797a from 0x50e8bd
(0) /opt/openldap.devel/libexec/slapd() [0x50e8bd]: syncprov_op_abandon
/home/ly/Projects/openldap.git/servers/slapd/overlays/syncprov.c:1134
(1) /opt/openldap.devel/libexec/slapd() [0x48b31a]: overlay_op_walk
/home/ly/Projects/openldap.git/servers/slapd/backover.c:662
(2) /opt/openldap.devel/libexec/slapd() [0x48b4c1]: over_op_func
/home/ly/Projects/openldap.git/servers/slapd/backover.c:724
(3) /opt/openldap.devel/libexec/slapd() [0x4429a7]: fe_op_abandon
/home/ly/Projects/openldap.git/servers/slapd/abandon.c:134 (discriminator 2)
(4) /opt/openldap.devel/libexec/slapd() [0x42283c]: connection_abandon
/home/ly/Projects/openldap.git/servers/slapd/connection.c:740 (discriminator 3)
(5) /opt/openldap.devel/libexec/slapd() [0x424509]: connection_closing
/home/ly/Projects/openldap.git/servers/slapd/connection.c:829
(6) /opt/openldap.devel/libexec/slapd() [0x4250ef]: connection_read
/home/ly/Projects/openldap.git/servers/slapd/connection.c:1477
connection_read_thread
/home/ly/Projects/openldap.git/servers/slapd/connection.c:1284
(7) /opt/openldap.devel/lib/libldap_r-2.4.so.2(+0x10cf2) [0x7f8c6de89cf2]: ??

*** Signal 11 (Segmentation fault), address is 0x7f7a1e1c5000 from
0x7f7a1eebfb54
(0) /lib/x86_64-linux-gnu/libc.so.6(+0x98b54) [0x7f7a1eebfb54]: ????:0
(1) /opt/openldap.devel/libexec/slapd() [0x4afa7d]: mdb_search
/home/ly/Projects/openldap.git/servers/slapd/back-mdb/search.c:987
(discriminator 3)
(2) /opt/openldap.devel/libexec/slapd() [0x48b356]: 543edda2 overlay_op_walk
/home/ly/Projects/openldap.git/servers/slapd/backover.c:674
(3) /opt/openldap.devel/libexec/slapd() [0x48b4c1]: over_op_func
/home/ly/Projects/openldap.git/servers/slapd/backover.c:724
(4) /opt/openldap.devel/libexec/slapd() [0x427201]: fe_op_search
/home/ly/Projects/openldap.git/servers/slapd/search.c:402
(5) /opt/openldap.devel/libexec/slapd() [0x426c0c]: do_search
/home/ly/Projects/openldap.git/servers/slapd/search.c:247
(6) /opt/openldap.devel/libexec/slapd() [0x424b54]: connection_operation
/home/ly/Projects/openldap.git/servers/slapd/connection.c:1158
(7) /opt/openldap.devel/libexec/slapd() [0x42526c]: connection_read_thread
/home/ly/Projects/openldap.git/servers/slapd/connection.c:1291
(8) /opt/openldap.devel/lib/libldap_r-2.4.so.2(+0x10cf2) [0x7f7a1f629cf2]: ??
Comment 1 Leonid Yuriev 2014-10-15 22:25:45 UTC
One more case:

*** Signal 11 (Segmentation fault), address is 0xb from 0x442f07

(0) /opt/openldap.devel/libexec/slapd() [0x442f07]: test_filter
/home/ly/Projects/openldap.git/servers/slapd/filterentry.c:69
(1) /opt/openldap.devel/libexec/slapd() [0x514721]: syncprov_matchops
/home/ly/Projects/openldap.git/servers/slapd/overlays/syncprov.c:1316
(2) /opt/openldap.devel/libexec/slapd() [0x514b83]: syncprov_op_mod
/home/ly/Projects/openldap.git/servers/slapd/overlays/syncprov.c:2145
(3) /opt/openldap.devel/libexec/slapd() [0x48b31a]: overlay_op_walk
/home/ly/Projects/openldap.git/servers/slapd/backover.c:662
(4) /opt/openldap.devel/libexec/slapd() [0x48b4c1]: over_op_func
/home/ly/Projects/openldap.git/servers/slapd/backover.c:724
(5) /opt/openldap.devel/libexec/slapd() [0x4811a6]: syncrepl_entry
/home/ly/Projects/openldap.git/servers/slapd/syncrepl.c:3177
do_syncrep2
/home/ly/Projects/openldap.git/servers/slapd/syncrepl.c:1024
(6) /opt/openldap.devel/libexec/slapd() [0x4844b2]: do_syncrepl
/home/ly/Projects/openldap.git/servers/slapd/syncrepl.c:1539
(7) /opt/openldap.devel/lib/libldap_r-2.4.so.2(+0x10cf2) [0x7f398891acf2]: ??

Comment 2 Howard Chu 2014-10-16 06:40:43 UTC
leo@yuriev.ru wrote:
> Full_Name: Leonid Yuriev
> Version: 2.4.40
> OS: RHEL7
> URL: ftp://ftp.openldap.org/incoming/
> Submission from: (NULL) (31.130.36.33)
>
>
> So far I only have a couple of backtraces.
> They come from a stress test of replication in the presence of
> sync conflicts after a "split brain" case.
> There were no network problems (loopback connections only).

Looks like it's accessing freed memory, and it's also related to Abandon 
processing. You might be running into ITS#7967 as well. If you can reproduce 
this, try running with a malloc debugger.

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Comment 3 Leonid Yuriev 2014-10-16 16:22:28 UTC
Another SIGSEGV.

I think the problem is not in the code shown below, but in the connection
cancel/abandon code. It looks like a race condition with asynchronous
connection dropping, but I have not reviewed enough of the code yet.

/servers/slapd/overlays/syncprov.c
@@ -1307,21 +1307,21 @@ syncprov_matchops( Operation *op, opcookie *opc, int saveit )
                        op2.o_hdr = &oh;
                        op2.o_extra = op->o_extra;
                        op2.o_callback = NULL;
                        if (ss->s_flags & PS_FIX_FILTER) {
                                /* Skip the AND/GE clause that we stuck on in front. We
                                   would lose deletes/mods that happen during the refresh
                                   phase otherwise (ITS#6555) */
                                op2.ors_filter = ss->s_op->ors_filter->f_and->f_next;
                        }
                        ldap_pvt_thread_mutex_unlock( &ss->s_mutex );
                        rc = test_filter( &op2, e, op2.ors_filter );
                }

                Debug( LDAP_DEBUG_NONE, "syncprov_matchops: sid %03x fscope %d rc %d\n",
                        ss->s_sid, fc.fscope, rc );


Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f9762ffe700 (LWP 29507)]
test_filter (op=0x7f9762ffc210, e=0x7f96f19c37d8, f=0x20) at filterentry.c:69
69              if ( f->f_choice & SLAPD_FILTER_UNDEFINED ) {

(0) /opt/openldap.devel/libexec/slapd() [0x4430b7]: test_filter
/home/ly/Projects/openldap.git/servers/slapd/filterentry.c:69
(1) /opt/openldap.devel/libexec/slapd() [0x515081]: syncprov_matchops
/home/ly/Projects/openldap.git/servers/slapd/overlays/syncprov.c:1317
(2) /opt/openldap.devel/libexec/slapd() [0x515f43]: syncprov_op_response
/home/ly/Projects/openldap.git/servers/slapd/overlays/syncprov.c:1941
(3) /opt/openldap.devel/libexec/slapd() [0x434163]: slap_response_play
/home/ly/Projects/openldap.git/servers/slapd/result.c:509
(4) /opt/openldap.devel/libexec/slapd() [0x4346ca]: send_ldap_response
/home/ly/Projects/openldap.git/servers/slapd/result.c:584
(5) /opt/openldap.devel/libexec/slapd() [0x435062]: slap_send_ldap_result
/home/ly/Projects/openldap.git/servers/slapd/result.c:861
(6) /opt/openldap.devel/libexec/slapd() [0x4cb2e9]: mdb_add
/home/ly/Projects/openldap.git/servers/slapd/back-mdb/add.c:434
(7) /opt/openldap.devel/libexec/slapd() [0x48b506]: overlay_op_walk
/home/ly/Projects/openldap.git/servers/slapd/backover.c:674
(8) /opt/openldap.devel/libexec/slapd() [0x48b671]: over_op_func
/home/ly/Projects/openldap.git/servers/slapd/backover.c:724
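
Editorial sketch: the excerpt above releases ss->s_mutex before calling test_filter() on a filter that belongs to ss->s_op, so one way a crash like this could arise is if the abandon path frees that operation in between. As a minimal illustration of a common defence (not the fix that was committed for this ITS), the shared object can be pinned under the lock so that whichever side finishes last frees it. Every name here (session, session_pin, session_unpin, session_abandon) is hypothetical, and plain pthreads stand in for the ldap_pvt_thread wrappers.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a persistent-search session.
 * This is NOT the real syncprov data structure. */
typedef struct session {
    pthread_mutex_t mutex;
    int inuse;       /* number of threads currently using `filter` */
    int abandoned;   /* set once the owning operation was abandoned */
    char *filter;    /* shared data the abandon path wants to free */
} session;

static char *dup_string(const char *s)
{
    size_t n = strlen(s) + 1;
    char *p = malloc(n);
    if (p != NULL)
        memcpy(p, s, n);
    return p;
}

session *session_new(const char *filter)
{
    session *s = calloc(1, sizeof *s);
    assert(s != NULL);
    pthread_mutex_init(&s->mutex, NULL);
    s->filter = dup_string(filter);
    return s;
}

/* Take a reference under the lock. Returns NULL if already abandoned. */
const char *session_pin(session *s)
{
    const char *f = NULL;
    pthread_mutex_lock(&s->mutex);
    if (!s->abandoned) {
        s->inuse++;
        f = s->filter;
    }
    pthread_mutex_unlock(&s->mutex);
    return f;
}

/* Drop a reference. The last user of an abandoned session frees the
 * filter. Returns 1 if the filter was freed by this call. */
int session_unpin(session *s)
{
    int freed = 0;
    pthread_mutex_lock(&s->mutex);
    assert(s->inuse > 0);
    if (--s->inuse == 0 && s->abandoned) {
        free(s->filter);
        s->filter = NULL;
        freed = 1;
    }
    pthread_mutex_unlock(&s->mutex);
    return freed;
}

/* Abandon path: free immediately only if nobody holds a pin.
 * Returns 1 if the filter was freed by this call. */
int session_abandon(session *s)
{
    int freed = 0;
    pthread_mutex_lock(&s->mutex);
    s->abandoned = 1;
    if (s->inuse == 0 && s->filter != NULL) {
        free(s->filter);
        s->filter = NULL;
        freed = 1;
    }
    pthread_mutex_unlock(&s->mutex);
    return freed;
}
```

With this shape, a worker calls session_pin() before using the filter outside the lock and session_unpin() afterwards; a concurrent session_abandon() only marks the session and leaves the actual free to the last unpin.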

Comment 4 Leonid Yuriev 2014-10-17 01:28:04 UTC
I cherry-picked the fix for ITS#7967 and other fixes from master: no
change in behavior, still a reproducible SIGSEGV.

Comment 5 Howard Chu 2014-10-17 08:14:06 UTC
leo@yuriev.ru wrote:
> I cherry-picked the fix for ITS#7967 and other fixes from master: no
> change in behavior, still a reproducible SIGSEGV.

OK. We've had abandon cleanup issues in the area of code you highlighted 
before; just need a simple method to reproduce the error.


Comment 6 Leonid Yuriev 2014-11-29 13:29:58 UTC
Setup: current OPENLDAP_REL_ENG_2_4 (6b26910 Silence compiler warning...) with
current mdb.master merged in (9a72292 ITS#7961,#7987 Re-fix txn init)
- cluster of 4 nodes, all on a single machine.
- loopback network only, no failures.
- multi-master by configuration, but all writes go to the first node only.

A test case (config + script) will be available shortly.

Core was generated by `/opt/openldap.devel/libexec/slapd -l LOCAL5 -d 0 -s 0 -4 -h ldap://10.4.0.1:1114'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007ff68564d07b in ?? () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007ff68564d07b in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x000000000047fb3c in do_syncrep2 (op=0x7ff62f7fd740, si=0x1c74190) at syncrepl.c:934
#2  0x00000000004838c3 in do_syncrepl (ctx=<optimised out>, arg=0x1c746d0) at syncrepl.c:1539
#3  0x00000000004250a8 in connection_read_thread (ctx=0x7ff62f7fdbd0, argv=0x35) at connection.c:1293
#4  0x00007ff685ceecf2 in ldap_int_thread_pool_wrapper (xpool=0x1c27090) at tpool.c:688
#5  0x00007ff6858b90a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#6  0x00007ff6855e684d in clone () from /lib/x86_64-linux-gnu/libc.so.6

> 				if ( !BER_BVISNULL( &syncCookie.octet_str ) )
> 				{
> 					slap_parse_sync_cookie( &syncCookie, NULL );
> 					if ( syncCookie.ctxcsn ) {
> 						int i, sid = slap_parse_csn_sid( syncCookie.ctxcsn );
> 						check_syncprov( op, si );
> 						for ( i =0; i<si->si_cookieState->cs_num; i++ ) {
> 							/* new SID */
> 							if ( sid < si->si_cookieState->cs_sids[i] )
> 								break;
> 							if ( si->si_cookieState->cs_sids[i] == sid ) {
syncrepl.c:934 > 								if ( ber_bvcmp( syncCookie.ctxcsn, &si->si_cookieState->cs_vals[i] ) <= 0 ) {
> 									bdn.bv_val[bdn.bv_len] = '\0';
> 									Debug( LDAP_DEBUG_SYNC, "do_syncrep2: %s CSN too old, ignoring %s (%s)\n",
> 										si->si_ridtxt, syncCookie.ctxcsn->bv_val, bdn.bv_val );
> 									ldap_controls_free( rctrls );
> 									rc = 0;
> 									si->si_too_old = 1;
> 									goto done;
> 								}
> 								si->si_too_old = 0;
> 								break;
> 							}
> 						}


Comment 7 Leonid Yuriev 2014-11-29 14:41:24 UTC
A simple test case is attached.
All activity (add/delete/read) goes through the first node of the 4-node cluster.
Unfortunately it may take a long time to reproduce the bug (core dump):
anywhere from 10 minutes to 2-3 hours.

Leonid.
Comment 8 Leonid Yuriev 2014-11-29 15:21:51 UTC
Program terminated with signal 11, Segmentation fault.
#0  0x000000000047fb64 in do_syncrep2 (op=0x7f8c494e7740, si=0x19967e0) at syncrepl.c:892
892                    bdn.bv_val[bdn.bv_len] = '\0';
(gdb) bt
#0  0x000000000047fb64 in do_syncrep2 (op=0x7f8c494e7740, si=0x19967e0) at syncrepl.c:892
#1  0x0000000000483903 in do_syncrepl (ctx=<optimised out>, arg=0x1996410) at syncrepl.c:1551
#2  0x00000000004250e8 in connection_input (cri=<optimised out>, conn=<optimised out>) at connection.c:1732
#3  connection_read (cri=<optimised out>, s=<optimised out>) at connection.c:1460
#4  connection_read_thread (ctx=0x7f8c494e7bd0, argv=0x21) at connection.c:1284
#5  0x00007f8c6bb5bd22 in ldap_int_thread_pool_wrapper (xpool=0x194a090) at tpool.c:688
#6  0x00007f8c6b7260a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#7  0x00007f8c6b45384d in clone () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) info locals
syncUUID = {{bv_len = 16, bv_val = 0x7f8c40106ad7 
"\023Z,P\f!\020\064\223e\177\031\357Ζ:"}, {bv_len = 0, bv_val = 0x0}}
cookie = {bv_len = 60, bv_val = 0x7f8c40106ae9 
"rid=001,sid=001,csn=20141129143806.208485Z#000000#001#000000"}
rctrls = 0x7f8c40104ad0
bdn = {bv_len = 34, bv_val = 0x7f8c40105209 
"cn=tablet,uid=1756,dc=ngdr,dc=ldap"}
si_tag = 1
syncstate = 1
retdata = 0x19eb878
retoid = 0x0
syncUUIDs = 0x0
len = 60
berbuf = {
   buffer = "\002\000\001", '\000' <repeats 29 times>, 
"\320j\020@\214\177\000\000%k\020@\214\177\000\000%k\020@\214\177", 
'\000' <repeats 34 times>, 
"i\315rk\214\177\000\000\000\000\000\000\000\000\000\000xQ\224k\214\177\000\000\340b\231\001", 
'\000' <repeats 36 times>, 
"\a\000\000\000\000\000\000\000\020c\231\001\000\000\000\000\a\000\000\000\000\000\000\000@b\231\001\000\000\000\000\006\000\000\000\000\000\000\000\300b\231\001\000\000\000\000"..., 
ialign = 65538, lalign = 65538, falign = 9.18382988e-41, dalign = 
3.2380074297143616e-319, palign = 0x10002 <Address 0x10002 out of bounds>}
msg = 0x7f8c40102be0
syncCookie = {ctxcsn = 0x7f8c401038f0, sids = 0x7f8c40105890, numcsns = 
1, rid = 1, octet_str = {bv_len = 60, bv_val = 0x7f8c40104e10 
"rid=001,sid=001,csn=20141129143806.208485Z#000000#001#000000"},
   sid = 1, sc_next = {stqe_next = 0x0}}
syncCookie_req = {ctxcsn = 0x7f8c40105e80, sids = 0x7f8c40105020, 
numcsns = 1, rid = 1, octet_str = {bv_len = 60, bv_val = 0x7f8c40105760 
"rid=001,sid=001,csn=20141129143806.208232Z#000000#001#000000"},
   sid = 1, sc_next = {stqe_next = 0x0}}
rc = 100
err = 0
modlist = 0x7f8c40106a70
m = 32652
tout = {tv_sec = 0, tv_usec = 0}
refreshDeletes = 0
empty = "empty"
(gdb)
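
Editorial sketch: the cookie strings in this dump have the layout rid=...,sid=...,csn=..., and each CSN carries its server ID as the third '#'-separated field, written in hexadecimal (for example #001# above). The code excerpt in Comment 6 obtains that value through slap_parse_csn_sid(). As a minimal standalone illustration of the idea, and not a copy of that function, a hypothetical helper csn_sid_sketch() could look like this:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not slapd code): return the server ID encoded in
 * a CSN of the form  timestamp#count#sid#mod , or -1 if the string does
 * not have that shape. The sid field is parsed as hexadecimal. */
int csn_sid_sketch(const char *csn)
{
    const char *p, *end;
    char *stop;
    long sid;

    assert(csn != NULL);

    p = strchr(csn, '#');          /* end of the timestamp field */
    if (p == NULL)
        return -1;
    p = strchr(p + 1, '#');        /* end of the change-count field */
    if (p == NULL)
        return -1;
    p++;                           /* first character of the sid field */
    end = strchr(p, '#');          /* end of the sid field */
    if (end == NULL || end == p)
        return -1;

    sid = strtol(p, &stop, 16);
    if (stop != end || sid < 0 || sid > 0xfff)
        return -1;                 /* trailing junk or out of range */
    return (int)sid;
}
```

Applied to the CSN in the cookie above, 20141129143806.208485Z#000000#001#000000, the helper yields server ID 1.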


Comment 9 Leonid Yuriev 2014-11-29 15:44:02 UTC
Program terminated with signal 11, Segmentation fault.
#0  0x00007ffda90d607b in ?? () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007ffda90d607b in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x000000000047fb7c in do_syncrep2 (op=0x7ffd68cf5740, si=0x1681d00) at syncrepl.c:893
#2  0x0000000000483903 in do_syncrepl (ctx=<optimised out>, arg=0x1682090) at syncrepl.c:1551
#3  0x00000000004250e8 in connection_input (cri=<optimised out>, conn=<optimised out>) at connection.c:1732
#4  connection_read (cri=<optimised out>, s=<optimised out>) at connection.c:1460
#5  connection_read_thread (ctx=0x7ffd68cf5bd0, argv=0x25) at connection.c:1284
#6  0x00007ffda9777d22 in ldap_int_thread_pool_wrapper (xpool=0x1635090) at tpool.c:688
#7  0x00007ffda93420a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#8  0x00007ffda906f84d in clone () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) frame 1
#1  0x000000000047fb7c in do_syncrep2 (op=0x7ffd68cf5740, si=0x1681d00) at syncrepl.c:893
893                    Debug( LDAP_DEBUG_ANY, "do_syncrep2: %s malformed message (%s)\n",
(gdb) info locals
syncUUID = {{bv_len = 16, bv_val = 0x7ffd543a98b7 
"\177]\201N\f'\020\064\235\255Չ\v\247\331", <incomplete sequence \351>}, 
{bv_len = 128849018894, bv_val = 0x7ffd68cf4ec4 "\375\177"}}
cookie = {bv_len = 60, bv_val = 0x7ffd543a98c9 
"rid=002,sid=002,csn=20141129152404.971057Z#000000#001#000000"}
rctrls = 0x7ffd543f7130
bdn = {bv_len = 33, bv_val = 0x7ffd543706f7 
"cn=modem,uid=4711,dc=ngdr,dc=ldap"}
si_tag = 1
syncstate = 3
retdata = 0x7ffd68cf57c0
retoid = 0xb <Address 0xb out of bounds>
syncUUIDs = 0x7ffd68cf57c0
len = 60
berbuf = {
   buffer = "\002\000\001", '\000' <repeats 29 times>, 
"\260\230:T\375\177\000\000\005\231:T\375\177\000\000\005\231:T\375\177", '\000' 
<repeats 34 times>, 
"i\215\064\251\375\177\000\000\000\000\000\000\000\000\000\000x\021V\251\375\177\000\000\340\022h\001", 
'\000' <repeats 36 times>, "\060S\317h\375\177\000\000Q\020", '\000' 
<repeats 14 times>, "X", '\000' <repeats 15 times>, 
"\300U\317h\375\177\000\000"..., ialign = 65538, lalign = 65538, falign 
= 9.18382988e-41, dalign = 3.2380074297143616e-319, palign = 0x10002 
<Address 0x10002 out of bounds>}
msg = 0x7ffd543f3c50
syncCookie = {ctxcsn = 0x7ffd543f3f10, sids = 0x7ffd542eb890, numcsns = 
1, rid = 2, octet_str = {bv_len = 60, bv_val = 0x7ffd5440f6d0 
"rid=002,sid=002,csn=20141129152404.971057Z#000000#001#000000"},
   sid = 2, sc_next = {stqe_next = 0x0}}
syncCookie_req = {ctxcsn = 0x7ffd5440f380, sids = 0x7ffd542ebae0, 
numcsns = 5, rid = 2, octet_str = {bv_len = 224,
     bv_val = 0x7ffd543714e0 
"rid=002,sid=004,csn=20141129152404.970764Z#000000#001#000000;20141129151341.491595Z#000000#002#000000;20141129151341.507685Z#000000#003#000000;20141129151341.523508Z#000000#004#000000;20141129151341.5"...}, 
sid = 4, sc_next = {stqe_next = 0x0}}
rc = 100
err = 0
modlist = 0x0
m = 32765
tout = {tv_sec = 0, tv_usec = 0}
refreshDeletes = 0
empty = "empty"
(gdb) p *si
$1 = {si_next = 0x1682200, si_be = 0x1680940, si_wbe = 0x1680940, si_re 
= 0x1682090, si_rid = 2, si_ridtxt = "rid=002", si_bindconf = {sb_uri = 
{bv_len = 22, bv_val = 0x16816b0 "ldap://10.2.0.1:11113/"},
     sb_version = 3, sb_tls = 0, sb_method = 128, sb_timeout_api = 10, 
sb_timeout_net = 0, sb_binddn = {bv_len = 19, bv_val = 0x1681690 
"uid=replica,dc=ldap"}, sb_cred = {bv_len = 3,
       bv_val = 0x1682110 "xyz"}, sb_saslmech = {bv_len = 0, bv_val = 
0x0}, sb_secprops = 0x0, sb_realm = {bv_len = 0, bv_val = 0x0}, 
sb_authcId = {bv_len = 0, bv_val = 0x0}, sb_authzId = {bv_len = 0,
       bv_val = 0x0}, sb_keepalive = {sk_idle = 1, sk_probes = 1, 
sk_interval = 1}, sb_tls_ctx = 0x0, sb_tls_cert = 0x0, sb_tls_key = 0x0, 
sb_tls_cacert = 0x0, sb_tls_cacertdir = 0x0,
     sb_tls_reqcert = 0x0, sb_tls_cipher_suite = 0x0, 
sb_tls_protocol_min = 0x0, sb_tls_crlcheck = 0x0, sb_tls_do_init = 0}, 
si_base = {bv_len = 15, bv_val = 0x1682030 "dc=ngdr,dc=ldap"}, 
si_logbase = {
     bv_len = 0, bv_val = 0x0}, si_filterstr = {bv_len = 15, bv_val = 
0x1681200 "(objectclass=*)"}, si_filter = 0x1682010, si_logfilterstr = 
{bv_len = 0, bv_val = 0x0}, si_contextdn = {bv_len = 7,
     bv_val = 0x16811e0 "dc=ldap"}, si_scope = 2, si_attrsonly = 0, 
si_anfile = 0x0, si_anlist = 0x1682130, si_exanlist = 0x1681fc0, 
si_attrs = 0x169f0f0, si_exattrs = 0x0, si_allattrs = 1,
   si_allopattrs = 1, si_schemachecking = 0, si_type = 3, si_ctype = 3, 
si_interval = 60, si_retryinterval = 0x1681ff0, si_retrynum_init = 
0x1682070, si_retrynum = 0x1682050, si_syncCookie = {
     ctxcsn = 0x7ffd7c1d87d0, sids = 0x7ffd54411150, numcsns = 5, rid = 
2, octet_str = {bv_len = 224,
       bv_val = 0x7ffd543f57f0 
"rid=002,sid=004,csn=20141129152404.970905Z#000000#001#000000;20141129151341.491595Z#000000#002#000000;20141129151341.507685Z#000000#003#000000;20141129151341.523508Z#000000#004#000000;20141129151341.5"...}, 
sid = 4, sc_next = {stqe_next = 0x0}}, si_cookieState = 0x1681c70, 
si_cookieAge = 1539004, si_manageDSAit = 0, si_slimit = 0, si_tlimit = 
0, si_refreshDelete = 0,
   si_refreshPresent = 1, si_refreshDone = 1, si_syncdata = 0, 
si_logstate = 0, si_got = 269715, si_strict_refresh = 0, si_too_old = 0, 
si_msgid = 2, si_presentlist = 0x7ffd7c58e050,
   si_ld = 0x7ffd7c718210, si_conn = 0x7ffda9a365d0, si_nonpresentlist = 
{lh_first = 0x0}, si_rewrite = 0x0, si_suffixm = {bv_len = 0, bv_val = 
0x0}, si_mutex = {__data = {__lock = 1, __count = 0,
       __owner = 13107, __nusers = 1, __kind = 0, __spins = 0, __elision 
= 0, __list = {__prev = 0x0, __next = 0x0}},
     __size = "\001\000\000\000\000\000\000\000\063\063\000\000\001", 
'\000' <repeats 26 times>, __align = 1}}
(gdb) p si->si_ridtxt
$2 = "rid=002"
(gdb) p (void*)si->si_ridtxt
$3 = (void *) 0x1681d24


Comment 10 Leonid Yuriev 2014-11-29 19:04:58 UTC
Attached is some of Valgrind's DRD output. It seems to show enough to explain
crashes around connect/disconnect and while walking the overlays.
Comment 11 Leonid Yuriev 2014-12-01 23:13:21 UTC
Partially fixed.
The patch is against current OPENLDAP_REL_ENG_2_4, but it also applies to master.

Leonid.

--

The attached files is derived from OpenLDAP Software. All of the modifications
to OpenLDAP Software represented in the following patch(es) were developed by
Peter-Service LLC, Moscow, Russia. Peter-Service LLC has not assigned rights
and/or interest in this work to any party. I, Leonid Yuriev am authorized by
Peter-Service LLC, my employer, to release this work under the following
terms.

Peter-Service LLC hereby places the following modifications to OpenLDAP Software
(and only these modifications) into the public domain. Hence, these
modifications may be freely used and/or redistributed for any purpose with or
without attribution and/or other notice.



Comment 12 Howard Chu 2014-12-02 03:09:41 UTC
Leonid Yuriev wrote:
> Partially fixed.
> The patch is against current OPENLDAP_REL_ENG_2_4, but it also applies to master.

Thanks, the patch makes sense. But if only partial, what else is still 
crashing?


Comment 13 Leonid Yuriev 2014-12-02 07:35:29 UTC
2014-12-02 6:09 GMT+03:00 Howard Chu <hyc@symas.com>:
> Leonid Yuriev wrote:
>>
>> Partially fixed.
>> Patch is for current OPENLDAP_REL_ENG_2_4, but applicable for master.
>
>
> Thanks, the patch makes sense. But if only partial, what else is still
> crashing?

A reproducible crash about 5 seconds after the 4-node cluster recovers from the
split-brain condition:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:116
116 ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: No such file or directory.
(gdb) bt
#0  __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:116
#1  0x00000000004aee11 in memcpy (__len=<optimised out>, __src=<optimised out>, __dest=<optimised out>) at /usr/include/x86_64-linux-gnu/bits/string3.h:51
#2  mdb_search (op=0x7f16b4ffa7c0, rs=0x7f16b4e697b0) at search.c:993
#3  0x000000000048a52a in overlay_op_walk (op=op@entry=0x7f16b4ffa7c0, rs=0x7f16b4ff9c40, which=op_search, oi=0xc31a30, on=<optimised out>) at backover.c:676
#4  0x000000000048a681 in over_op_func (op=0x7f16b4ffa7c0, rs=<optimised out>, which=<optimised out>) at backover.c:729
#5  0x000000000047de37 in syncrepl_del_nonpresent (op=0x7f16b4ffa7c0, si=0xc31490, uuids=<optimised out>, m=3, sc=<optimised out>, sc=<optimised out>) at syncrepl.c:3400
#6  0x0000000000481a0e in do_syncrep2 (op=0x7f16b4ffa7c0, si=0xc31490) at syncrepl.c:1346
#7  0x00000000004839e3 in do_syncrepl (ctx=<optimised out>, arg=0xc319d0) at syncrepl.c:1550
#8  0x00007f170283ecf2 in ldap_int_thread_pool_wrapper (xpool=0xbe4090) at tpool.c:688

Comment 14 Leonid Yuriev 2014-12-02 10:50:36 UTC
> 2014-12-02 6:09 GMT+03:00 Howard Chu <hyc@symas.com>:
>> Leonid Yuriev wrote:
>>> Partially fixed.
>>> Patch is for current OPENLDAP_REL_ENG_2_4, but applicable for master.
>>
>> Thanks, the patch makes sense. But if only partial, what else is still
>> crashing?
> A reproducible crash about 5 seconds after the 4-node cluster recovers from the
> split-brain condition:
>
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  __memcpy_sse2_unaligned () at
> ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:116
> 116 ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: No such file or directory.
> (gdb) bt
> #0  __memcpy_sse2_unaligned () at
> ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:116
> #1  0x00000000004aee11 in memcpy (__len=<optimised out>,
> __src=<optimised out>, __dest=<optimised out>) at
> /usr/include/x86_64-linux-gnu/bits/string3.h:51
> #2  mdb_search (op=0x7f16b4ffa7c0, rs=0x7f16b4e697b0) at search.c:993
It now seems to be fixed completely.
The last SIGSEGV was caused by the "dreamcatcher" feature (ITS#7974).
I have already fixed "dreamcatcher" and we are currently re-testing it.

Leonid.

Comment 15 Howard Chu 2014-12-05 19:45:38 UTC
changed notes
changed state Open to Test
moved from Incoming to Software Bugs
Comment 16 Leonid Yuriev 2014-12-10 04:58:30 UTC
I noticed that master is missing some changes from my original patch
(the one attached to this ITS earlier).

Please see the attached diff.
I think there is a race condition around si_cookieState inside the for loop.
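
Editorial sketch: to make the concern concrete, here is a minimal illustration of a lookup loop with the same shape as the excerpt in Comment 6 (scan a sorted SID array, then compare the incoming CSN against the stored one), run entirely under a mutex. It does not show which mutex the real code holds at that point, which is the open question in this exchange. The names cookie_state_sketch, csn_cmp and csn_too_old are hypothetical, plain pthreads replace the ldap_pvt_thread wrappers, and the comparison (length first, then bytes) is an assumption about how the stored values are ordered.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define MAX_SIDS 8

/* Hypothetical stand-in for the shared cookie state (not the slapd type). */
typedef struct cookie_state_sketch {
    pthread_mutex_t mutex;
    int num;                     /* number of valid entries */
    int sids[MAX_SIDS];          /* server IDs, sorted ascending */
    const char *vals[MAX_SIDS];  /* newest known CSN for sids[i] */
} cookie_state_sketch;

/* Assumed ordering: shorter strings first, then byte-wise comparison. */
static int csn_cmp(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    if (la != lb)
        return la < lb ? -1 : 1;
    return memcmp(a, b, la);
}

/* Returns 1 if `csn` is not newer than the value recorded for `sid`,
 * 0 otherwise (including the case of a SID not seen before).
 * The whole scan runs with the mutex held, so a concurrent writer
 * cannot resize or replace the arrays in the middle of the loop. */
int csn_too_old(cookie_state_sketch *cs, int sid, const char *csn)
{
    int i, too_old = 0;

    pthread_mutex_lock(&cs->mutex);
    for (i = 0; i < cs->num; i++) {
        if (sid < cs->sids[i])
            break;               /* new SID: nothing to compare against */
        if (cs->sids[i] == sid) {
            too_old = csn_cmp(csn, cs->vals[i]) <= 0;
            break;
        }
    }
    pthread_mutex_unlock(&cs->mutex);
    return too_old;
}
```

Returning a plain flag keeps the critical section short: logging and cleanup can happen after the mutex has been released.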
Comment 17 Howard Chu 2014-12-10 07:59:45 UTC
Leonid Yuriev wrote:
> I noticed that master is missing some changes from my original patch
> (the one attached to this ITS earlier).
>
> Please see the attached diff.
> I think there is a race condition around si_cookieState inside the for loop.

Prove it. The fields being accessed inside the loop are for the pending 
values, and the pending mutex has already been acquired outside the loop.


Comment 18 Quanah Gibson-Mount 2014-12-11 00:23:16 UTC
changed notes
changed state Test to Release
Comment 19 OpenLDAP project 2015-07-02 17:45:28 UTC
fixed in master
fixed in RE25
fixed in RE24
Comment 20 Quanah Gibson-Mount 2015-07-02 17:45:28 UTC
changed notes
changed state Release to Closed