
(ITS#8354) syncprov segv



Full_Name: Tom Pressnell
Version: 2.4.43
OS: Debian 8
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (149.254.186.170)


Hi,

I have been testing 2.4.43+ITS#8336 as a candidate for production usage.
Compiled from source on Debian 8 (jessie) x86_64.

I have been experiencing segmentation faults in syncprov_matchops:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000562713 in syncprov_matchops (op=0x7ef1e8100c90,
opc=0x7ef1b8000cf0, saveit=1)
    at syncprov.c:1332
1332					op2.ors_filter = ss->s_op->ors_filter->f_and->f_next;

I have been running replication tests, pushing relatively high rates of add/mod
operations at an mdb master (NOSYNC) while a number of replication clients
connect and disconnect (simulating a lossy/faulty network by killing the
replication client scripts and using tcpkill).

Looking at ss:
$6 = {s_next = 0x7ef1b389f350, s_si = 0xe51cc0, s_base = {bv_len = 13,
    bv_val = 0x7ef172fb72a0 "dc=xyz,dc=com"}, s_eid = 1, s_op = 0x7ef1b4000aa0,
s_rid = 0, s_sid = 0,
  s_filterstr = {bv_len = 15, bv_val = 0x7ef1b4001248 "(objectClass=*)"},
s_flags = 17, s_inuse = 1,
  s_res = 0x7ef172fa71f0, s_restail = 0x7ef172f54450, s_mutex = {__data =
{__lock = 1, __count = 0,
      __owner = 1006, __nusers = 1, __kind = 0, __spins = 0, __elision = 0,
__list = {__prev = 0x0,
        __next = 0x0}},
    __size = "\001\000\000\000\000\000\000\000\356\003\000\000\001", '\000'
<repeats 26 times>, __align = 1}}

And at s_op->ors_filter:

(gdb) p *ss->s_op->o_request->oq_search->rs_filter
$2 = {f_choice = 161, f_un = {f_un_result = -1275063480, f_un_desc =
0x7ef1b4001348,
    f_un_ava = 0x7ef1b4001348, f_un_ssa = 0x7ef1b4001348, f_un_mra =
0x7ef1b4001348,
    f_un_complex = 0x7ef1b4001348}, f_next = 0x0}
(gdb) p ss->s_op->o_request->oq_search->rs_filterstr
$3 = {bv_len = 23, bv_val = 0x7ef1b40013e0 "(|(cn=4594)(cn=4594:1))"}

This is not the filter used by my syncrepl clients during this test (they all
run with objectClass=* as shown in ss->s_filterstr); it is one of the filters
used by the add/mod script.
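
As a cross-check of the gdb output above, the f_choice value can be decoded by hand. The sketch below is only an illustration: the DEMO_* names and demo_filter_kind() are hypothetical, and the tag values are the ones I understand ldap.h to define for LDAP_FILTER_AND (0xa0) and LDAP_FILTER_OR (0xa1). Line 1332 reads f_and, which suggests it expects an AND node at the top of the filter, whereas 161 decodes as OR, matching the "(|...)" filter string.

```c
/* Hypothetical helper, not part of slapd: classify a filter choice tag. */
#define DEMO_FILTER_AND 0xa0UL  /* assumed to match LDAP_FILTER_AND in ldap.h */
#define DEMO_FILTER_OR  0xa1UL  /* assumed to match LDAP_FILTER_OR in ldap.h */

enum demo_kind { DEMO_KIND_AND, DEMO_KIND_OR, DEMO_KIND_OTHER };

static enum demo_kind demo_filter_kind(unsigned long f_choice)
{
    if (f_choice == DEMO_FILTER_AND)
        return DEMO_KIND_AND;
    if (f_choice == DEMO_FILTER_OR)
        return DEMO_KIND_OR;
    return DEMO_KIND_OTHER;
}
```

Calling demo_filter_kind(161) classifies the value printed by gdb as an OR node.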

Looking at another thread (cutting down output):
[Switching to thread 2 (Thread 0x7ef1bffff700 (LWP 2479))]
#0  0x00000000004eeb59 in mdb_node_search (mc=0x7ef172ee63f0,
key=0x7ef1bfe6d3c0, exactp=0x7ef1bfe6d03c)
(gdb) bt
#5  0x0000000000553326 in mdb_id2entry (op=0x7ef1b4000aa0, mc=0x7ef172ee63f0,
id=26, e=0x7ef1bfe7d678)
    at id2entry.c:153

This thread is working with the same operation (...0aa0), but is performing a
standard search, as I would expect given the filter value.
Somehow ss->s_op has ended up pointing at what seems to be an unrelated
operation.

Looking at the code, I believe the issue could be triggered when an op is
abandoned early, before syncprov_op_search has acquired the si_ops lock for the
psearch sop.
I have added a standard o_abandon check and return at line 2574 of syncprov.c,
while the si_ops lock is held and before sop is added to the list.
This seems to have fixed the issue in my testing; I can see (as I am logging
it) that this code path has been traversed a number of times over the last few
days of running the tests.
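
To illustrate the shape of the check described above, here is a minimal, self-contained sketch. It is not the actual change to syncprov.c: every demo_* type and function is a hypothetical stand-in (demo_op.abandoned for o_abandon, demo_provider.ops_mutex for the si_ops lock, demo_syncop for the psearch sop). It also assumes that whoever sets the abandoned flag does so in a way that is visible to the thread holding the list lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for the slapd structures; names are illustrative. */
typedef struct demo_op {
    int abandoned;                /* plays the role of o_abandon */
} demo_op;

typedef struct demo_syncop {
    struct demo_syncop *next;
    demo_op *op;                  /* plays the role of s_op */
} demo_syncop;

typedef struct demo_provider {
    pthread_mutex_t ops_mutex;    /* plays the role of the si_ops lock */
    demo_syncop *ops;             /* list of registered persistent searches */
} demo_provider;

static void demo_provider_init(demo_provider *p)
{
    pthread_mutex_init(&p->ops_mutex, NULL);
    p->ops = NULL;
}

/* Register a persistent search unless its operation was already abandoned.
 * The flag is tested while the list lock is held, so an abandoned op is
 * never linked into the list. Returns 1 if added, 0 if skipped. */
static int demo_register(demo_provider *p, demo_syncop *so)
{
    int added = 0;

    assert(p != NULL && so != NULL && so->op != NULL);

    pthread_mutex_lock(&p->ops_mutex);
    if (!so->op->abandoned) {
        so->next = p->ops;
        p->ops = so;
        added = 1;
    }
    pthread_mutex_unlock(&p->ops_mutex);
    return added;
}

/* Count list entries under the lock (used only to inspect the result). */
static int demo_count(demo_provider *p)
{
    int n = 0;
    demo_syncop *so;

    pthread_mutex_lock(&p->ops_mutex);
    for (so = p->ops; so != NULL; so = so->next)
        n++;
    pthread_mutex_unlock(&p->ops_mutex);
    return n;
}
```

In this sketch, a demo_syncop whose op has abandoned set to 1 is rejected by demo_register() and never becomes reachable through the list, so later code walking the list cannot follow a pointer to an operation that has since been freed or reused.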

I can provide more detailed backtraces if required.
If you would like core dumps, this will require extra time, as I would have to
replicate the test with non-company data / schemas.

Thanks

Tom