Issue 8354 - syncprov segv
Summary: syncprov segv
Status: VERIFIED FIXED
Alias: None
Product: OpenLDAP
Classification: Unclassified
Component: slapd (show other issues)
Version: 2.4.43
Hardware: All All
: --- normal
Target Milestone: ---
Assignee: OpenLDAP project
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-01-21 18:23 UTC by tpretz@gmail.com
Modified: 2016-02-11 00:51 UTC (History)
0 users

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this issue.
Description tpretz@gmail.com 2016-01-21 18:23:25 UTC
Full_Name: Tom Pressnell
Version: 2.4.43
OS: Debian 8
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (149.254.186.170)


Hi,

I have been testing 2.4.43+ITS#8336 as a candidate for production usage.
Compiled from source on Debian 8 (jessie) x86_64.

I have been experiencing segmentation faults in syncprov_matchops:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000562713 in syncprov_matchops (op=0x7ef1e8100c90,
opc=0x7ef1b8000cf0, saveit=1)
    at syncprov.c:1332
1332					op2.ors_filter = ss->s_op->ors_filter->f_and->f_next;

I have been running replication testing, pushing relativly high rates of add/mod
operations at a mdb master (NOSYNC) whilst a number of replication clients are
connecting and disconnecting (simulating a lossy/faulty network) (killing of
replication client scripts / tcpkill).

Looking at ss:
$6 = {s_next = 0x7ef1b389f350, s_si = 0xe51cc0, s_base = {bv_len = 13,
    bv_val = 0x7ef172fb72a0 "dc=xyz,dc=com"}, s_eid = 1, s_op = 0x7ef1b4000aa0,
s_rid = 0, s_sid = 0,
  s_filterstr = {bv_len = 15, bv_val = 0x7ef1b4001248 "(objectClass=*)"},
s_flag= D 17, s_inuse = 1,
  s_res = 0x7ef172fa71f0, s_restail = 0x7ef172f54450, s_mutex = {__data =
{__lock = 1, __count = 0,
      __owner = 1006, __nusers = 1, __kind = 0, __spins = 0, __elision = 0,
__list = {__prev = 0x0,
        __next = 0x0}}C%C
    __size = "\001\000\000\000\000\000\000\000\356\003\000\000\001", '\000'
<repeats 26 times>, __align = 1}}

And at s_op->ors_filter:

(gdb) p *ss->s_op->o_request->oq_search->rs_filter
$2 = {f_choice = 161, f_un = {f_un_result = -1275063480, f_un_desc =
0x7ef1b4001348,
    f_un_ava = 0x7ef1b4001348, f_un_ssa = 0x7ef1b4001348, f_un_mra =
0x7ef1b4001348,
    f_un_complex = 0x7ef1b4001348}, f_next = 0x0}
(gdb) p ss->s_op->o_request->oq_search->rs_filterstr
$3 = {bv_len = 23, bv_val = 0x7ef1b40013e0 "(|(cn=4594)(cn=4594:1))"}

This is not the filter used by my syncrepl clients during this test (they all
run with objectClass=* as show in ss->s_filterstr), this is one of the filters
used by the add/mod script.

Looking at another thread (cutting down output):
[Switching to thread 2 (Thread 0x7ef1bffff700 (LWP 2479))]
#0  0x00000000004eeb59 in mdb_node_search (mc=0x7ef172ee63f0,
key=0x7ef1bfe6d3c0, exactp=0x7ef1bfe6d03c)
(gdb) bt
#5  0x0000000000553326 in mdb_id2entry (op=0x7ef1b4000aa0, mc=0x7ef172ee63f0,
id=26, e%3x7x7ef1bfe7d678)
    at id2entry.c:153

This thread is working with the same operation ...0aa0, but performing a
standard search as i would expect given the filter value.
Somehow ss->s_op seems to have ended up pointing at what seems to be an
unreleated operation.

Looking at the code i believe the issue could trigger when an op is abandoned
early before syncprov_op_search has got hold of the si_ops lock for the psearch
sop.
I have added a standard o_abandon check and return at line 2574 of syncprov.c
while the si_ops lock is held, before sop is added to the list.
This seems to have fixed the issue in my testing, i can see this code path is
traversed (as i am logging it) a number of times over the last few days of
running the tests.

I can provide more detailed backtraces if required.
If you would like core dumps this will require extra time as i would have to
replicate the test with non company data / schemas.

Thanks

Tom
Comment 1 Howard Chu 2016-01-22 17:44:43 UTC
tpretz@gmail.com wrote:

> This thread is working with the same operation ...0aa0, but performing a
> standard search as i would expect given the filter value.
> Somehow ss->s_op seems to have ended up pointing at what seems to be an
> unreleated operation.
>
> Looking at the code i believe the issue could trigger when an op is abandoned
> early before syncprov_op_search has got hold of the si_ops lock for the psearch
> sop.

OK, this sort of makes sense. But abandoning the op doesn't free it 
immediately, the frontend can't do that until all of the request functions 
have returned. And the op structure can't get freed and reused by some other 
op until after that.

> I have added a standard o_abandon check and return at line 2574 of syncprov.c
> while the si_ops lock is held, before sop is added to the list.
> This seems to have fixed the issue in my testing, i can see this code path is
> traversed (as i am logging it) a number of times over the last few days of
> running the tests.

But this can't be an adequate fix, since op->o_abandon could be set at any 
time after line 2574 as well.

> I can provide more detailed backtraces if required.


> If you would like core dumps this will require extra time as i would have to
> replicate the test with non company data / schemas.
>
> Thanks
>
> Tom
>
>


-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Comment 2 tpretz@gmail.com 2016-01-22 18:42:54 UTC
Hi,

Looking beyond syncprov.c:2574 o_abandon can be set, however as i
understand it once the syncop has been added to the si_ops list its
cleanup will be handled by syncprov_op_abandon, we would just need to
ensure it has not traversed syncprov_op_abandon before being added to
the list.

Reading the rest of syncprov_op_search and syncprov_search_response i
notice that syncprov_search_response does check o_abandon (when
deciding if to detach the psearch).

In the event of an abandon syncprov_search_response seems to assume
syncprov_op_abandon will take care of the cleanup of the syncop, which
it wouldn't if syncprov_op_abandon had already been executed before
the syncop was added to the si_ops list @~ 2574

I would then assume the op is freed, but the syncop remains in the
list to be processed. Once reallocated and the o_abandon field is no
longer set syncprov_matchops would start to include the syncop in its
processing hitting potential issues such as this segv.

Let me know if this ties up with how things really work or if i have
misunderstood something along the way.

Thanks

Tom

Comment 3 Howard Chu 2016-01-22 20:43:42 UTC
Thomas Pressnell wrote:
> Hi,
>
> Looking beyond syncprov.c:2574 o_abandon can be set, however as i
> understand it once the syncop has been added to the si_ops list its
> cleanup will be handled by syncprov_op_abandon, we would just need to
> ensure it has not traversed syncprov_op_abandon before being added to
> the list.
>
> Reading the rest of syncprov_op_search and syncprov_search_response i
> notice that syncprov_search_response does check o_abandon (when
> deciding if to detach the psearch).
>
> In the event of an abandon syncprov_search_response seems to assume
> syncprov_op_abandon will take care of the cleanup of the syncop, which
> it wouldn't if syncprov_op_abandon had already been executed before
> the syncop was added to the si_ops list @~ 2574

OK, thanks. A fix is now in git master. Probably about the same as the one you 
already described, but please test and feedback.
>
> I would then assume the op is freed, but the syncop remains in the
> list to be processed. Once reallocated and the o_abandon field is no
> longer set syncprov_matchops would start to include the syncop in its
> processing hitting potential issues such as this segv.
>
> Let me know if this ties up with how things really work or if i have
> misunderstood something along the way.
>
> Thanks
>
> Tom
>


-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Comment 4 Howard Chu 2016-01-22 20:49:07 UTC
changed notes
changed state Open to Test
moved from Incoming to Software Bugs
Comment 5 tpretz@gmail.com 2016-01-23 14:43:08 UTC
Thanks,

Essentially the same patch as i had (destroyed sop->s_mutex in abandon
check rather than init later).

I did place the o_abandon test under the while( si->si_active ) loop
instead of above just to ensure the check is under the same lock hold
as the list modification.

Ill grab whats in git and start tests on monday

Have a good weekend

Tom

Comment 6 Howard Chu 2016-01-23 16:05:01 UTC
Thomas Pressnell wrote:
> Thanks,
>
> Essentially the same patch as i had (destroyed sop->s_mutex in abandon
> check rather than init later).
>
> I did place the o_abandon test under the while( si->si_active ) loop
> instead of above just to ensure the check is under the same lock hold
> as the list modification.

Ah good point. Yeah, I'll move the check down.
>
> Ill grab whats in git and start tests on monday
>
> Have a good weekend
>
> Tom
>


-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Comment 7 Quanah Gibson-Mount 2016-01-26 18:49:40 UTC
--On Saturday, January 23, 2016 2:43 PM +0000 tpretz@gmail.com wrote:

> Thanks,
>
> Essentially the same patch as i had (destroyed sop->s_mutex in abandon
> check rather than init later).
>
> I did place the o_abandon test under the while( si->si_active ) loop
> instead of above just to ensure the check is under the same lock hold
> as the list modification.
>
> Ill grab whats in git and start tests on monday

Hi Tom,

Are you able to confirm if the fix is good?  I'm working on 2.4.44, and 
want to be sure this fix is included.

Thanks!

--Quanah


--

Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra ::  the leader in open source messaging and collaboration

Comment 8 tpretz@gmail.com 2016-01-27 09:33:34 UTC
On Tue, Jan 26, 2016 at 6:49 PM, Quanah Gibson-Mount <quanah@zimbra.com> wrote:
> --On Saturday, January 23, 2016 2:43 PM +0000 tpretz@gmail.com wrote:
>
>> Thanks,
>>
>> Essentially the same patch as i had (destroyed sop->s_mutex in abandon
>> check rather than init later).
>>
>> I did place the o_abandon test under the while( si->si_active ) loop
>> instead of above just to ensure the check is under the same lock hold
>> as the list modification.
>>
>> Ill grab whats in git and start tests on monday
>
>
> Hi Tom,
>
> Are you able to confirm if the fix is good?  I'm working on 2.4.44, and want
> to be sure this fix is included.

I'v had an instance running with this patch for a few days, no SEGV,
looks good to me.

Thanks
>
> Thanks!
>
> --Quanah
>
>
> --
>
> Quanah Gibson-Mount
> Platform Architect
> Zimbra, Inc.
> --------------------
> Zimbra ::  the leader in open source messaging and collaboration

Comment 9 Quanah Gibson-Mount 2016-01-27 18:25:38 UTC
--On Wednesday, January 27, 2016 9:33 AM +0000 Thomas Pressnell 
<tpretz@gmail.com> wrote:


>> Hi Tom,
>>
>> Are you able to confirm if the fix is good?  I'm working on 2.4.44, and
>> want to be sure this fix is included.
>
> I'v had an instance running with this patch for a few days, no SEGV,
> looks good to me.

Great, thanks for confirming. :)

--Quanah



--

Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra ::  the leader in open source messaging and collaboration

Comment 10 Quanah Gibson-Mount 2016-01-29 01:00:51 UTC
changed notes
changed state Test to Release
Comment 11 OpenLDAP project 2016-02-11 00:51:41 UTC
fixed in master
fixed in RE25
fixed in RE24 (2.4.44)
Comment 12 Quanah Gibson-Mount 2016-02-11 00:51:41 UTC
changed notes
changed state Release to Closed