Full_Name: Tom Pressnell Version: 2.4.43 OS: Debian 8 URL: ftp://ftp.openldap.org/incoming/ Submission from: (NULL) (149.254.186.170) Hi, I have been testing 2.4.43+ITS#8336 as a candidate for production usage. Compiled from source on Debian 8 (jessie) x86_64. I have been experiencing segmentation faults in syncprov_matchops: Program terminated with signal SIGSEGV, Segmentation fault. #0 0x0000000000562713 in syncprov_matchops (op=0x7ef1e8100c90, opc=0x7ef1b8000cf0, saveit=1) at syncprov.c:1332 1332 op2.ors_filter = ss->s_op->ors_filter->f_and->f_next; I have been running replication testing, pushing relativly high rates of add/mod operations at a mdb master (NOSYNC) whilst a number of replication clients are connecting and disconnecting (simulating a lossy/faulty network) (killing of replication client scripts / tcpkill). Looking at ss: $6 = {s_next = 0x7ef1b389f350, s_si = 0xe51cc0, s_base = {bv_len = 13, bv_val = 0x7ef172fb72a0 "dc=xyz,dc=com"}, s_eid = 1, s_op = 0x7ef1b4000aa0, s_rid = 0, s_sid = 0, s_filterstr = {bv_len = 15, bv_val = 0x7ef1b4001248 "(objectClass=*)"}, s_flag= D 17, s_inuse = 1, s_res = 0x7ef172fa71f0, s_restail = 0x7ef172f54450, s_mutex = {__data = {__lock = 1, __count = 0, __owner = 1006, __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}C%C __size = "\001\000\000\000\000\000\000\000\356\003\000\000\001", '\000' <repeats 26 times>, __align = 1}} And at s_op->ors_filter: (gdb) p *ss->s_op->o_request->oq_search->rs_filter $2 = {f_choice = 161, f_un = {f_un_result = -1275063480, f_un_desc = 0x7ef1b4001348, f_un_ava = 0x7ef1b4001348, f_un_ssa = 0x7ef1b4001348, f_un_mra = 0x7ef1b4001348, f_un_complex = 0x7ef1b4001348}, f_next = 0x0} (gdb) p ss->s_op->o_request->oq_search->rs_filterstr $3 = {bv_len = 23, bv_val = 0x7ef1b40013e0 "(|(cn=4594)(cn=4594:1))"} This is not the filter used by my syncrepl clients during this test (they all run with objectClass=* as show in ss->s_filterstr), this is one of the filters used by the add/mod script. Looking at another thread (cutting down output): [Switching to thread 2 (Thread 0x7ef1bffff700 (LWP 2479))] #0 0x00000000004eeb59 in mdb_node_search (mc=0x7ef172ee63f0, key=0x7ef1bfe6d3c0, exactp=0x7ef1bfe6d03c) (gdb) bt #5 0x0000000000553326 in mdb_id2entry (op=0x7ef1b4000aa0, mc=0x7ef172ee63f0, id=26, e%3x7x7ef1bfe7d678) at id2entry.c:153 This thread is working with the same operation ...0aa0, but performing a standard search as i would expect given the filter value. Somehow ss->s_op seems to have ended up pointing at what seems to be an unreleated operation. Looking at the code i believe the issue could trigger when an op is abandoned early before syncprov_op_search has got hold of the si_ops lock for the psearch sop. I have added a standard o_abandon check and return at line 2574 of syncprov.c while the si_ops lock is held, before sop is added to the list. This seems to have fixed the issue in my testing, i can see this code path is traversed (as i am logging it) a number of times over the last few days of running the tests. I can provide more detailed backtraces if required. If you would like core dumps this will require extra time as i would have to replicate the test with non company data / schemas. Thanks Tom
tpretz@gmail.com wrote: > This thread is working with the same operation ...0aa0, but performing a > standard search as i would expect given the filter value. > Somehow ss->s_op seems to have ended up pointing at what seems to be an > unreleated operation. > > Looking at the code i believe the issue could trigger when an op is abandoned > early before syncprov_op_search has got hold of the si_ops lock for the psearch > sop. OK, this sort of makes sense. But abandoning the op doesn't free it immediately, the frontend can't do that until all of the request functions have returned. And the op structure can't get freed and reused by some other op until after that. > I have added a standard o_abandon check and return at line 2574 of syncprov.c > while the si_ops lock is held, before sop is added to the list. > This seems to have fixed the issue in my testing, i can see this code path is > traversed (as i am logging it) a number of times over the last few days of > running the tests. But this can't be an adequate fix, since op->o_abandon could be set at any time after line 2574 as well. > I can provide more detailed backtraces if required. > If you would like core dumps this will require extra time as i would have to > replicate the test with non company data / schemas. > > Thanks > > Tom > > -- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/
Hi, Looking beyond syncprov.c:2574 o_abandon can be set, however as i understand it once the syncop has been added to the si_ops list its cleanup will be handled by syncprov_op_abandon, we would just need to ensure it has not traversed syncprov_op_abandon before being added to the list. Reading the rest of syncprov_op_search and syncprov_search_response i notice that syncprov_search_response does check o_abandon (when deciding if to detach the psearch). In the event of an abandon syncprov_search_response seems to assume syncprov_op_abandon will take care of the cleanup of the syncop, which it wouldn't if syncprov_op_abandon had already been executed before the syncop was added to the si_ops list @~ 2574 I would then assume the op is freed, but the syncop remains in the list to be processed. Once reallocated and the o_abandon field is no longer set syncprov_matchops would start to include the syncop in its processing hitting potential issues such as this segv. Let me know if this ties up with how things really work or if i have misunderstood something along the way. Thanks Tom
Thomas Pressnell wrote: > Hi, > > Looking beyond syncprov.c:2574 o_abandon can be set, however as i > understand it once the syncop has been added to the si_ops list its > cleanup will be handled by syncprov_op_abandon, we would just need to > ensure it has not traversed syncprov_op_abandon before being added to > the list. > > Reading the rest of syncprov_op_search and syncprov_search_response i > notice that syncprov_search_response does check o_abandon (when > deciding if to detach the psearch). > > In the event of an abandon syncprov_search_response seems to assume > syncprov_op_abandon will take care of the cleanup of the syncop, which > it wouldn't if syncprov_op_abandon had already been executed before > the syncop was added to the si_ops list @~ 2574 OK, thanks. A fix is now in git master. Probably about the same as the one you already described, but please test and feedback. > > I would then assume the op is freed, but the syncop remains in the > list to be processed. Once reallocated and the o_abandon field is no > longer set syncprov_matchops would start to include the syncop in its > processing hitting potential issues such as this segv. > > Let me know if this ties up with how things really work or if i have > misunderstood something along the way. > > Thanks > > Tom > -- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/
changed notes changed state Open to Test moved from Incoming to Software Bugs
Thanks, Essentially the same patch as i had (destroyed sop->s_mutex in abandon check rather than init later). I did place the o_abandon test under the while( si->si_active ) loop instead of above just to ensure the check is under the same lock hold as the list modification. Ill grab whats in git and start tests on monday Have a good weekend Tom
Thomas Pressnell wrote: > Thanks, > > Essentially the same patch as i had (destroyed sop->s_mutex in abandon > check rather than init later). > > I did place the o_abandon test under the while( si->si_active ) loop > instead of above just to ensure the check is under the same lock hold > as the list modification. Ah good point. Yeah, I'll move the check down. > > Ill grab whats in git and start tests on monday > > Have a good weekend > > Tom > -- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/
--On Saturday, January 23, 2016 2:43 PM +0000 tpretz@gmail.com wrote: > Thanks, > > Essentially the same patch as i had (destroyed sop->s_mutex in abandon > check rather than init later). > > I did place the o_abandon test under the while( si->si_active ) loop > instead of above just to ensure the check is under the same lock hold > as the list modification. > > Ill grab whats in git and start tests on monday Hi Tom, Are you able to confirm if the fix is good? I'm working on 2.4.44, and want to be sure this fix is included. Thanks! --Quanah -- Quanah Gibson-Mount Platform Architect Zimbra, Inc. -------------------- Zimbra :: the leader in open source messaging and collaboration
On Tue, Jan 26, 2016 at 6:49 PM, Quanah Gibson-Mount <quanah@zimbra.com> wrote: > --On Saturday, January 23, 2016 2:43 PM +0000 tpretz@gmail.com wrote: > >> Thanks, >> >> Essentially the same patch as i had (destroyed sop->s_mutex in abandon >> check rather than init later). >> >> I did place the o_abandon test under the while( si->si_active ) loop >> instead of above just to ensure the check is under the same lock hold >> as the list modification. >> >> Ill grab whats in git and start tests on monday > > > Hi Tom, > > Are you able to confirm if the fix is good? I'm working on 2.4.44, and want > to be sure this fix is included. I'v had an instance running with this patch for a few days, no SEGV, looks good to me. Thanks > > Thanks! > > --Quanah > > > -- > > Quanah Gibson-Mount > Platform Architect > Zimbra, Inc. > -------------------- > Zimbra :: the leader in open source messaging and collaboration
--On Wednesday, January 27, 2016 9:33 AM +0000 Thomas Pressnell <tpretz@gmail.com> wrote: >> Hi Tom, >> >> Are you able to confirm if the fix is good? I'm working on 2.4.44, and >> want to be sure this fix is included. > > I'v had an instance running with this patch for a few days, no SEGV, > looks good to me. Great, thanks for confirming. :) --Quanah -- Quanah Gibson-Mount Platform Architect Zimbra, Inc. -------------------- Zimbra :: the leader in open source messaging and collaboration
changed notes changed state Test to Release
fixed in master fixed in RE25 fixed in RE24 (2.4.44)
changed notes changed state Release to Closed