Full_Name: Hallvard B Furuseth Version: HEAD OS: Linux URL: Submission from: (NULL) (129.240.6.233) Submitted by: hallvard slapd/cancel.c sets o_abandon before o_cancel. Thus it's possible for the canceled operation to obey o_abandon before o_cancel gets set. Though I had to insert some sleeps to achieve that. Either the operation is abandoned and the Cancel operation receives tooLate, or if the client unbinds/closes the connection fast enough Cancel will hang: slapd does not close the connection, and hangs on shutdown: "slapd shutdown: waiting for 1 operations/tasks to finish". Since the flags are not mutex-protected (at least not when read), it's not enough to move the o_cancel setting after o_abandon in the Cancel thread. The cancelled thread might still see the o_abandon change first. A fix could be to make o_abandon a bitmask which says whether the abandon is actually a cancel, but the Abandon and Cancel operations will still need a mutex to coordinate so that Abandon does not reset a Cancel bitflag. In any case, it'd be cleaner if an operation which reacts to o_abandon grabs some mutex before checking o_cancel. The problem was tested as follows: - sleep 0.2 sec after Statslog "DEL" and before setting SLAP_CANCEL_REQ. - log "ABANDONED" when send_ldap_response() abandons the operation. - Client: A python socket client which sends raw BER, no libldap: s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) s.connect(('localhost', 3890)) s.send(delete("cn=test")) time.sleep(0.1) s.send(cancel()) # cancel last operation #sys.exit() time.sleep(0.4) s.send(unbind()) --> conn=0 fd=9 ACCEPT from IP=127.0.0.1:56945 (IP=127.0.0.1:3890) conn=0 op=0 DEL dn="cn=test" conn=0 op=1 EXT oid=1.3.6.1.1.8 conn=0 op=1 CANCEL msg=1 conn=0 op=0 ABANDONED conn=0 op=2 UNBIND conn=0 op=1 RESULT oid= err=120 text= conn=0 fd=9 closed <server closed connection, client exited> ^C slapd If the client exits after send(cancel()): conn=0 fd=9 ACCEPT from IP=127.0.0.1:48826 (IP=127.0.0.1:3890) conn=0 op=0 DEL dn="cn=test" conn=0 op=1 EXT oid=1.3.6.1.1.8 conn=0 op=1 CANCEL msg=1 conn=0 op=2 UNBIND conn=0 op=0 ABANDONED <not closing connection> ^C slapd daemon: shutdown requested and initiated. slapd shutdown: waiting for 1 operations/tasks to finish <slapd is hanging> kill -KILL <slapd> slapd.conf: include servers/slapd/schema/core.schema allow update_anon database ldif directory "." suffix "cn=test" Patches to slapd: Index: cancel.c --- cancel.c 21 Jan 2009 23:40:25 -0000 1.30 +++ cancel.c 11 May 2009 04:42:58 -0000 @@ -92,4 +92,8 @@ } + { + struct timeval timeout = { 0, 200000 }; + select(0, NULL, NULL, NULL, &timeout); + } o->o_cancel = SLAP_CANCEL_REQ; Index: delete.c --- delete.c 21 Jan 2009 23:40:26 -0000 1.144 +++ delete.c 11 May 2009 04:42:58 -0000 @@ -75,4 +75,8 @@ op->o_log_prefix, op->o_req_dn.bv_val, 0, 0, 0 ); + { + struct timeval timeout = { 0, 200000 }; + select(0, NULL, NULL, NULL, &timeout); + } if( op->o_req_ndn.bv_len == 0 ) { Debug( LDAP_DEBUG_ANY, "%s do_delete: root dse!\n", Index: result.c --- result.c 11 May 2009 02:23:51 -0000 1.331 +++ result.c 11 May 2009 04:42:58 -0000 @@ -418,4 +418,6 @@ if (( rs->sr_err == SLAPD_ABANDON || op->o_abandon ) && !op->o_cancel ) { rc = SLAPD_ABANDON; + Statslog( LDAP_DEBUG_STATS, + "%s ABANDONED\n", op->o_log_prefix, 0, 0, 0, 0 ); goto clean2; }
h.b.furuseth@usit.uio.no wrote: > Full_Name: Hallvard B Furuseth > Version: HEAD > OS: Linux > URL: > Submission from: (NULL) (129.240.6.233) > Submitted by: hallvard > > > slapd/cancel.c sets o_abandon before o_cancel. Thus it's possible for > the canceled operation to obey o_abandon before o_cancel gets set. Cancel is just best-effort, why not just set o_cancel before o_abandon? > Though I had to insert some sleeps to achieve that. > Either the operation is abandoned and the Cancel operation receives > tooLate, or if the client unbinds/closes the connection fast enough > Cancel will hang: slapd does not close the connection, and hangs on > shutdown: "slapd shutdown: waiting for 1 operations/tasks to finish". > > Since the flags are not mutex-protected (at least not when read), it's > not enough to move the o_cancel setting after o_abandon in the Cancel > thread. The cancelled thread might still see the o_abandon change > first. A fix could be to make o_abandon a bitmask which says whether > the abandon is actually a cancel, but the Abandon and Cancel operations > will still need a mutex to coordinate so that Abandon does not reset > a Cancel bitflag. In any case, it'd be cleaner if an operation which > reacts to o_abandon grabs some mutex before checking o_cancel. > > > The problem was tested as follows: > - sleep 0.2 sec after Statslog "DEL" and before setting SLAP_CANCEL_REQ. > - log "ABANDONED" when send_ldap_response() abandons the operation. > - Client: A python socket client which sends raw BER, no libldap: > s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) > s.connect(('localhost', 3890)) > s.send(delete("cn=test")) > time.sleep(0.1) > s.send(cancel()) # cancel last operation > #sys.exit() > time.sleep(0.4) > s.send(unbind()) > --> > conn=0 fd=9 ACCEPT from IP=127.0.0.1:56945 (IP=127.0.0.1:3890) > conn=0 op=0 DEL dn="cn=test" > conn=0 op=1 EXT oid=1.3.6.1.1.8 > conn=0 op=1 CANCEL msg=1 > conn=0 op=0 ABANDONED > conn=0 op=2 UNBIND > conn=0 op=1 RESULT oid= err=120 text= > conn=0 fd=9 closed > <server closed connection, client exited> > ^C slapd > > If the client exits after send(cancel()): > conn=0 fd=9 ACCEPT from IP=127.0.0.1:48826 (IP=127.0.0.1:3890) > conn=0 op=0 DEL dn="cn=test" > conn=0 op=1 EXT oid=1.3.6.1.1.8 > conn=0 op=1 CANCEL msg=1 > conn=0 op=2 UNBIND > conn=0 op=0 ABANDONED > <not closing connection> > ^C slapd > daemon: shutdown requested and initiated. > slapd shutdown: waiting for 1 operations/tasks to finish > <slapd is hanging> > kill -KILL<slapd> > > slapd.conf: > include servers/slapd/schema/core.schema > allow update_anon > database ldif > directory "." > suffix "cn=test" > > Patches to slapd: > > Index: cancel.c > --- cancel.c 21 Jan 2009 23:40:25 -0000 1.30 > +++ cancel.c 11 May 2009 04:42:58 -0000 > @@ -92,4 +92,8 @@ > } > > + { > + struct timeval timeout = { 0, 200000 }; > + select(0, NULL, NULL, NULL,&timeout); > + } > o->o_cancel = SLAP_CANCEL_REQ; > > Index: delete.c > --- delete.c 21 Jan 2009 23:40:26 -0000 1.144 > +++ delete.c 11 May 2009 04:42:58 -0000 > @@ -75,4 +75,8 @@ > op->o_log_prefix, op->o_req_dn.bv_val, 0, 0, 0 ); > > + { > + struct timeval timeout = { 0, 200000 }; > + select(0, NULL, NULL, NULL,&timeout); > + } > if( op->o_req_ndn.bv_len == 0 ) { > Debug( LDAP_DEBUG_ANY, "%s do_delete: root dse!\n", > Index: result.c > --- result.c 11 May 2009 02:23:51 -0000 1.331 > +++ result.c 11 May 2009 04:42:58 -0000 > @@ -418,4 +418,6 @@ > if (( rs->sr_err == SLAPD_ABANDON || op->o_abandon )&& !op->o_cancel ) { > rc = SLAPD_ABANDON; > + Statslog( LDAP_DEBUG_STATS, > + "%s ABANDONED\n", op->o_log_prefix, 0, 0, 0, 0 ); > goto clean2; > } > > -- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/
hyc@symas.com writes: >> slapd/cancel.c sets o_abandon before o_cancel. Thus it's possible for >> the canceled operation to obey o_abandon before o_cancel gets set. > > Cancel is just best-effort, why not just set o_cancel before o_abandon? Memory coherence. Without mutex protection, the cancelled thread might still see the o_abandon change before the o_cancel change. And thus not send a response, nor update info for the Cancel operation to react to. -- Hallvard
Hallvard B Furuseth wrote: > hyc@symas.com writes: >>> slapd/cancel.c sets o_abandon before o_cancel. Thus it's possible for >>> the canceled operation to obey o_abandon before o_cancel gets set. >> >> Cancel is just best-effort, why not just set o_cancel before o_abandon? > > Memory coherence. Without mutex protection, the cancelled thread might > still see the o_abandon change before the o_cancel change. And thus not > send a response, nor update info for the Cancel operation to react to. > I'd prefer to condense it all into a bitfield. We already ensure that the c_mutex is held when setting these flags. It would be a lot of overhead to grab the mutex just to read the flag, and would require additional analysis to make sure the new locking behavior doesn't introduce new deadlocks. -- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/
Howard Chu writes: > I'd prefer to condense it all into a bitfield. Right... keep both o_abandon and o_cancel (maybe renamed to o_dummy) for RE24 though, otherwise we have a silent binary incompatible change with already-compiled slapd modules. The change will in any case break modules that play with o_cancel though. > We already ensure that > the c_mutex is held when setting these flags. Not yet, but c_mutex it is:-) Must protect the following code. Or rather, bitflag code which will replace it: - o_cancel access in cancel.c, connection.c:connection_operation(), back-relay/op.c. - o_abandon setting in overlays/syncprov.c:syncprov_op_abandon(), and in overlays/retcode.c where hopefully the first case should have been if ( op2.o_abandon ) op->o_abandon = 1; so that we can never reset o_abandon once it is set. > It would be a lot of > overhead to grab the mutex just to read the flag, > and would require additional analysis to make sure the new locking > behavior doesn't introduce new deadlocks. Checking o_abandon!=0 works unmutexed. (Well, technically any such unmutexed access is wrong, but in practice we should be OK.) As long as we: - never reset o_abandon like retcode above can do, and - never unmutexed assume consistency betweeen abandon-related variables set by different threads. So the equivalent of this should be OK: if ( op->o_abandon ) rs->sr_err = SLAPD_ABANDON; ... if ( rs->sr_err == SLAPD_ABANDON ) { lock c_mutex; more careful o_abandon checks; unlock c_mutex; } While I'm looking at this: The ldap_pvt_thread_yield() in cancel.c is really ugly. There is no cond to attach to it to without breaking binary compatibility, but even a even a global mutex+cond might be better, with threads broadcasting on it when it might be relevant. It's still ugly though. Cancelling an expensive operation ought to be a friendly operation, but instead it is unfriendly in that yet another thread is occupied, solely to wait for that operation. What would make sense (for RE25) would be to redo Cancel so that the cancelled thread is responsible for sending the Cancel response or resubmitting an operation which will. Also, the be_cancel implementation in at least back-relay looks wrong. Cancel isn't dispatched on DN, and back-relay with no 'relay' directive configured cannot know where to dispatch. I haven't checked how it works, but it seems a call to be_cancel means the Cancel operation _might_ belong to this backend. But it might not, which means back-relay shouldn't set op->o_cancel in this case. What it should do I don't know though. And I don't know if syncprov is doing any better. -- Hallvard
changed notes moved from Incoming to Software Bugs
Well, this turned out a lot of problems. Currently blocked by ITS#6138 Bad Cancel/Abandon/"internal abandon"/Syncprov interactions. back-relay/op.c done (ITS#6133). syncprov/retcode issues may fit better under ITS#6138. Bitflags should be unnecessary as such, just need a set of values. Though it wouldn't hurt if they were chosen so one could use bit operations to reduce checks for "o_abandon == A or B". "Suppress response" value mentioned in ITS#6138 also needed, maybe that must be a bitflag when combined with Cancel stuff. Some tidbits uglifying a fix: * op->o_conn->c_mutex is locked when be->be_abandon() is called, but not when be->be_cancel() is called. * send_ldap_ber() needs the lock while it holds &conn->c_write1_mutex, but code elsewhere indicates the reverse lock order. Question: syncprov_op_abandon() holds &si->si_ops_mutex when setting so->s_op->o_abandon. Does it need that? The function needs to grab op->o_conn->c_mutex when called as Cancel, but must not do that while holding &si->si_ops_mutex since Abandon grabs the locks in the opposite order. -- Hallvard
h.b.furuseth@usit.uio.no wrote: > Well, this turned out a lot of problems. Currently blocked by > ITS#6138 Bad Cancel/Abandon/"internal abandon"/Syncprov interactions. > > back-relay/op.c done (ITS#6133). syncprov/retcode issues may fit better > under ITS#6138. > > Bitflags should be unnecessary as such, just need a set of values. > Though it wouldn't hurt if they were chosen so one could use bit > operations to reduce checks for "o_abandon == A or B". "Suppress > response" value mentioned in ITS#6138 also needed, maybe that must be a > bitflag when combined with Cancel stuff. Yes, it still seems some bitflags will be more appropriate. > Some tidbits uglifying a fix: > * op->o_conn->c_mutex is locked when be->be_abandon() is called, > but not when be->be_cancel() is called. > * send_ldap_ber() needs the lock while it holds&conn->c_write1_mutex, > but code elsewhere indicates the reverse lock order. > > Question: > syncprov_op_abandon() holds&si->si_ops_mutex when setting > so->s_op->o_abandon. Does it need that? The function needs to > grab op->o_conn->c_mutex when called as Cancel, but must not do > that while holding&si->si_ops_mutex since Abandon grabs the > locks in the opposite order. syncprov_op_abandon() can remove ops from the list, so yes, it must hold si_ops_mutex. -- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/
hyc@symas.com writes: >> Bitflags should be unnecessary as such, just need a set of values. >> Though it wouldn't hurt if they were chosen so one could use bit >> operations to reduce checks for "o_abandon == A or B". "Suppress >> response" value mentioned in ITS#6138 also needed, maybe that must be a >> bitflag when combined with Cancel stuff. > > Yes, it still seems some bitflags will be more appropriate. As long as we decide which flag combinations are (not) valid. Problem with "unrestricted" bitflags compared to a list of values is that they allow and thus require support for flag combinations we have no real need to support. >> Question: >> syncprov_op_abandon() holds&si->si_ops_mutex when setting >> so->s_op->o_abandon. Does it need that? The function needs to >> grab op->o_conn->c_mutex when called as Cancel, but must not do >> that while holding&si->si_ops_mutex since Abandon grabs the >> locks in the opposite order. > > syncprov_op_abandon() can remove ops from the list, so yes, it must > hold si_ops_mutex. Right, but does it need si_ops_mutex while setting o_abandon? Cancel could remove the op, unlock si_ops_mutex, lock c_mutex, update o_abandon. Alternative: Lock c_mutex first (that must be safe since Abandon is doing it anyway), then lock si_ops_mutex. So it's not a critical difference, just that a lock must be held a bit longer. -- Hallvard
changed notes
changed notes changed state Open to Partial
Hrmph, the placement of the 'so->s_op->o_abandon = 1;' which I wanted to move out of the mutex lock derives from syncprov.c rev 1.20 commented "Fix abandon sequencing":-( -- Hallvard
syncprov (msg#7), then see ITS#6138. Some fixes in HEAD. Some fixes in RE24
Sent from my iPhone
This might be related or the same issue. With ITS#9124 and ITS#9145 fixed, if you run: ldapsearch -E '!sync=rp' -E '!cancel' syncprov has freed the operation and cancel_extop() still tries to touch the same Operation pointer without checking it's still in the connection list. At least valgrind isn't happy. -- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com Packaged, certified, and supported LDAP solutions powered by OpenLDAP