6138 – Bad Cancel/Abandon/"internal abandon"/Syncprov interactions

Issue 6138 - Bad Cancel/Abandon/"internal abandon"/Syncprov interactions

Summary: Bad Cancel/Abandon/"internal abandon"/Syncprov interactions

Status:	VERIFIED FIXED

Alias:	None

Product:	OpenLDAP
Classification:	Unclassified
Component:	slapd (show other issues)
Version:	unspecified
Hardware:	All All

Importance:	High normal
Target Milestone:	2.6.0
Assignee:	Howard Chu

URL:
Keywords:	replication

Depends on:
Blocks:

Reported:	2009-05-22 21:50 UTC by Hallvard Furuseth
Modified:	2021-10-25 22:09 UTC (History)
CC List:	0 users

See Also:	6059 6103 6104 6133 6157 9220

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this issue.

Description Hallvard Furuseth 2009-05-22 21:50:00 UTC

Full_Name: Hallvard B Furuseth
Version: HEAD
OS: 
URL: 
Submission from: (NULL) (129.240.6.233)
Submitted by: hallvard


Might make this a tracking issue for Cancel/Abandon problems.
And/or copy some to separate ITSes, but they all seem interconnected:


Cancel(abandoned operation)-requests are not rejected.  Thus slapd
sets o_cancel and turns an active Abandon into a Cancel.  Presumably
that can confuse Cancel/Abandon handlers, like that in Syncprov.

Similarly, Abandon(abandoned/cancelled operation) is not ignored.
connection_abandon() re-abandons abandoned/cancelled operations too.


However Syncprov:RefreshAndPersist abuses op->o_abandon: It sets it to
mean "Suppress the response.  A copy of this operation will send it."
So if Cancel(op with o_abandon!=0) is fixed to respond protocolError
"already abandoned", presumably that breaks Cancel(RefreshAndPersist).

I'm not touching it with a flagpole - I don't know syncprov.  Help?

Overlay retcode does something similar - sends a response and then sets
o_abandon.

Cancel/Abandon can in any case fail by targeting the wrong operation
though: A connection can have multiple messages with the same IDs when
the response is sent and the client reuses the message ID, before the
old operation in slapd can clean up and finish.

syncprov_op_abandon() identifies messages by (connid, msgid).  Can
multiple messages with the same ID break that, or more importantly,
break what gets sent/written with syncrepl?


Anyway, the current code needs an o_abandon value which means "suppress
response".  Or maybe "abandoned, except as far as future Abandon and
Cancel operations are concerned".  Syncprov and Retcode need to handle
the various possible orderings of Cancel/Abandon vs Suppress, including
when they forward o_abandon from one operation to another.  I haven't
looked too closely at that code either.

ITS#6104 (race condition with cancel operation) also calls for multiple
o_abandon values (joining some or all of o_cancel into o_abandon), and
must be considered when picking a value.  However I suggest to consider
just the various possible values first and get that right, and deal
with the race condition later.

Source and binary compatibility with third-party modules complicates
this, if we're trying to support both and old compiled slapd + new
compiled module and vice versa.  Might try a minimal solution now.

Comment 1 Hallvard Furuseth 2009-05-22 21:50:57 UTC

moved from Incoming to Software Bugs

Comment 2 Hallvard Furuseth 2009-06-02 19:49:25 UTC

back-ldap:extended.c also does "suppress response, it has been sent",
but does it by returning and setting rs->sr_err = SLAPD_ABANDON.  Might
break assumptions somewhere that SLAPD_ABANDON implies o_abandon was
set.  And I guess the hack fails if the operation gets cancelled.

========================================================================

I think these are the Operation states related to Cancel and Abandon:

op->o_abandon is set for these - could extend to multiple values:
A) Operation Abandoned/Cancelled by client.
B) Operation implicitly abandoned by client. (Bind or lost connection)
C) Operation abandoned by server.  (It wants to close the connection)
D) Suppress response - a duplicate of the operation will proceed. (syncprov)
E) Suppress response - final send_ldap_response() was done. (retcode overlay)

rs->sr_err == SLAPD_ABANDON if:
F) The backend obeyed o_abandon.  (Cancel op, if any, will succeed)
G=E) Suppress response - final send_ldap_response() was done. (back-ldap)

op->o_cancel packs these states/values:
H) The o_abandon is due to a Cancel.
I) Cancel operation wants a result, cancelled op must set it and wait.
J) Result is available to the Cancel operation.
K) Result. (LDAP result code, or SLAP_CANCEL_ACK for success)
L) Cancel operation has fetched result, cancelled operation can proceed.

States that fit in none of the above, or poorly so:

M) Operation must not be waited for, e.g. by Cancel.
   Operation is itself waiting for others, e.g. cn=config update.

N) Operation invisible to Abandon/Cancel/internal abandon.
   msgID reusable due to result sent to client.  Also case D (syncprov)?

   Fix by removing the op from op->o_conn->c_ops?  Or does that just
   move the problem around?  Would need to do something to o_conn to
   prevent connection_close() from doing connection_destroy().

O) Operation result has been committed, do not abandon.  ITS#6059.

   But o_abandon can be set while trying to commit, unless this flag is
   set before trying - in which case we can't abandon an operation which
   is failing to commit, which may be when it's most relevant.

   Could reset o_abandon, if anyone can keep straight the consequences.
   Or replace the 'if ( op->o_abandon )' tests with some macro call.
   Still, interactions with other states could be a problem.

About the o_abandon values above:

B can be treated like A, I think.
C differs in that Cancel/Abandon(operation) should not say "already
  abandoned" since the client doen't know about the abandon.
  Could be solved with a vague error message.

D-E differ in that o_abandon gets set even though the backends'
cancel/abandon handlers were not called.  Unsure of the effects of that.

D Syncprov duplicating a Persistent Search operation.
  Handled similar to a server-initiated abandon?  Except if the
  operation cannot be "invisible to Abandon/Cancel" above it must remain
  possible to Abandon/Cancel it.

E Suppress response - response has been sent:
  Set when exiting slap_send_ldap_result() & co?

  Handled similar to a server-initiated abandon?
  At the time slap_send_ldap_result() is called again, the operation
  may have set up things which need to be cleaned up in the normal way.
  Yet it has already gone through that function once, doing callbacks
  etc.  Must "final response" code be prepared to be called twice?

Beyond that, the main problem would be code which transitions to one
state to another, it needs to handle the other cases.

-- 
Hallvard

Comment 3 Hallvard Furuseth 2009-06-03 15:06:19 UTC

changed notes

Comment 4 OpenLDAP project 2014-08-01 21:04:22 UTC

See also ITS#6059, 6103, 6104, (6133), 6157

Comment 5 Quanah Gibson-Mount 2021-06-03 16:49:01 UTC

Can make use of new asyncmeta op code

Comment 6 Quanah Gibson-Mount 2021-06-03 16:49:13 UTC

slapd async op return code, specifically

Comment 7 Howard Chu 2021-07-27 17:18:00 UTC

(In reply to Hallvard Furuseth from comment #0)
> Full_Name: Hallvard B Furuseth
> Version: HEAD
> OS: 
> URL: 
> Submission from: (NULL) (129.240.6.233)
> Submitted by: hallvard
> 
> 
> Might make this a tracking issue for Cancel/Abandon problems.
> And/or copy some to separate ITSes, but they all seem interconnected:
> 
> 
> Cancel(abandoned operation)-requests are not rejected.  Thus slapd
> sets o_cancel and turns an active Abandon into a Cancel.  Presumably
> that can confuse Cancel/Abandon handlers, like that in Syncprov.
> 
> Similarly, Abandon(abandoned/cancelled operation) is not ignored.
> connection_abandon() re-abandons abandoned/cancelled operations too.

Both of the above checks are now uncommented / enabled.
 
> However Syncprov:RefreshAndPersist abuses op->o_abandon: It sets it to
> mean "Suppress the response.  A copy of this operation will send it."
> So if Cancel(op with o_abandon!=0) is fixed to respond protocolError
> "already abandoned", presumably that breaks Cancel(RefreshAndPersist).
> 
> I'm not touching it with a flagpole - I don't know syncprov.  Help?

syncprov has been changed to use the new SLAPD_ASYNCOP returncode so
it no longer needs to muck with this.

> Overlay retcode does something similar - sends a response and then sets
> o_abandon.

I can't see any reason why this is needed. It's been there since the initial
commit but I'm removing it. Any return code != SLAP_CB_CONTINUE already
prevents the frontend from sending its own response so there's no reason
to set this here.
 
> Cancel/Abandon can in any case fail by targeting the wrong operation
> though: A connection can have multiple messages with the same IDs when
> the response is sent and the client reuses the message ID, before the
> old operation in slapd can clean up and finish.
> 
> syncprov_op_abandon() identifies messages by (connid, msgid).  Can
> multiple messages with the same ID break that, or more importantly,
> break what gets sent/written with syncrepl?

Sounds like bad practice in clients, but should still be safe. The messages are queued in order, so only the first matching message will be abandoned. Presumably the first incoming abandon request can only refer to the first queued operation of that ID.

> Anyway, the current code needs an o_abandon value which means "suppress
> response".  Or maybe "abandoned, except as far as future Abandon and
> Cancel operations are concerned".  Syncprov and Retcode need to handle
> the various possible orderings of Cancel/Abandon vs Suppress, including
> when they forward o_abandon from one operation to another.  I haven't
> looked too closely at that code either.

Mooted by use of SLAPD_ASYNCOP returncode.

Comment 8 Quanah Gibson-Mount 2021-07-27 17:36:36 UTC

https://git.openldap.org/openldap/openldap/-/merge_requests/365

Comment 9 Howard Chu 2021-07-27 18:07:27 UTC

(In reply to Hallvard Furuseth from comment #2)
> back-ldap:extended.c also does "suppress response, it has been sent",
> but does it by returning and setting rs->sr_err = SLAPD_ABANDON.  Might
> break assumptions somewhere that SLAPD_ABANDON implies o_abandon was
> set.  And I guess the hack fails if the operation gets cancelled.

Now fixed to always let the frontend handle responses.

But this is still awkward, and will need to be revisited again for ITS#9220.

> ========================================================================
> 
> I think these are the Operation states related to Cancel and Abandon:
> 
> op->o_abandon is set for these - could extend to multiple values:
> A) Operation Abandoned/Cancelled by client.
> B) Operation implicitly abandoned by client. (Bind or lost connection)
> C) Operation abandoned by server.  (It wants to close the connection)
> D) Suppress response - a duplicate of the operation will proceed. (syncprov)
> E) Suppress response - final send_ldap_response() was done. (retcode overlay)
> 
> rs->sr_err == SLAPD_ABANDON if:
> F) The backend obeyed o_abandon.  (Cancel op, if any, will succeed)
> G=E) Suppress response - final send_ldap_response() was done. (back-ldap)

D, E, and G will no longer occur.

> op->o_cancel packs these states/values:
> H) The o_abandon is due to a Cancel.
> I) Cancel operation wants a result, cancelled op must set it and wait.
> J) Result is available to the Cancel operation.
> K) Result. (LDAP result code, or SLAP_CANCEL_ACK for success)
> L) Cancel operation has fetched result, cancelled operation can proceed.
> 
> States that fit in none of the above, or poorly so:
> 
> M) Operation must not be waited for, e.g. by Cancel.
>    Operation is itself waiting for others, e.g. cn=config update.
> 
> N) Operation invisible to Abandon/Cancel/internal abandon.
>    msgID reusable due to result sent to client.  Also case D (syncprov)?
> 
>    Fix by removing the op from op->o_conn->c_ops?  Or does that just
>    move the problem around?  Would need to do something to o_conn to
>    prevent connection_close() from doing connection_destroy().
> 
> O) Operation result has been committed, do not abandon.  ITS#6059.
> 
>    But o_abandon can be set while trying to commit, unless this flag is
>    set before trying - in which case we can't abandon an operation which
>    is failing to commit, which may be when it's most relevant.
> 
>    Could reset o_abandon, if anyone can keep straight the consequences.
>    Or replace the 'if ( op->o_abandon )' tests with some macro call.
>    Still, interactions with other states could be a problem.
> 
> About the o_abandon values above:
> 
> B can be treated like A, I think.
> C differs in that Cancel/Abandon(operation) should not say "already
>   abandoned" since the client doen't know about the abandon.
>   Could be solved with a vague error message.
> 
> D-E differ in that o_abandon gets set even though the backends'
> cancel/abandon handlers were not called.  Unsure of the effects of that.
> 
> D Syncprov duplicating a Persistent Search operation.
>   Handled similar to a server-initiated abandon?  Except if the
>   operation cannot be "invisible to Abandon/Cancel" above it must remain
>   possible to Abandon/Cancel it.
> 
> E Suppress response - response has been sent:
>   Set when exiting slap_send_ldap_result() & co?
> 
>   Handled similar to a server-initiated abandon?
>   At the time slap_send_ldap_result() is called again, the operation
>   may have set up things which need to be cleaned up in the normal way.
>   Yet it has already gone through that function once, doing callbacks
>   etc.  Must "final response" code be prepared to be called twice?
> 
> Beyond that, the main problem would be code which transitions to one
> state to another, it needs to handle the other cases.
> 
> -- 
> Hallvard

Comment 10 Quanah Gibson-Mount 2021-08-03 16:22:51 UTC

Commits: 
  • d3bd4aa7 
by Howard Chu at 2021-08-03T15:19:49+00:00 
ITS#6138 don't allow redundant abandon/cancel ops


  • 5a61175d 
by Howard Chu at 2021-08-03T15:19:49+00:00 
ITS#6138 add lock flag to connection_op_finish()


  • e9e6fd71 
by Howard Chu at 2021-08-03T15:19:49+00:00 
ITS#6138 don't overwrite rs->sr_err after sending response


  • 795add7b 
by Howard Chu at 2021-08-03T15:19:49+00:00 
ITS#6138 use SLAPD_NO_REPLY for persistent searches

The particular code doesn't matter, any result besides
SLAP_CB_CONTINUE always halts overlay/response processing.


  • 658e526b 
by Howard Chu at 2021-08-03T15:19:49+00:00 
ITS#6138 drop unnecessary use of o_abandon


  • 66ed15a2 
by Howard Chu at 2021-08-03T15:19:49+00:00 
ITS#6138 FIx exop handler to ignore SLAPD_ASYNCOPs


  • a54f9985 
by Howard Chu at 2021-08-03T15:19:49+00:00 
ITS#6138 let frontend handle all exop responses


  • 8311c71f 
by Howard Chu at 2021-08-03T15:19:49+00:00 
ITS#6138 fix test timing issue