6059 – Abandon syncprov race condition?

Status:	VERIFIED FIXED

Alias:	None

Product:	OpenLDAP
Classification:	Unclassified
Component:	slapd
Version:	2.4.16
Hardware:	All All

Importance:	--- normal
Target Milestone:	---
Assignee:	OpenLDAP project

URL:
Keywords:

Depends on:
Blocks:

Reported:	2009-04-14 16:34 UTC by Rein
Modified:	2020-03-19 15:28 UTC
CC List:	0 users

See Also:	6138

Description Rein 2009-04-14 16:34:53 UTC

Full_Name: Rein Tollevik
Version: 2.4.16
OS: linux
URL: 
Submission from: (NULL) (81.93.160.250)
Submitted by: rein


I've had two cases where a delete operation was performed on the master without
being replicated to its consumers, which so far appear to be cases of possible
connection lost (abandon) race conditions.  The log (level: stats) shows the
"DEL" message of the entry, immediately followed by a "closed (connection lost)"
message on the connection.  Note: No "RESULT" message was logged.

I haven't looked very much into this, but my theory so far is that syncprov
skipped replicating the delete op after noticing the abandon resulting from
losing the connection, even though the delete had already taken place in the
local database.  That it happened after a delete op might very well have been a
coincidence; this possible race could exist after any modify op for all I know.

Do we need some sort of o_committed flag that can be used to prevent o_abandon
from being set or acted upon? Or handle o_abandon more like o_cancel, i.e. with
multiple values, including "too late"?
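
A minimal sketch of what such a flag could look like, assuming a single
atomic state word per operation.  The names (sketch_op, sketch_op_commit,
sketch_op_abandon) are hypothetical and are not slapd's Operation fields or
functions; the point is only that whichever of "commit" and "abandon" gets
there first wins, and a late abandon is told "too late":

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical operation states; not slapd's actual o_abandon/o_cancel. */
enum sketch_state { SKETCH_PENDING, SKETCH_COMMITTED, SKETCH_ABANDONED };
enum sketch_abandon_result { SKETCH_ABANDON_OK, SKETCH_ABANDON_TOO_LATE };

struct sketch_op {
    _Atomic int state;
};

void sketch_op_init(struct sketch_op *op)
{
    atomic_init(&op->state, SKETCH_PENDING);
}

/* Called by the worker just before it writes to the database.
 * Returns false if the operation was abandoned first. */
bool sketch_op_commit(struct sketch_op *op)
{
    int expected = SKETCH_PENDING;
    return atomic_compare_exchange_strong(&op->state, &expected,
                                          SKETCH_COMMITTED);
}

/* Called when the client abandons the op or its connection is lost.
 * Once the op is committed the abandon is refused, so later stages
 * (such as replication) still see a normal, completed operation. */
enum sketch_abandon_result sketch_op_abandon(struct sketch_op *op)
{
    int expected = SKETCH_PENDING;
    if (atomic_compare_exchange_strong(&op->state, &expected,
                                       SKETCH_ABANDONED))
        return SKETCH_ABANDON_OK;
    /* On failure, expected holds the state that was actually found. */
    return expected == SKETCH_COMMITTED ? SKETCH_ABANDON_TOO_LATE
                                        : SKETCH_ABANDON_OK;
}
```

With this shape, an abandon that arrives after the commit point would get
SKETCH_ABANDON_TOO_LATE back and the op would run its normal response path.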

Rein Tollevik
Basefarm AS

Comment 1 Hallvard Furuseth 2009-04-17 14:24:15 UTC

rein@OpenLDAP.org writes:
> I've had two cases where a delete operation was performed on the
> master without being replicated to its consumers, which so far appear
> to be cases of possible connection lost (abandon) race conditions.

Not sure if this is the problem, but it is ugly: slapd/cancel.c sets
o_abandon with op->o_conn->c_mutex locked, but waits to set o_cancel
after it's unlocked.  Looks like that can give slapd a chance to react
to o_abandon before it "knows" that abandon is actually a cancel.
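
To illustrate the ordering concern, here is a rough sketch that sets both
flags inside one critical section.  The types and functions (cancel_conn,
cancel_op, cancel_request, cancel_read_flags) are hypothetical stand-ins, the
flags are simplified to booleans, and the sketch assumes that readers take
the same mutex, which may not match what slapd does:

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins; these are not slapd's Connection or Operation. */
struct cancel_conn {
    pthread_mutex_t mutex;      /* plays the role of c_mutex */
};

struct cancel_op {
    struct cancel_conn *conn;
    bool abandon;               /* plays the role of o_abandon */
    bool cancel;                /* plays the role of o_cancel */
};

/* Mark the operation as cancelled.  Both flags change while the mutex is
 * held, so a reader using the same mutex never sees "abandoned" without
 * also seeing that the abandon is really a cancel. */
void cancel_request(struct cancel_op *op)
{
    pthread_mutex_lock(&op->conn->mutex);
    op->cancel = true;
    op->abandon = true;
    pthread_mutex_unlock(&op->conn->mutex);
}

/* Take a consistent snapshot of both flags. */
void cancel_read_flags(struct cancel_op *op, bool *abandon, bool *cancel)
{
    pthread_mutex_lock(&op->conn->mutex);
    *abandon = op->abandon;
    *cancel = op->cancel;
    pthread_mutex_unlock(&op->conn->mutex);
}
```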

> Do we need some sort of o_committed flag that can be used to prevent
> o_abandon from being set or acted upon? Or handle o_abandon more like
> o_cancel, i.e. with multiple values, including "too late"?

o_cancel is a wrapper around o_abandon, turning result code
SLAPD_ABANDON into LDAP_TOO_LATE etc.  However slap_send_ldap_result()
and send_ldap_response() skip "if (op->o_callback) slap_response_play()"
if o_abandon is set, and "send" SLAPD_ABANDON instead of the result
code.  Can that work right?  The code looks like SLAPD_ABANDON ought to
mean "nothing was done" right up till everything has had a chance to
react the same way to an operation.

-- 
Hallvard

Comment 2 Howard Chu 2009-05-11 02:38:00 UTC

rein@OpenLDAP.org wrote:
> Full_Name: Rein Tollevik
> Version: 2.4.16
> OS: linux
> URL:
> Submission from: (NULL) (81.93.160.250)
> Submitted by: rein
>
>
> I've had two cases where a delete operation was performed on the master without
> being replicated to its consumers, which so far appear to be cases of possible
> connection lost (abandon) race conditions.  The log (level: stats) shows the
> "DEL" message of the entry, immediately followed by a "closed (connection lost)"
> message on the connection.  Note: No "RESULT" message was logged.
>
> I haven't looked very much into this, but my theory so far is that syncprov
> skipped replicating the delete op after noticing the abandon resulting from
> losing the connection, even though the delete had already taken place in the
> local database.  That it happened after a delete op might very well have been a
> coincidence; this possible race could exist after any modify op for all I know.

> Do we need some sort of o_committed flag that can be used to prevent o_abandon
> from being set or acted upon? Or handle o_abandon more like o_cancel, i.e. with
> multiple values, including "too late"?

No. What good can that do, since the connection has already been lost?

It doesn't matter if syncprov fails to send an update to a consumer - the 
consumer's cookie state will let it pick up where it left off when it reconnects.

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Comment 3 Rein 2009-05-11 07:13:43 UTC

hyc@symas.com wrote:
> rein@OpenLDAP.org wrote:

>> I've had two cases where a delete operation was performed on the master without
>> being replicated to its consumers, which so far appear to be cases of possible
>> connection lost (abandon) race conditions.  The log (level: stats) shows the
>> "DEL" message of the entry, immediately followed by a "closed (connection lost)"
>> message on the connection.  Note: No "RESULT" message was logged.
>>
>> I haven't looked very much into this, but my theory so far is that syncprov
>> skipped replicating the delete op after noticing the abandon resulting from
>> losing the connection, even though the delete had already taken place in the
>> local database.  That it happened after a delete op might very well have been a
>> coincidence; this possible race could exist after any modify op for all I know.
> 
>> Do we need some sort of o_committed flag that can be used to prevent o_abandon
>> from being set or acted upon? Or handle o_abandon more like o_cancel, i.e. with
>> multiple values, including "too late"?
> 
> No. What good can that do, since the connection has already been lost?
> 
> It doesn't matter if syncprov fails to send an update to a consumer - the 
> consumer's cookie state will let it pick up where it left off when it reconnects.

It isn't the connection to the syncprov consumer that was lost; it is 
the connection to the client that made the change.  The abandon may 
cause syncprov to abandon the modify op (in syncprov_op_abandon), and it 
will definitely cause the entire response callback to be skipped, which 
is where syncprov sends updates to its consumers.  The change will take 
place in the local database, but it will not be replicated, nor will 
auditlog log it.  Accesslog does log it, though, probably because the 
cleanup callback it enables in accesslog_op_mod explicitly calls its 
response callback.
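
A simplified model of the behaviour described above, to make the difference
concrete.  Everything here (cb_op, cb_send_result and the callbacks) is
hypothetical and much smaller than slapd's real callback machinery; it
assumes the response callback is skipped for an abandoned op while the
cleanup callback still runs, and that an accesslog-style cleanup calls its
response itself in that case:

```c
#include <stdbool.h>
#include <stddef.h>

struct cb_op;
typedef void (*cb_fn)(struct cb_op *op);

/* Hypothetical operation with one response and one cleanup callback. */
struct cb_op {
    bool abandoned;
    cb_fn response;     /* where a syncprov-like overlay would replicate */
    cb_fn cleanup;      /* assumed to run even for abandoned operations */
    int responses_run;
    int cleanups_run;
};

/* Models the send path: response is skipped once the op is abandoned. */
void cb_send_result(struct cb_op *op)
{
    if (!op->abandoned && op->response != NULL)
        op->response(op);
    if (op->cleanup != NULL)
        op->cleanup(op);
}

void cb_response(struct cb_op *op)
{
    op->responses_run++;
}

/* Cleanup that only cleans up: an abandoned op is never "responded" to. */
void cb_plain_cleanup(struct cb_op *op)
{
    op->cleanups_run++;
}

/* Accesslog-style cleanup: runs the response itself if it was skipped. */
void cb_cleanup_calls_response(struct cb_op *op)
{
    op->cleanups_run++;
    if (op->abandoned)
        cb_response(op);
}
```

In this model an abandoned op with cb_plain_cleanup ends with responses_run
at 0, while the same op with cb_cleanup_calls_response ends with 1.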

The syncprov clients will receive updates to the csn when new 
modifications take place, i.e. the clients must be restarted with the 
"-c" option to resync their databases.
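
A toy model of why the cookie does not help here.  Integer values stand in
for real CSNs, and sketch_change, sketch_persist_push and
sketch_lost_changes are hypothetical names; the model only covers changes
pushed to an already connected consumer:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical change record; "csn" is a plain counter, not a real CSN. */
struct sketch_change {
    unsigned long csn;
    bool notified;      /* false if the response callback was skipped */
};

/* Deliver the announced changes in order and advance the consumer's
 * cookie to the newest CSN it has seen.  Returns the number applied. */
size_t sketch_persist_push(const struct sketch_change *log, size_t n,
                           unsigned long *cookie)
{
    size_t applied = 0;
    for (size_t i = 0; i < n; i++) {
        if (!log[i].notified)
            continue;               /* never sent to the consumer */
        if (log[i].csn > *cookie) {
            *cookie = log[i].csn;
            applied++;
        }
    }
    return applied;
}

/* Count changes the consumer never received although its cookie is
 * already at or past their CSN, so the cookie alone will not ask for
 * them again. */
size_t sketch_lost_changes(const struct sketch_change *log, size_t n,
                           unsigned long cookie)
{
    size_t lost = 0;
    for (size_t i = 0; i < n; i++) {
        if (!log[i].notified && log[i].csn <= cookie)
            lost++;
    }
    return lost;
}
```

With three changes where the second one was never announced, the consumer
applies two, its cookie moves to the third CSN, and one change is lost.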

Rein

Comment 4 Hallvard Furuseth 2009-06-03 14:37:33 UTC

changed notes
moved from Incoming to Software Bugs

Comment 5 Howard Chu 2011-11-08 00:03:58 UTC

changed notes
changed state Open to Closed

Comment 6 OpenLDAP project 2014-08-01 21:04:22 UTC

See also ITS#6138 (abandon/cancel)
Fixed by ITS#7062