Re: ITS#7052, syncrepl, deletes, and MMR

To: openldap-devel@OpenLDAP.org
Subject: Re: ITS#7052, syncrepl, deletes, and MMR
From: Rein Tollevik <rein@OpenLDAP.org>
Date: Thu, 23 Feb 2012 21:13:59 +0100
In-reply-to: <4F384F7B.8080908@symas.com>
References: <4F384F7B.8080908@symas.com>
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:10.0.2) Gecko/20120216 Thunderbird/10.0.2

On 13.02.12 00:47, Howard Chu wrote:

Long time with little OpenLDAP work, but I'm still around ;-)

I'm still seeing cases where deleted entries are getting resurrected
when a number of concurrent Add/Delete sequences are occurring, with
multiple MMR servers (4 minimum to show the error).

Just for the record, this is not the problem reported in this ITS; the ITS bug is the same as discussed in this thread:


 http://www.openldap.org/lists/openldap-devel/201012/msg00018.html

The queuing of an old CSN done as a fix to ITS#7052 may have introduced a new race condition; an ITS and a fix are coming.

I would prefer a rewrite so that only the frontend assigned CSNs to operations, though. The current situation, where syncrepl attaches the entryCSN or old CSNs to the operation just to prevent the backend from generating new CSNs, appears to me like curing the symptom rather than the sickness.

The problem begins because multiple writes are outstanding, and they are
replicated in persist mode without a CSN in their syncrepl cookie. This
is a normal occurrence when the current op does not correspond to the
last committed CSN.

This looks to me like the root of the problem seen here. Replicating without a CSN implies replicating possibly incomplete state, and when there are multiple paths by which these operations can reach a server we end up with race conditions.

I'd prefer that all changes replicated in persist mode carried a single CSN and were replicated in CSN order (for all CSNs with the same SID, that is). It is probably sufficient to enforce this in MMR mode though.

The replicated changes are already being serialized, so serializing them in CSN order shouldn't stall things noticeably and would eliminate the type of race conditions seen here. And I guess it's already required for delta replication?
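
To make the idea concrete, here is a minimal Python sketch of the consumer side, assuming every replicated change carries exactly one CSN and that changes from a given SID arrive in CSN order. The Consumer class and csn_sid helper are hypothetical names for illustration, not slapd code; the only thing taken from OpenLDAP is the CSN layout (timestamp#count#sid#mod), which is fixed-width and so compares correctly as a string.

```python
from dataclasses import dataclass, field


def csn_sid(csn: str) -> int:
    """Return the server ID, the third '#'-separated field (hex) of a CSN."""
    return int(csn.split("#")[2], 16)


@dataclass
class Consumer:
    """Hypothetical consumer that tracks the newest CSN seen per SID."""
    context_csn: dict = field(default_factory=dict)   # sid -> newest CSN applied
    applied: list = field(default_factory=list)       # changes actually applied

    def receive(self, csn: str, change: str) -> bool:
        sid = csn_sid(csn)
        last = self.context_csn.get(sid)
        # Assumption: per-SID CSN order, so anything not newer is a replay
        # that arrived over another path and can be dropped safely.
        if last is not None and csn <= last:
            return False
        self.context_csn[sid] = csn
        self.applied.append(change)
        return True


consumer = Consumer()
add_csn = "20120223201359.000001Z#000000#001#000000"
del_csn = "20120223201359.000002Z#000000#001#000000"
consumer.receive(add_csn, "add cn=x")      # applied
consumer.receive(del_csn, "delete cn=x")   # applied
consumer.receive(add_csn, "add cn=x")      # replay via another path: dropped
```

With this rule a replayed add cannot resurrect the entry, because the decision no longer depends on whether the entry still exists.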

The major drawback would be that after a refresh syncprov would have to force its consumers to refresh as well. I.e., the first hop in a chain would have to complete its refresh before the next hop starts seeing the updates. But database consistency is most important to me, so I would have no problem living with that.

Because there is no CSN, the consumer doesn't update its cookie state
while performing a particular op.

As a result, if a client does Add/Delete/Add/Delete of the same DN, it's
possible for the Adds to propagate several times (more than the client
actually executed).

Adds and Modifies can usually be rejected if they're too old, because
they carry an entryCSN attribute which can be compared against the
existing entry, even if the consumer cookie state has not been updated.
But Deletes don't carry any attributes, and Deleted entries can't be
checked.
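
The asymmetry described above can be shown with a small Python sketch. It assumes a database modelled as a plain dict from DN to entryCSN; add_is_stale is a hypothetical helper, not an OpenLDAP function.

```python
def add_is_stale(db: dict, dn: str, entry_csn: str) -> bool:
    """Return True if an incoming Add can be rejected as too old."""
    existing_csn = db.get(dn)
    if existing_csn is None:
        # The entry is absent (never added, or already deleted): there is
        # no entryCSN left to compare against, so the Add looks new.
        return False
    return entry_csn <= existing_csn


db = {}
dn = "cn=x,dc=example,dc=com"
csn = "20120223201359.000001Z#000000#001#000000"

db[dn] = csn                                  # the Add is applied
stale_while_present = add_is_stale(db, dn, csn)   # replay is caught
del db[dn]                                    # the Delete is applied
stale_after_delete = add_is_stale(db, dn, csn)    # replay is no longer caught
```

Once the entry is gone, the same Add that was rejected a moment earlier passes the check, which is the window the rest of this thread is about.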

So, given a MMR setup like so:

1 -- 2
|    |
3 -- 4

A sequence of Add/Del/Add/Del performed at server 1 will be replicated
to both 2 and 3 immediately. They will then cascade it to server 4. If
many other writes were occurring at the same time, causing these writes
to be propagated without a cookie CSN, then server 4 will propagate them
back to 3 and 2 respectively, and 3 and 2 will re-add the deleted
entries because they have nothing to check that says the Adds are old.
This cycle only gets broken if server 1 eventually sends an op with
accompanying cookie update, so that all the downstream servers can see
that the ops are old.
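
A toy Python model of this loop, under loudly stated assumptions: ops carry no cookie CSN at all, each server forwards whatever it applies to every peer except the sender, delivery is strictly FIFO, and the only staleness check is the entryCSN comparison on Adds. None of this is slapd's actual logic; simulate and PEERS are made-up names.

```python
from collections import deque

# Square topology from the diagram above: 1--2, 1--3, 2--4, 3--4.
PEERS = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3]}


def simulate(max_deliveries: int) -> dict:
    """Return how many replicated Adds each server applied."""
    db = {server: {} for server in PEERS}    # server -> {dn: entryCSN}
    adds = {server: 0 for server in PEERS}
    queue = deque()                          # (sender, receiver, op, dn, csn)

    def forward(src, exclude, op, dn, csn):
        # Hypothetical rule: pass the op to every peer except its sender.
        for peer in PEERS[src]:
            if peer != exclude:
                queue.append((src, peer, op, dn, csn))

    # Server 1 has done Add then Delete locally (its db is empty again)
    # and replicates both ops with no cookie CSN attached.
    forward(1, None, "add", "cn=x", 1)
    forward(1, None, "del", "cn=x", 2)

    for _ in range(max_deliveries):
        if not queue:
            break
        sender, me, op, dn, csn = queue.popleft()
        entries = db[me]
        if op == "add":
            if dn in entries and csn <= entries[dn]:
                continue                     # stale Add caught by entryCSN
            entries[dn] = csn
            adds[me] += 1
        else:
            if dn not in entries:
                continue                     # nothing left to delete
            del entries[dn]
        forward(me, sender, op, dn, csn)
    return adds


counts = simulate(40)
```

In this model the queue never drains: the Add/Delete pair keeps circulating around the square, and the per-server Add counts keep growing with max_deliveries. The fixed delivery order makes the loop run in one direction only; a real deployment has no such ordering guarantee.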

There are actually two possible race conditions in this configuration when an add/delete is performed on the same DN:

1) The add is sent without a CSN, the delete with one. Assume that the add/delete has been handled by server 3 before it receives them again from server 4. It will then act upon the CSN-less add and discard the delete as already seen, and end up with an entry not present on the origin server.

2) Neither the add nor the delete is sent with a CSN. This can lead to the endless add/delete cycle outlined above when there are loops in the MMR topology. The cycle will only be broken if the same DN is re-added with a CSN; updating the CSN by changing other entries is not sufficient. The wild CSN-less add will be stopped when it reaches a server with the newly added entry, and hence the delete as well. But which servers will end up acting on the delete is yet another race condition :-(
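
Scenario 1 can be walked through with a small Python sketch of server 3. It assumes a consumer that discards any op whose cookie CSN is not newer than its stored cookie, and that falls back to the entryCSN comparison when no cookie CSN is present. Server3 and its receive method are hypothetical, for illustration only.

```python
class Server3:
    """Hypothetical consumer state for server 3 in the square topology."""

    def __init__(self):
        self.entries = {}        # dn -> entryCSN
        self.cookie_csn = None   # newest cookie CSN seen so far

    def receive(self, op, dn, entry_csn=None, cookie_csn=None):
        # An op with a cookie CSN is dropped if that CSN was already seen.
        if cookie_csn is not None and self.cookie_csn is not None:
            if cookie_csn <= self.cookie_csn:
                return "discarded"
        if op == "add":
            # Without a cookie CSN only the entryCSN check remains, and it
            # needs an existing entry to compare against.
            if dn in self.entries and entry_csn <= self.entries[dn]:
                return "discarded"
            self.entries[dn] = entry_csn
        else:
            self.entries.pop(dn, None)
        if cookie_csn is not None:
            self.cookie_csn = cookie_csn
        return "applied"


dn = "cn=x,dc=example,dc=com"
add_csn = "20120223201359.000001Z#000000#001#000000"
del_csn = "20120223201359.000002Z#000000#001#000000"
s3 = Server3()

# First arrival, directly from server 1: add without CSN, delete with CSN.
s3.receive("add", dn, entry_csn=add_csn)
s3.receive("del", dn, cookie_csn=del_csn)

# Second arrival of the same pair, this time relayed by server 4.
replayed_add = s3.receive("add", dn, entry_csn=add_csn)   # acted upon
replayed_del = s3.receive("del", dn, cookie_csn=del_csn)  # already seen
```

After the second arrival the entry exists on server 3 although it was deleted on the origin server: the replayed add is applied and the replayed delete is discarded.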

Hm, given that replication handles add and modify fairly equally, could a modify/delete sequence be sufficient to trigger these race conditions?

OK, upon further digging, this appears to be caused by ITS#6024. rein's
patch prevents the consumer and provider from informing each other of
their SIDs when no CSN is present; this prevents syncprov's propagation
loop detection from working. Sigh. Reverting ITS#6024 patch...

Unfortunately, this will not fix scenario 1, and it will fix scenario 2 only when all loops include the server initiating the change. The rid and sid fields of the cookie are not sufficient for loop detection in the general case, and as such should only be used for optimization.


A new test script which exercises these race conditions is coming.

Rein
