
Re: contextCSN of subordinate syncrepl DBs

Rein Tollevik wrote:
> I've been trying to figure out why syncrepl used on a backend that is 
> subordinate to a glue database with the syncprov overlay should save the 
> contextCSN in the suffix of the glue database rather than the suffix of 
> the backend where syncrepl is used.  But all I come up with are reasons 
> why this should not be the case.  So, unless anyone can enlighten me as 
> to what I'm missing, I suggest that this be changed.
>
> The problem with the current design is that it makes it impossible to 
> reliably replicate more than one subordinate db from the same remote 
> server, as there are now race conditions where one of the subordinate 
> backends could save an updated contextCSN value that is picked up by the 
> other before it has finished its synchronization. An example of a 
> configuration where more than one subordinate db replicated from the 
> same server might be necessary is the central master described in my 
> previous posting in 
> http://www.openldap.org/lists/openldap-devel/200806/msg00041.html
>
> My idea as to how this race condition could be verified was to add 
> enough entries to one of the backends (while the consumer was stopped) 
> to make it possible to restart the consumer after the first backend had 
> saved the updated contextCSN but before the second has finished its 
> synchronization.  But I was able to produce it simply by adding or 
> deleting an entry in one of the backends before starting the consumer.  
> Far too often the backend without any changes was able to pick up and 
> save the updated contextCSN from the producer before syncrepl on the 
> second backend fetched its initial value.  I.e., it started with an 
> updated contextCSN and didn't receive the changes that had taken place 
> on the producer.  If syncrepl stored the values in the suffix of their own 
> database then they wouldn't interfere with each other like this.
>
> There is a similar problem in syncprov, as it must use the lowest 
> contextCSN value (with a given sid) saved by the syncrepl backends 
> configured within the subtree where syncprov is used.  But to do that it 
> also needs to distinguish the contextCSN values of each syncrepl 
> backend, which it can't do when they all save them in the glue suffix.
>
> This also implies that syncprov must ignore contextCSN updates from 
> syncrepl until all syncrepl backends have saved a value, and that 
> syncprov on the provider must send newCookie sync info messages when it 
> updates its contextCSN value when the changed entry isn't being 
> replicated to a consumer.  I.e., as outlined in the message referred to above.
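
The interleaving described in the quoted text can be replayed with a toy model. This is only a sketch, not slapd code: the suffixes, the CSN values and the names `replay` and `lowest_csn` are invented for illustration, and a single SID is assumed so that the fixed-width CSN strings can be compared directly.

```python
# Toy model only: suffixes, CSN values and function names are hypothetical.
GLUE = "dc=example,dc=com"
SUB_A = "ou=a,dc=example,dc=com"   # has pending changes on the producer
SUB_B = "ou=b,dc=example,dc=com"   # has nothing to fetch

CSN_OLD = "20080601000000.000000Z#000000#001#000000"
CSN_NEW = "20080612000000.000000Z#000000#001#000000"


def replay(storage_suffix):
    """Replay the problematic interleaving.

    storage_suffix maps a backend suffix to the suffix whose entry holds
    that backend's contextCSN. Returns (cookie A starts with, saved values).
    """
    saved = {storage_suffix(s): CSN_OLD for s in (SUB_A, SUB_B)}
    # B has no changes to fetch, so it finishes first and saves the
    # producer's current contextCSN.
    saved[storage_suffix(SUB_B)] = CSN_NEW
    # Only now does syncrepl on A read the value it will start from.
    return saved[storage_suffix(SUB_A)], saved


def lowest_csn(saved):
    """Lowest saved value (single SID assumed, so string order is CSN order)."""
    return min(saved.values())


shared_start, _ = replay(lambda suffix: GLUE)         # all values in the glue suffix
own_start, own_saved = replay(lambda suffix: suffix)  # one value per backend
print(shared_start == CSN_NEW)   # A would start from the new CSN and skip its changes
print(own_start == CSN_OLD)      # A still starts from its own old CSN
print(lowest_csn(own_saved) == CSN_OLD)
```

With per-backend storage the lowest saved value stays at the old CSN until both backends have caught up, which is the value the quoted text says syncprov would need to use.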

It appears that the current code is sending newCookie messages pretty much all
the time. It's definitely too chatty now, and it appears that it's breaking
test050 sometimes, though I still haven't identified exactly why. I thought it
was because the consumer was accepting the new cookie values unconditionally,
but even after filtering out old values test050 still failed. #if'ing out the
relevant code in syncprov.c makes test050 run fine though. (syncprov.c:1675
through 1723.)
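
The "filtering out old values" step can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the consumer code in slapd: `merge_cookie` and `sid_of` are hypothetical names, and the CSN layout `timestamp#count#sid#mod` with fixed-width fields is assumed so that CSNs with the same SID compare correctly as strings. As noted above, this kind of filter alone did not make test050 pass.

```python
def sid_of(csn):
    # Assumed CSN layout: timestamp#count#sid#mod
    return csn.split("#")[2]


def merge_cookie(stored, incoming):
    """Return a copy of `stored` (dict: SID -> CSN) updated from `incoming`.

    A CSN from the new cookie is accepted only if its SID is unknown or
    it is newer than the value already held for that SID.
    """
    merged = dict(stored)
    for csn in incoming:
        sid = sid_of(csn)
        if sid not in merged or csn > merged[sid]:
            merged[sid] = csn
    return merged


stored = {"001": "20080612000000.000000Z#000000#001#000000"}
stale = "20080601000000.000000Z#000000#001#000000"
fresh = "20080613000000.000000Z#000000#002#000000"
print(merge_cookie(stored, [stale, fresh]))
```

The stale value for SID 001 is ignored, while the previously unknown SID 002 is added.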

> Neither of these changes should interfere with ordinary multi-master 
> configurations where syncrepl and syncprov are both used on the same 
> (glue) database.

Having spent the last 12 hours prodding at test050 I find that whenever I have
it working well, test058 "breaks" with contextCSN mismatches. At this point I
really have to question the rationale behind test058. First and foremost,
syncprov should not be sending gratuitous newCookie messages to consumers
whose search terms are outside the scope of the update. I.e., if the actual
data update didn't go to the consumer, then the following cookie update should
not either. In such an asymmetric configuration, it should be expected that
the contextCSNs will not match across all the servers, and forcing them all to
match is beginning to look like an error, to me.
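
The rule argued for here, that a cookie update should only follow a data update to the same consumer, could be sketched like this. It is an illustration only, not syncprov's logic: `in_subtree` and `consumers_to_notify` are hypothetical names, every consumer is assumed to use a subtree-scope search, search filters are ignored, and DNs are compared by a simple case-insensitive suffix match without proper DN normalization.

```python
def in_subtree(entry_dn, base_dn):
    """True if entry_dn is base_dn or lies below it (naive string match)."""
    entry = entry_dn.lower()
    base = base_dn.lower()
    return entry == base or entry.endswith("," + base)


def consumers_to_notify(updated_dn, search_bases):
    """Names of the consumers whose search base covers the updated entry.

    search_bases: dict mapping a consumer name to its search base DN.
    """
    return [name for name, base in search_bases.items()
            if in_subtree(updated_dn, base)]


bases = {
    "consumer-a": "ou=a,dc=example,dc=com",
    "consumer-b": "ou=b,dc=example,dc=com",
    "consumer-all": "dc=example,dc=com",
}
print(consumers_to_notify("cn=x,ou=a,dc=example,dc=com", bases))
```

In this model consumer-b never sees a cookie for an update under ou=a, so its contextCSN is expected to lag behind the others rather than match them.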

> I'll volunteer to implement and test the necessary changes if this is 
> the right solution.  But to know whether my analysis is correct or not I 
> need feedback.  So, comments please?

  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/