Re: contextCSN of subordinate syncrepl DBs

Howard Chu wrote:

It appears that the current code is sending newCookie messages pretty much all
the time. It's definitely too chatty now, and it appears that it's breaking
test050 sometimes, though I still haven't identified exactly why. I thought it
was because the consumer was accepting the new cookie values unconditionally,
but even after filtering out old values test050 still failed. #if'ing out the
relevant code in syncprov.c makes test050 run fine though. (syncprov.c:1675
thru 1723.)

The newCookie messages should only be sent if the local csn set is updated without being accompanied by a real modification. And in an MMR setup like test050 that should never happen except maybe during the initial replication of the databases.

Any occurrence of a newCookie message should be a symptom of another bug, and I do believe one such race condition exists in syncrepl. One possible scenario, with 4 or more hosts:

server1 makes two or more changes to the db, with csn n and n+1.
server2 receives both, and starts replicating them to server3.
server3 receives and starts processing the first change from server1. It updates cs_pvals in the syncrepl structure with the csn n of the first modification. Then, the same modification is received from server2, but is rejected as being too old. The second modification is received from server2, this time being accepted. This second modification is tagged with csn n+1, which gets stored in the db by syncrepl_updateCookie and picked up by syncprov. syncprov on server3 replicates the second change with csn n+1 to server4. server4 accepts the second modification from server3, without having received the first change. And when that arrives from server1 or 2 it will be rejected as being too old.

If the second modify operation is received and processed by server3 after it has added csn n to the csn queue, but before it is committed, the second modification will be tagged with csn n. The csn being written to the db is still csn n+1 though, which will be picked up by syncprov and trigger a newCookie message. Even without this, the csns stored in the db on server3 are invalid and will result in an incomplete db should it fail before the first modification completes.
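The end state on server4 in the scenario above can be reproduced with a toy model. This is only a sketch: the Consumer class, the integer csns and the "reject anything not newer than what I hold" rule are simplifying assumptions for illustration, not the actual syncrepl code.

```python
class Consumer:
    """Toy replica: tracks the highest csn accepted per sid (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.csn_set = {}   # sid -> highest csn accepted so far
        self.applied = []   # (sid, csn) of the changes actually applied

    def receive(self, sid, csn):
        """Apply a change unless its csn is not newer than the one we hold."""
        if csn <= self.csn_set.get(sid, -1):
            return False    # rejected as too old
        self.csn_set[sid] = csn
        self.applied.append((sid, csn))
        return True


N = 1  # stands in for csn n; N + 1 stands in for csn n+1
server4 = Consumer("server4")

# server3 forwards the second change first (the race described above) ...
first = server4.receive(sid=1, csn=N + 1)
# ... and the first change only arrives later, from server1 or server2.
second = server4.receive(sid=1, csn=N)

print(first, second, server4.applied)  # prints: True False [(1, 2)]
```

The change with csn n never reaches server4's database, yet its csn set claims it is up to date for that sid.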

The csns for any given sid are sent by the originating server in order, so I think the fix should be to always process them in the same order in syncrepl. For each sid in the csn set there should be one mutex, and modifications with any given sid should only take place in the thread holding the mutex. To avoid stalling too long, it must be possible for the other syncrepl stanzas to note that a csn is too old without waiting on the mutex for the csn sid.
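One way the per-sid mutex idea could look is sketched below. SidGate, try_begin and end are hypothetical names, csns are plain integers, and the sketch assumes each syncrepl stanza hands over the csns of a given sid in order. The "too old" check is made under a short-lived guard lock, so a stale csn is rejected without waiting on the sid mutex.

```python
import threading


class SidGate:
    """Serialises modifications per sid; stale csns are rejected without blocking."""

    def __init__(self):
        self._guard = threading.Lock()  # short-lived lock protecting the dicts
        self._locks = {}                # sid -> mutex held while a change is applied
        self._newest = {}               # sid -> newest csn accepted or in progress

    def try_begin(self, sid, csn):
        """Return True with the sid mutex held, or False if csn is too old."""
        with self._guard:
            if csn <= self._newest.get(sid, -1):
                return False            # too old: decided without the sid mutex
            lock = self._locks.setdefault(sid, threading.Lock())
        lock.acquire()                  # wait for our turn on this sid
        with self._guard:
            # Another thread may have handled this csn while we were waiting.
            if csn <= self._newest.get(sid, -1):
                lock.release()
                return False
            self._newest[sid] = csn
        return True

    def end(self, sid):
        """Release the sid mutex once the modification is committed."""
        self._locks[sid].release()


gate = SidGate()
if gate.try_begin(sid=1, csn=7):
    try:
        pass  # apply the modification tagged with csn 7 here
    finally:
        gate.end(1)
```

A stanza that receives a duplicate of a change already in progress gets False back immediately, while a stanza holding a newer csn for the same sid waits until the older change is committed.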

I don't think it is correct for syncrepl to fetch csn values from syncprov either. The only csn syncprov can update is the one with the local sid, and syncrepl should simply ignore modifications tagged with csn values carrying its own sid, provided, that is, that syncrepl starts the replication phase with a csn value with its own sid. This proviso covers the case where a server is being reinitialized from one of its peers: it should then accept any changes that originated on the local server before it was reinitialized. Upon completing the initial replication phase it will receive a csn set that may include its own sid, and it should then start ignoring modifications with that sid.
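My reading of this rule can be written as a small predicate. The function name and the dict representation of the csn set are assumptions made for illustration; the csn set passed in is the one the consumer started with, replaced by the one received when the initial replication phase completes.

```python
def should_apply(change_sid, local_sid, csn_set):
    """Decide whether a consumer applies a change tagged with change_sid.

    csn_set maps sid -> csn. It is assumed to be updated only when the
    initial replication phase completes, not per applied change.
    """
    if change_sid != local_sid:
        return True                   # changes from other servers are always candidates
    # Own sid: ignore once we hold a csn for ourselves; accept while reinitializing.
    return local_sid not in csn_set


# Normal operation: server 1 already holds a csn for its own sid.
print(should_apply(change_sid=1, local_sid=1, csn_set={1: 10, 2: 7}))  # prints: False
# Reinitializing from a peer: no own csn yet, so old local changes are accepted.
print(should_apply(change_sid=1, local_sid=1, csn_set={2: 7}))         # prints: True
```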

Neither of these changes should interfere with ordinary multi-master configurations where syncrepl and syncprov are both used on the same (glue) database.

Having spent the last 12 hours prodding at test050 I find that whenever I have
it working well, test058 "breaks" with contextCSN mismatches. At this point I
really have to question the rationale behind test058. First and foremost,
syncprov should not be sending gratuitous New Cookie messages to consumers
whose search terms are outside the scope of the update. I.e., if the actual
data update didn't go to the consumer, then the following cookie update should
not either. In such an asymmetric configuration, it should be expected that
the contextCSNs will not match across all the servers, and forcing them all to
match is beginning to look like an error, to me.

Whenever the provider makes a local change that should not be replicated to the consumer, the consumer's database state continues to be in sync. Yet its csn set indicates that it isn't, and it will always start out replicating all changes made after the oldest csn it holds, which can be quite a lot. The only way to fix this is to send the newCookie messages.
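To make the argument concrete, here is a rough sketch of the provider-side decision I have in mind. The notify function, the tuple messages and the suffix-based DN matching are all simplifications invented for this example; real scope and filter evaluation is more involved.

```python
def notify(consumer_base, change_dn, change_csn):
    """Return the message a provider would send to one consumer for one change."""
    in_scope = change_dn == consumer_base or change_dn.endswith("," + consumer_base)
    if in_scope:
        return ("entry", change_dn, change_csn)  # real modification, csn rides along
    return ("newCookie", None, change_csn)       # no data sent, csn set still advances


print(notify("ou=people,dc=example,dc=com", "cn=a,ou=people,dc=example,dc=com", 5))
print(notify("ou=people,dc=example,dc=com", "cn=b,ou=groups,dc=example,dc=com", 6))
```

With the second message the consumer's csn set moves to 6 even though no entry was sent, so it has no reason to ask for everything after csn 5 the next time it starts replicating.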