
contextCSN propagation problems

The worst problems we had after upgrading to 2.4.x seem to be over now, and replication appears to work (at least mostly) as it should. One problem that still remains is that the contextCSN attributes themselves don't propagate as I wish they would. Before I start coding or filing more ITSes I would like comments on whether my analysis of the problem seems correct.

First, a description of our configuration is needed. I think a test script for this configuration would be nice to have, so I'm going to create that before I do anything else. I currently think of it as syncrepl-asymmetric, but suggestions for better names are welcome.

Now the configuration, which is centered around a central master server. It has a glued database with a set of subordinates. The central master is the (sole) master for most of these subordinates, but there is also a set of remote site-masters that each have one subordinate database that they are the master for and that is replicated back to the central master.

The site-masters have a similar glued configuration, and they replicate the glue entry from the central master. Different rootdn values on the subordinates managed by syncrepl and on the one the site-master is the master for prevent syncrepl from wiping out the content of that database during the present phase. None of the site-masters receives all the subordinate databases the central master has, and which ones they receive varies. This is controlled by ACL rules on the central master.

On each of the sites (including the one where the central master is) there are search-only servers that replicate the glue suffix from their site-master (or the central master). The search servers have a single database (for historical reasons), but their layout shouldn't matter very much.

All of the master servers use syncprov on the glue database, and every server except the central master uses syncrepl on the glue database. The central master cannot use it there, as that would cause it, during the refresh phase, to wipe out those subordinates that aren't on the site-master it replicates from. Different serverIDs are used on the master servers, so an up-to-date contextCSN set should include as many values as there are master servers.

Now to the problem. If a modification is made (on the central master) to a subordinate database that isn't replicated to one of its consumers, that consumer will not receive the updated contextCSN value from the central master. This means that I cannot monitor the contextCSN values to verify that replication is working as it should. It also means that the consumers will (after a restart) present an outdated contextCSN set during the refresh phase, even though their database content is up to date. This is corrected when/if an update is made to a subordinate db that is replicated to the consumer.
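
To make the monitoring problem concrete, here is a rough Python sketch of the kind of check that breaks. It assumes the usual CSN layout (timestamp#count#SID#mod) with fixed-width fields; the function names and the sample values are made up for illustration and are not taken from any real server.

```python
def csn_by_sid(csns):
    """Map serverID -> newest CSN from a set of contextCSN values.

    Assumes the CSN layout timestamp#count#sid#mod.
    """
    newest = {}
    for csn in csns:
        sid = csn.split("#")[2]
        # Assumption: fields are fixed-width, so plain string
        # comparison orders CSNs that share a SID.
        if sid not in newest or csn > newest[sid]:
            newest[sid] = csn
    return newest


def lagging_sids(provider_csns, consumer_csns):
    """Return the SIDs for which the consumer is behind the provider."""
    provider = csn_by_sid(provider_csns)
    consumer = csn_by_sid(consumer_csns)
    return sorted(
        sid for sid, csn in provider.items()
        if consumer.get(sid, "") < csn
    )


# Hypothetical values: SID 001 is the central master, SID 002 a site-master.
central = [
    "20090115101500.000000Z#000000#001#000000",
    "20090115093000.000000Z#000000#002#000000",
]
consumer = [
    "20090115100000.000000Z#000000#001#000000",  # older than the provider's
    "20090115093000.000000Z#000000#002#000000",
]

print(lagging_sids(central, consumer))
```

With the behaviour described above, such a check would flag SID 001 as lagging whenever the last change on the central master hit a subordinate the consumer doesn't receive, even though the consumer's content is complete.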

A worse problem is when the modification is made to a subordinate db that the central master replicates from one of the site-masters. In this case there will never be any updates from that site-master that should be propagated. So the consumers will, until they restart, be stuck with the contextCSN value for that remote site-master which they received after the present phase.

Or, oh well, that is not quite true... Due to a bug, syncprov (on the central master) fails to detect and filter out the contextCSN update that syncrepl makes on a subordinate db. This means that the central master sends an update of the glue entry to its consumers, so the value is propagated from the central master to its immediate consumers. But it will not propagate further from the site-masters to the search servers on their site. As this bug makes things work at least partly as I wish, I'm a bit reluctant to have it fixed yet...

So far my proposed solution to this problem is that syncprov_matchops() should, when a modification neither matches in test_filter() nor is a deleted entry, send a sync info protocol message with the updated contextCSN in the newCookie field to its consumers. Does this sound like a valid solution? There seems to be support in syncrepl.c for receiving these messages, and in syncprov.c for sending them. But as far as I can see they are never actually sent.
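
As a toy model of the proposed decision (Python, not slapd code): the function name, the dictionary layout of a change and the message tuples below are all hypothetical and only illustrate the three cases. The "deleted" flag stands in for whatever syncprov_matchops() uses to decide that an entry must be reported as deleted to a consumer.

```python
def messages_for_consumer(change, consumer_filter, new_cookie):
    """Decide what a provider sends one consumer for one change (toy model)."""
    if change["deleted"]:
        # The consumer must be told to drop the entry.
        return [("delete", change["dn"], new_cookie)]
    if consumer_filter(change):
        # Normal case: the entry is sent along with the new cookie.
        return [("entry", change["dn"], new_cookie)]
    # Proposed addition: nothing to send for this consumer, but still
    # tell it about the new contextCSN via a sync info newCookie message.
    return [("sync_info_newcookie", None, new_cookie)]


# Hypothetical consumer that only receives ou=siteA.
def site_a_only(change):
    return change["dn"].endswith("ou=siteA,dc=example,dc=com")


cookie = "csn=20090115101500.000000Z#000000#001#000000"
change = {"dn": "cn=x,ou=siteB,dc=example,dc=com", "deleted": False}

print(messages_for_consumer(change, site_a_only, cookie))
```

In this model a consumer that doesn't receive ou=siteB still gets the new cookie, which is the effect I am after.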

Rein Tollevik
Basefarm AS