Re: Replication failure after error fixed



--On Friday, April 20, 2007 11:55 AM +1000 Dave Horsfall <daveh@ci.com.au> wrote:

OpenLDAP 2.3.32 (our policy is to run STABLE unless there's a bugfix we
need).

Most of our sites replicate direct to each other (SyncRepl; you need to
know that data for a country is mastered in that country), except for one
situation:

A <-> B <-> C

A and C are masters for their data, and B is a pure slave.  For political
reasons (i.e. it won't get fixed) A and C cannot replicate direct.

Because a schema change was not made on B, some updated data on A did not
get through.  All well and good, we fix the schema on B, and wait for the
update (we use refreshAndPersist).

Except it never happened.  Blowing away the slave on B caused it to
update (of course), except it still never reached C, until it in turn
was repopulated.

Am I looking at a replication bug?  It seems to me that once the schema
was fixed, the replication should have happened.  Or am I not
understanding how SyncRepl works?

This sounds strikingly similar to a bug I've encountered in the past with delta-syncrepl, where the CSN was incorrectly updated after a failed MOD (the MOD failed because of differences between the servers: the replicas had an overlay loaded that the master didn't). Getting the logs for this has been on my to-do list, but I've been busy with other things. I'll see if I can set some time aside to reproduce it and gather the necessary information so it can be fixed.
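
To make the suspected failure mode concrete, here is a toy model in Python. It is only a sketch and not OpenLDAP code: the `Provider` and `Consumer` classes, the integer CSNs, and the `advance_on_failure` switch are all invented for illustration. The assumption is that a consumer which records a CSN as processed even though the change failed to apply will never ask for that change again, even after the schema is fixed.

```python
# Toy model only; nothing here is OpenLDAP code.
# CSNs are modelled as plain increasing integers (an assumption).

class Provider:
    """Holds an ordered log of changes: (csn, attribute, value)."""

    def __init__(self):
        self.log = []

    def write(self, csn, attr, value):
        self.log.append((csn, attr, value))

    def changes_since(self, csn):
        # Return only the changes newer than the consumer's cookie.
        return [change for change in self.log if change[0] > csn]


class Consumer:
    """Applies changes whose attribute is known to its schema."""

    def __init__(self, schema, advance_on_failure):
        self.schema = set(schema)
        self.advance_on_failure = advance_on_failure  # hypothetical switch
        self.cookie = 0   # highest CSN the consumer believes it has processed
        self.data = {}

    def sync(self, provider):
        for csn, attr, value in provider.changes_since(self.cookie):
            if attr in self.schema:
                self.data[attr] = value
                self.cookie = csn
            elif self.advance_on_failure:
                # Suspected bug: the failed change is recorded as processed.
                self.cookie = csn
            else:
                # Expected behaviour: stop and retry from the same cookie later.
                break


def run(advance_on_failure):
    a = Provider()
    a.write(1, "cn", "x")
    a.write(2, "newAttr", "y")   # needs the schema change

    b = Consumer({"cn"}, advance_on_failure)
    b.sync(a)                    # newAttr is rejected here
    b.schema.add("newAttr")      # schema fixed on B
    b.sync(a)                    # does the missed change arrive?
    return b.data


print(run(advance_on_failure=True))
print(run(advance_on_failure=False))
```

With `advance_on_failure=True` the second sync finds nothing newer than the cookie, so `newAttr` never arrives until the consumer is wiped and reloaded, which matches the symptom described above. Whether this is what actually happened on B can only be confirmed from the logs.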


--Quanah

--
Quanah Gibson-Mount
Senior Systems Software Developer
ITS/Shared Application Services
Stanford University
GnuPG Public Key: http://www.stanford.edu/~quanah/pgp.html