
Multimaster Algorithm (was: Documentation volunteer (which was: Cluster replication: multiple masters))



> So (for the benefit of the archives), it boils down to this:
>
>     In theory, LDAP allows multi-mastering, if the conflicts can be
>     resolved (e.g. timestamps, per NDS).

	Here is an algorithm that I have been working on for safe
multimaster replication.  Timestamps are still unsafe in the real world
for the reasons that have been mentioned.  But I think I have a better
solution.

	Currently the multimaster code avoids an infinite
write-update-write-update loop by looking at the update DN that slurpd
connects with.  Instead of using an update DN, as the current
"experimental" code does, just use the following rules:

1. The LDAP master that is being written to should only "broadcast" an
update to its slaves (which are also all masters) if the entry data
being updated is _different than_ the data that was already in the
database.

	By "different than", I mean the data for the DN update being
written does not match the data for the DN in ldbm (as per the LDAP
compare rules).

	By "broadcast", I mean slurpd connects to all the replica slaves
listed in the /etc/openldap/slapd.conf and propagates the update.

	This functionality would fix the infinite update loop issue
because a master which already has the data -- as would be the case after
the very first iteration of the update loop -- would not continue
propagating the change.  Thus the loop is broken.
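
	To make rule 1 concrete, here is a minimal sketch (in Python,
purely for illustration).  The plain dict stands in for ldbm, '=='
stands in for the LDAP compare rules, and slave.send_update() is a
made-up stand-in for what slurpd actually does when it pushes a change:

# Rule 1 sketch: a master only propagates a write if the incoming data
# differs from what it already stores.
def handle_write(dn, new_entry, database, slaves):
    if database.get(dn) == new_entry:
        # Already holding this data (e.g. it came back around the loop),
        # so do not re-broadcast; the infinite update loop ends here.
        return
    database[dn] = new_entry
    for slave in slaves:
        slave.send_update(dn, new_entry)   # hypothetical; roughly what slurpd does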

	But this does not address what happens if two masters are written
to at "the same time".  For that, see rule 2:

2. When broadcasting updates, an update for a particular DN should only be
immediately broadcast if an update for that DN has not been received
within the last x milliseconds.  If an update *has* been received within
the last, say, 0.1 seconds, then wait "a few milliseconds".

	Those "few milliseconds" that we are waiting gives the other
master a chance to process our simultaneous update of the same entry.
This is analogous to ethernet fallback (is that the right terminology?).
The "few milliseconds" should be a randomly-chosen number within a small
range, again just like ethernet.  (There are algorithms for picking these
"few milliseconds"  times, with fallback if further collisions occur, but
I don't know much about them.  Somebody with ethernet experience will need
to comment.)

	The intent of this rule is to prevent an infinite update loop in
the special case where the same entry is written on two or more masters
at "the same time" -- that is, before any of those masters has had a
chance to propagate its (differing) change to the others.
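
	Here is a similar sketch of rule 2, again in Python.  The
0.1-second window comes from the text above; the backoff bounds, and the
idea that last_received is maintained by whatever code accepts incoming
updates, are just assumptions for illustration:

import random
import time

RECENT   = 0.1       # "an update has been received within the last 0.1 seconds"
MIN_WAIT = 0.002     # assumed lower bound of the random wait (seconds)
MAX_WAIT = 0.020     # assumed upper bound of the random wait (seconds)

def broadcast_update(dn, entry, last_received, slaves):
    # last_received maps each DN to the time the previous update for it
    # arrived (from a client or from another master's slurpd); keeping it
    # current is assumed to happen in the receive path, not shown here.
    now = time.monotonic()
    if now - last_received.get(dn, float("-inf")) < RECENT:
        # Probably a "simultaneous" write on another master: wait a few
        # randomly-chosen milliseconds, as with Ethernet collision
        # backoff, so the two masters stop retransmitting in lockstep.
        time.sleep(random.uniform(MIN_WAIT, MAX_WAIT))
    for slave in slaves:
        slave.send_update(dn, entry)    # hypothetical; roughly what slurpd does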

	Here's an example:

	slapd_A and slapd_B are both masters replicating to each other.

slapd_A gets an update for the DN foo.  Let's say foo is set to 5.

slapd_B gets an update for the DN foo at "the same time", meaning, before
it knows anything about the '5' update that has just occurred at slapd_A.
slapd_B's update is setting foo to '7'.

	They both broadcast their updates to one another at the same time,
meaning, the slurpd packets are travelling on the network at the same
time.

slapd_B now gets the '5' update (from slapd_A's slurpd).  It thinks "hey,
I just got an update to '7' for this DN, so I'll wait a few ms before
broadcasting this latest update to all my slaves."

slapd_A now gets the '7' update (from slapd_B's slurpd).  It thinks "hey,
I just got an update to '5' for this DN, so I'll wait a few ms before
broadcasting this latest update to all my slaves."

	Now the "few milliseconds" in the sentences above will be randomly
chosen (within a certain range), so that the next iteration of the above
update process will occur asynchronously, meaning one will occur before
the other, meaning, our problem is solved.  Once we get to this point,
it's as if the enduser did a write on one master, and then, later in time,
did a write to the other master.
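
	To convince myself of this, here is a toy discrete-event
simulation of the slapd_A / slapd_B example (again just a sketch: the
latency and backoff constants are made up, and I'm assuming that when
the random wait expires a master broadcasts its *latest* value for the
DN, not the stale one it was holding when the wait started):

import heapq
import itertools
import random

NET_DELAY = 0.005            # assumed one-way slurpd propagation delay (seconds)
RECENT    = 0.100            # rule 2 window: "within the last 0.1 seconds"
BACKOFF   = (0.001, 0.050)   # assumed range for the random wait (seconds)

seq = itertools.count()      # tie-breaker so equal-time events heap-sort cleanly

class Master:
    def __init__(self, name):
        self.name = name
        self.db = {}         # stands in for ldbm
        self.last_recv = {}  # DN -> time the previous update for it arrived
        self.pending = set() # DNs that already have a delayed broadcast queued
        self.peers = []

    def receive(self, now, dn, value, events):
        recent = now - self.last_recv.get(dn, float("-inf")) < RECENT
        self.last_recv[dn] = now
        if self.db.get(dn) == value:            # rule 1: already have it, stop
            return
        self.db[dn] = value
        if not recent:                          # rule 2: safe to send right away
            self.broadcast(now, dn, events)
        elif dn not in self.pending:            # rule 2: wait a random few ms
            self.pending.add(dn)
            expiry = now + random.uniform(*BACKOFF)
            heapq.heappush(events, (expiry, next(seq), "flush", self, dn, None))

    def flush(self, now, dn, events):
        self.pending.discard(dn)
        self.broadcast(now, dn, events)         # sends whatever is latest now

    def broadcast(self, now, dn, events):
        for peer in self.peers:
            heapq.heappush(events, (now + NET_DELAY, next(seq), "deliver",
                                    peer, dn, self.db[dn]))

A, B = Master("slapd_A"), Master("slapd_B")
A.peers, B.peers = [B], [A]

# Both masters get a client write for the DN "foo" at "the same time".
events = [(0.0, next(seq), "deliver", A, "foo", 5),
          (0.0, next(seq), "deliver", B, "foo", 7)]
heapq.heapify(events)

while events:
    now, _, kind, master, dn, value = heapq.heappop(events)
    if kind == "deliver":
        master.receive(now, dn, value, events)
    else:
        master.flush(now, dn, events)

print(A.db, B.db)    # both should end up with the same value for "foo"

	Running this, both databases should end up holding the same value
for foo, though sometimes only after a couple of extra round trips when
the random waits happen to land too close together.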

	Note that above I was talking about "a few milliseconds", but
there's no reason that couldn't also be "a few seconds", to minimize the
likelihood of collisions and reduce network traffic.  I, for one, would be
satisfied if multimaster replication worked safely, even if there were a
2- or 5-second delay before updates made it to all the other masters.

	Can anyone see a reason why the above algorithm would not ensure
multimaster replication without conflicts?  If somebody can help me with
the backoff algorithms, I would be happy to work on implementing this.
In fact, this could all be implemented in slurpd, so we wouldn't have to
muck around with the slapd code at all (so no more "experimental" #defined
junk in the database code).


Thanks,
Derek