[Date Prev][Date Next] [Chronological] [Thread] [Top]

Multimaster further work



     All,
	I have been greatly bothered by the fact that OpenLDAP does not
have a sound means for multimaster replication (as per the recent thread
titled "Cluster replication: multiple masters").  I see this as a critical
feature for enterprise-level scalability and redundancy.  The current
practice of depending on a single master writer (with basic failover) does
not scale well enough, in my opinion, and furthermore, in mission-critical
applications this practice provides a single point of failure for the
length of time it takes to fail over to a backup.

	I have an idea that I want to propose for possible implementation.
Unfortunately, I don't have time to work on any code for this right now,
and I don't foresee having any time for the next one or two months at
least.  But I wanted to get this out there for discussion and refinement
in the hopes that I can come back to this early next year, and possibly
implement something good.

	In thinking about this problem, I found it helpful to consider a
group of masters where one was on the Earth, one was on the Moon, and one
was on Mars.  This scenario has the advantage that there are wildly
varying (and very long) lag times between the masters.  Besides that, the
algorithm we implement should be good enough for that exact scenario, once
humanity gets off its ass and starts colonizing the solar system :)

	Here is how I see the problem:

	When there is more than one master, those masters can be written
to at "the same time", resulting in ambiguity as to which write should be
replicated to all other masters in the group.  I call this a "conflict".

	"The same time" means that some of the replicas in the group will
receive one of the updates first, but other replicas in the group could
receive a differing update first.  In the end, the replicas could have
different information stored for the same DN.

	This can only happen when two different writes come in for a DN
and less than MAX_LAG_TIME seconds has passed.  "MAX_LAG_TIME" is defined
to be the maximum amount of time that it can take for any one master to
talk to another master, i.e., the worst case scenario ping time between
the furthest-apart masters on the network.  If one master is in Australia
and another master is in Finland, that MAX_LAG_TIME could be a few
seconds.  If all masters are in a load-balanced cluster in a datacenter,
that MAX_LAG_TIME could be usecs.  But the point is, if we know that
MAX_LAG_TIME has passed since the last write for a particular DN, then we
know that there is no conflicting write, because enough time has passed
for us to know (for sure) that no other conflicting write has occurred.

	A quick side-note about conflicts: If we get two writes for a DN
within less than MAX_LAG_TIME, it could only be a conflict if both of
those writes are from different masters in our replicated group.  If an
enduser makes two quick changes to the same DN, and we can see that his
I.P. Address is not one of the other masters, we know it's not a conflict,
and we could just commit the writes in-order.  But for the sake of
simplicity, I ignore this for now and treat all potentional conflicts as
conflicts.  But there is room for optimization here in the future.

	Back to the problem: once we have identified two writes for the
same DN as being conflicting (meaning, they came in with less than
MAX_LAG_TIME passing between them), we need to resolve the conflict so
that all masters everywhere write the same data.

	We assume that it doesn't really matter what conflicting data gets
written -- it just has to be universal across all the masters.  That is,
if two endusers change the same DN within MAX_LAG_TIME of each other, we
offer no guarantees as to which one actually gets written.  Frankly, I
don't see how we could offer that guarantee anyway.

	Now, with that assumption, all we need to do is make sure that ONE
of the conflicting writes (once identified) gets written to all the
replicas.  To do that, I propose:

1. Don't write any of the conflicting DN entries until MAX_LAG_TIME has
passed.  This guarantees that all of the masters have submitted all
potential conflicting writes, and furthermore, it guarantees that all
masters have received any writes that we had propagated before we detected
this conflict.

2. Once MAX_LAG_TIME has passed, make an MD5 Digest of the data for each
conflicting write individually.  This is the write's "score".

3. The DN data with the highest score wins.  This means that, after all
masters have received all conflicting writes, they will all create the
same MD5's for each, and thus will agree on which write gets commited to
the database.  All the other DN data gets dropped.

	This method does not depend on any (inaccurate and out-of-sync)
computer clocks, serial numbers, or anything else.  Establishing a
MAX_LAG_TIME for the network can be tricky, but it's the best limiting
factor I can see for determining when we do, or do not, have a conflict.
That is, if one of the masters dies completely, we can't wait forever to
see if there's a conflict--we have to set a timeout somewhere.

	In the real world, we can take the maximum measured ping times and
just multiply them by some tolerance factor, i.e., times 2 (or 10 or
whatever).  An enterprise-level setup will be monitoring this connection
24/7, and will immediately report if the connection gets so unstable that
pings greater than MAX_LAG_TIME are seen.

	There is a chance that the MD5's will be identical for two
different sets of DN data, but I have read before that the Universe is
likely to end before that actually happens.  I'm not a hashing expert, so
I'm not sure.  But we could easily come up with some kind of
conflict-resolution scheme if that happens, such as adding the originating
master's I.P. address to the data before hashing.

	In the end, if we ever receive two writes at "the same time"
(within MAX_LAG_TIME of each other), then every other master in the group
gets to (a) be sure they have all the conflicting writes for
consideration, and (b) choose the same write as the "winner" that gets
committed to the database.

	So that is where I am currently at with my thoughts.  I'm hoping
people can submit corrections, improvements, and refinements so that I
have something to work from if and when I get the time.


Thank You,
Derek Simkowiak