
Re: Multimaster further work



On Fri, Oct 25, 2002 at 12:49:44AM -0700, Howard Chu wrote:

> The issues here aren't much different from the concurrency issues in an SMP
> operating system. Any time the possibility for contention arises, you have to
> explicitly serialize your processing.

And later:

> The idea here is that as long as you can fully serialize your updates,
> everything is OK. As soon as serialization is broken, all bets are off,
> you're back to the same consistency problems as the current multimaster
> approach. E.g., you have servers A and B, A is the master. The network
> between A and B fails, but both servers are still running. B promotes itself
> to master and accepts updates while the network is down. When the network is
> restored, you get to see the mess that's been made. I suggest that this may
> require manual intervention; long sequences of updates that depend on each
> other may have already been made so it's not just a matter of looking at
> timestamps or arbitrarily choosing which update to keep. Again, this isn't a
> huge problem, tools like CVS deal with this all the time. Lots of times CVS
> can merge separately patched source trees without any issue, sometimes it
> finds a conflict and you have to fix it by hand.

Two very important points there: global serialisation, and the
possibility of network partition.

This leads to rather different results in the two cases that we seem
to be discussing:

a)	Clustered servers behind a load balancer, assumed to be all in
	one location and on a very fast local network. This is done
	for performance and can also provide single-point HA.

b)	Servers distributed around the world (or solar system) to
	provide quick response to local clients while also providing
	continuity of service during network outages.

In case (a) we can assume that network partition between servers is
extremely unlikely (or at least, that if it happens, the partitioned
servers will no longer be talking to clients). We can also assume that
the time for messages to propagate from one server to another is
finite and 'small' so the arguments based on MAX_LAG_TIME may be
usable. DEC's VAXclusters worked on this principle, and enforced it
with a quorum rule: if any server lost contact with half the others or
more, it was required to reboot instantly and not to perform further
work until a quorum had been re-established.
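
The quorum rule itself is just a majority test. A sketch of the
general idea in Python (this is my illustration, not DEC's actual
mechanism, and the names are invented):

def has_quorum(peers_reachable, cluster_size):
    # True if this server, counting itself, can still see a strict
    # majority of the cluster, so it cannot be on the minority side
    # of a partition.
    return 2 * (peers_reachable + 1) > cluster_size

# A node that loses quorum must stop work (VAXclusters rebooted)
# until it can see a majority again:
#     if not has_quorum(len(reachable_peers), CLUSTER_SIZE):
#         halt()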

In case (b) there are very few useful assumptions to be made. In
particular, it is almost certain that there will be network partitions
with active clients and servers on both sides of the partition. No
useful value could be chosen for MAX_LAG_TIME in this situation.

If we want to give absolute assurance of database consistency across all
servers, it is essential that all servers receive and acknowledge each
update *before the result of the operation is returned to the client*.
Even then, there is a possibility of divergence: each server should
only commit the update if all the others agree to do so, and the
message confirming this might get lost. This is a "three army
problem", so we know that it cannot be done with absolute assurance,
but if the probability of message loss is known we can design
protocols to achieve any given probability of correctness:

	http://www.cs.concordia.ca/~teaching/comp445/fall97/assigns/3army.htm
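
As a crude illustration of that last point (my own arithmetic, not
anything taken from that page): if each message is lost independently
with probability p, sending it up to n times leaves a residual
failure probability of p**n, so n can be chosen to hit any target:

import math

def transmissions_needed(p_loss, target_failure):
    # Smallest n with p_loss**n <= target_failure,
    # assuming independent message losses.
    return math.ceil(math.log(target_failure) / math.log(p_loss))

# e.g. 5% chance of losing any one message, one-in-a-million
# acceptable chance of the confirmation never arriving:
print(transmissions_needed(0.05, 1e-6))    # -> 5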

A multi-master system based on this rule might be appropriate in the
HA scenario: probability of message loss is low, message latency is
low, and we want all servers to be created equal so that fail-over is
not an issue. We can deal with a server going down by maintaining
keep-alive connections, and requiring a full consistency check on any
server about to re-join a cluster before it starts accepting client
connections.
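
For concreteness, the update path based on that rule might look
something like this (a Python sketch only; prepare/commit/abort and
the peer objects are invented names, nothing here comes from slapd):

def synchronous_update(dn, change, peers):
    # Phase 1: every other master must agree before anyone commits.
    if not all(peer.prepare(dn, change) for peer in peers):
        for peer in peers:
            peer.abort(dn, change)   # some peer refused or is unreachable
        return "failure: update not applied anywhere"

    # Phase 2: commit everywhere, then answer the client.  A commit
    # message lost here is exactly the "three army" gap noted above.
    for peer in peers:
        peer.commit(dn, change)
    return "success: result applies to all copies"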

In the widely-distributed case, it is not possible to provide good
client response times for updates and also strong consistency guarantees. 
A choice has to be made depending on the application:

1)	Where quick updates are more important than global
	consistency, a distributed multi-master approach is
	appropriate.

2)	Where global consistency is essential, a single-master
	approach is better. This is because a multi-master update
	operation must wait until all servers respond, so any one
	server being unreachable will block updates (possibly for
	weeks!)

We already know how to do (2). (1) can be done very simply if we
really do not care about the consistency issue (which in many cases is
a non-issue in practice due to the pattern of use).

Here is another take on the clash-detection algorithm:

Suppose we maintain an MD5 hash of each entry in the database. Now,
when an update is made, the server handling it creates a message to
all other masters containing the updated data plus the old and new MD5
hashes. All servers receiving the update first check that the old hash
matches the one they already have. If the hashes match, the update
proceeds. If the hashes differ, a conflict is signalled.

Some non-conflicting changes to an entry can still be detected and
allowed: compute the hash of the entry as it would be if the update
were applied locally, and compare it with the new hash in the
message; if they match, the update can safely go through.
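
In other words, each receiving master might do something like this (a
sketch only; the dict-of-attributes entry representation and the
canonical serialisation are my assumptions):

import hashlib, json

class ConflictError(Exception):
    pass

def entry_hash(entry):
    # Canonical serialisation of the entry, then MD5, as in the
    # scheme described above.
    return hashlib.md5(json.dumps(entry, sort_keys=True).encode()).hexdigest()

def apply_replicated_update(local_entry, changes, old_hash, new_hash):
    # changes: attribute replacements from the originating master;
    # old_hash/new_hash: that master's hashes of the entry before and
    # after its own update.
    candidate = dict(local_entry, **changes)
    if entry_hash(local_entry) == old_hash:
        return candidate            # copies agreed: the normal case
    if entry_hash(candidate) == new_hash:
        return candidate            # divergent copies, same end result
    raise ConflictError("old hashes differ and the results would diverge")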

Now, what should happen in the case of a conflict? It may not be seen
at all masters (different propagation delays on different routes), so
the only safe response is to undo the change everywhere. Implementing
this safely requires updates to use a two-phase commit. It also means
that no updates can proceed unless all masters are reachable, which is
why I say that multi-master is OK in HA clusters. If we are prepared
to accept some inconsistencies in the data, then it is probably best
to just generate messages to human admins when conflicts are detected.
The chances are that an overriding update done later will not clash
with another simultaneous change, but it *will* need to be marked as an
override in the protocol because it must be applied to copies of the
entry that already differ from one server to another.
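
Continuing the sketch above, the conflict path might look roughly
like this; the 'override' flag and the write-lock set are my own
inventions for illustration, not part of any existing protocol:

def receive_update(local_entry, update, write_locked, notify_admin):
    # write_locked: set of DNs held pending manual resolution;
    # notify_admin: any callable that alerts a human.
    if update.get("override"):
        write_locked.discard(update["dn"])   # manual fix-up: apply it even
        return dict(local_entry, **update["changes"])  # though copies differ
    try:
        return apply_replicated_update(local_entry, update["changes"],
                                       update["old_hash"], update["new_hash"])
    except ConflictError:
        write_locked.add(update["dn"])       # hold entry until an override arrives
        notify_admin("conflicting update for %s" % update["dn"])
        raise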

Summary:

I think multi-master HA clusters should use a new propagation
mechanism based on two-phase commit to all masters synchronously so
that when the client gets the result status it applies to *all*
copies.

Widely spread networks of servers must choose either single-master
operation or accept manual conflict resolution with entries remaining
write-locked until the override message is received.

There are three modes of operation here. Only one of them is currently
supported in the code. (Strictly there is a fourth mode, which is
supported by the current multi-master code: allow multi-master
operation and take no steps to detect or correct clashing updates.)

Andrew
-- 
-----------------------------------------------------------------------
|                 From Andrew Findlay, Skills 1st Ltd                 |
| Consultant in large-scale systems, networks, and directory services |
|     http://www.skills-1st.co.uk/                +44 1628 782565     |
-----------------------------------------------------------------------