
Re: Antw: Replication through BDB



Hey,

>> Generally the majority can be wrong: assume you have a
>> network failure in a three-node MMR configuration: you update one
>> node while the other two are unreachable. When communication resumes,
>> do you expect the change on the one node to be reverted to match the
>> majority, or should the majority be updated from the one node that
>> has more recent data?
>
> Indeed. In syncrepl, "voting" is irrelevant. Changes will be accepted
> by any provider node that a client can reach. When connectivity is
> restored all nodes will bring each other up to date. In majority-based
> voting, you will lose any writes to the minority node, which leaves
> you with unresolvable inconsistencies. I.e., data is removed but the
> clients believe it was written.

This turns out to be a matter of design choice -- I would not adopt
majority voting without also requiring confirmation from a majority
that a transaction succeeded, which is what BerkeleyDB does.
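A minimal sketch of that quorum-commit idea, with hypothetical names
(this is not BerkeleyDB's actual replication API): the write is
attempted on every replica, and only committed once a majority has
acknowledged it.

```python
# Quorum-commit sketch.  All names here are hypothetical illustrations,
# not BerkeleyDB's replication API.  Replicas are modelled as dicts;
# an unreachable replica raises OSError on write.

class DownReplica(dict):
    """Simulates a replica on the wrong side of a network partition."""
    def __setitem__(self, key, value):
        raise OSError("network partition")

def quorum_write(replicas, key, value):
    """Commit only if a majority of replicas acknowledge the write."""
    acks = 0
    for replica in replicas:
        try:
            replica[key] = value   # in reality, a network round-trip
            acks += 1
        except OSError:
            pass                   # unreachable: no acknowledgement
    majority = len(replicas) // 2 + 1
    return acks >= majority        # True = commit, False = roll back

# Three nodes, one partitioned away: 2 of 3 acks, so the write commits.
replicas = [{}, {}, DownReplica()]
print(quorum_write(replicas, "uid=rick", "some entry"))  # True
```

With two of the three nodes unreachable the same call returns False,
which is the case discussed above: the client learns immediately that
the write did not reach a quorum, instead of finding out later.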

What I'm hearing here is that this "formal" approach adds delay and
gains little in practice -- only the *certainty* that the data has been
stored with the durability level that replication assures.  That
certainty comes at the cost of write latency, and only pays off when
lightning strikes just after a write to one master.

Interestingly, OpenStack Swift takes the same approach -- it commits a
write once it reaches local storage, and replicates asynchronously
afterwards.
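That local-commit-first approach can be sketched the same way (again
with hypothetical names; this is not Swift's implementation): the
client is acknowledged as soon as the local store has the data, and a
background thread pushes the change to peers afterwards.

```python
import queue
import threading

# Local-commit-first sketch.  Hypothetical names, not any real API:
# the write returns after the local commit; replication is deferred.

class LocalFirstNode:
    def __init__(self, peers):
        self.store = {}
        self.peers = peers           # peer replicas, modelled as dicts
        self.backlog = queue.Queue()
        threading.Thread(target=self._replicate, daemon=True).start()

    def write(self, key, value):
        self.store[key] = value        # commit locally: client is done
        self.backlog.put((key, value)) # replication happens later
        return True

    def _replicate(self):
        while True:
            key, value = self.backlog.get()
            for peer in self.peers:
                peer[key] = value      # best-effort; retries omitted
            self.backlog.task_done()
```

The write returns immediately; `backlog.join()` only exists so a
caller can wait for replication to drain, e.g. in a test -- a real
client would never block on it.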

> back-hdb and back-bdb both use BerkeleyDB. BerkeleyDB is now
> deprecated/obsolete, and LMDB is the default backend.

I'm preparing new installations, so I suppose I will get to see it as
the default.

> BDB's replication is page-oriented, so it would consume far more
> network resources than syncrepl. We have never recommended its use.

It was indeed a design consideration that I was weighing.  I think the
trade-off recommended here is clear, and it makes sense.  After all, I
don't flush to disk after every write either.


Thanks,
 -Rick