Re: Troubleshooting synchronization



On Wed, Nov 11, 2009 at 3:01 PM, Howard Chu <hyc@symas.com> wrote:
> Edward Capriolo wrote:
>> On Thu, Nov 5, 2009 at 5:25 AM, Torsten Schlabach (Tascel eG)
>> <tschlabach@tascel.net> wrote:
>>> Hi Quanah!
>>>
>>>> I suggest you go read the CHANGES log for what has been fixed between
>>>> 2.4.11 and the latest stable 2.4.19.
>>>
>>> I need to say, it worries me a bit that for problems with a core feature
>>> which has been around for quite some time, the answer is more often than
>>> I like to hear: You need to use the latest version released last week /
>>> month or so.
>>>
>>> I have indeed read the CHANGES and seen that some issues have been
>>> fixed. I have no idea if we are affected by those issues or not.
>>>
>>> Also how would I know that *now* in 2.4.19 all problems are fixed and
>>> the answer next week won't be: You need to use 2.4.20.
>>>
>>> But as this is a FOSS project and not a product we pay for, we
>>> understand that we should not blame people but try and help if we find
>>> a problem.
>>>
>>> For that reason I have asked in my email for help on *understanding* and
>>> *diagnosing* problems to have a chance to contribute in case we will
>>> find any new issues.
>>>
>>> Also our customers may not like it if in case of a problem we tell them:
>>> Let's wait if in some weeks a new release will come which will fix it or
>>> not. So I'd rather be in a position to get my hands dirty myself in case
>>> of problems.
>>>
>>> Regards,
>>> Torsten
>>>
>>>
>>> Quanah Gibson-Mount schrieb:
>>>> --On Wednesday, November 04, 2009 1:12 PM +0100 "Torsten Schlabach
>>>> (Tascel eG)" <tschlabach@tascel.net> wrote:
>>>>
>>>>> Hi all!
>>>>>
>>>>> I am currently trying to chase some problems in an n-way multi-master
>>>>> setup with three servers. We have used the instructions at
>>>>>
>>>>> http://www.openldap.org/doc/admin24/replication.html#N-Way%20Multi-Master
>>>>>
>>>>> as our guidance and we are using OpenLDAP version 2.4.11.
>>>>
>>>> I suggest you go read the CHANGES log for what has been fixed between
>>>> 2.4.11 and the latest stable 2.4.19.
>>>>
>>>> --Quanah
>>>>
>>>> --
>>>>
>>>> Quanah Gibson-Mount
>>>> Principal Software Engineer
>>>> Zimbra, Inc
>>>> --------------------
>>>> Zimbra ::  the leader in open source messaging and collaboration
>>>
>>
>>>> Also how would I know that *now* in 2.4.19 all problems are fixed and
>>>> the answer next week won't be: You need to use 2.4.20.
>>
>> Testing reveals the presence of bugs, not the absence :)  So no one
>> can ever say version x.y.z is certified bug free.
>>
>> However, I do tend to agree, in that my MM just flaked out, and there
>> is not much load/write/update going on so I am a bit worried.
>>
>> I am not trying to put down OpenLDAP but iplanet/fedora directory
>> server/389 support up to a 4 way MM implementation and I have found
>> the replication rock solid even under high load. So if MM is your
>> requirement that may be a more valid option.
>
> The historical evidence disagrees with your assertion. Even at this late date,
> FDS MMR still breaks irrecoverably.
>
> https://www.redhat.com/archives/fedora-directory-users/2009-November/msg00056.html
>
>
> How many years have they been flogging this feature? They still haven't got it
> right. They can't.
>
> MMR is inherently flawed, as we have been saying for years.
>
> http://www.watersprings.org/pub/id/draft-zeilenga-ldup-harmful-02.txt
>
> We have implemented it in OpenLDAP mainly for political reasons, not because
> we changed our minds and now believe it to be technically sound. It is not. We
> developed and recommend MirrorMode because the only safe way to do replication
> is by preserving single-master consistency.
>
>>>> The answer is quite simple: do not use multimaster replication in a
>>>> production environment. In most cases the requirement for multimaster
>>>> replication is just based on poor directory design.
>>
>> Dieter, I do not agree with that. You can't blame a user for using a
>> feature. It is not marked as experimental anymore so people are going
>> to use it. Once it fails you can't call them a "Poor Directory
>> Designer" for using it.
>>
>> http://www.openldap.org/faq/data/cache/1240.html
>
> If they have implemented MMR without reading all of the warnings, they are
> certainly poor designers for not becoming fully informed of the topic before
> deploying it. If they have implemented MMR after reading all of the warnings,
> they made a conscious choice.
>
> --
>  -- Howard Chu
>  CTO, Symas Corp.           http://www.symas.com
>  Director, Highland Sun     http://highlandsun.com/hyc/
>  Chief Architect, OpenLDAP  http://www.openldap.org/project/
>

I understand that OpenLDAP does not do distributed locking; as a
result, I do not expect it to have ACID compliance.

Fedora Directory Server/389 has a "last update wins" policy, which is
a much more optimistic strategy, but it worked (for what I was
doing).
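
To make "last update wins" concrete, here is a minimal sketch in
Python. It is not FDS code: the Write type, the timestamp field, and
the server_id tie-breaker are assumptions made up for illustration. It
only shows the general idea that, when two masters change the same
attribute, the later change survives and the earlier one is dropped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    """One change to a single attribute (hypothetical type)."""
    value: str
    timestamp: float  # when the originating master accepted the write
    server_id: int    # assumed tie-breaker for equal timestamps


def last_update_wins(a: Write, b: Write) -> Write:
    """Pick the surviving write when two masters changed the same attribute."""
    return max(a, b, key=lambda w: (w.timestamp, w.server_id))


# Two masters accept conflicting writes before replication catches up.
on_master_1 = Write("alice@old.example", timestamp=100.0, server_id=1)
on_master_2 = Write("alice@new.example", timestamp=100.5, server_id=2)

winner = last_update_wins(on_master_1, on_master_2)
print(winner.value)
```

No locking is involved and the losing write is discarded without any
error being returned to the client that made it, which is why I call
the strategy optimistic.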

Since I joined this mailing list about a month ago, after my problems
started, I have seen at least 4 other threads with similar issues.

http://www.openldap.org/lists/openldap-software/200911/msg00015.html
http://www.openldap.org/lists/openldap-software/200911/msg00021.html
...

Upgrading to 2.4.19 is suggested as a resolution, but I found another
thread reporting a bigger problem in that version.


As to the link you have posted:
https://www.redhat.com/archives/fedora-directory-users/2009-November/msg00056.html

It is very easy to quickly search a mailing list and find some people
having problems with software. That does not prove FDS has many MM
problems. I personally ran a two-node FDS instance with very active
WRITE/UPDATE traffic for two years and had only a few isolated problems.


>> If they have implemented MMR without reading all of the warnings,
>> they are certainly poor designers for not becoming fully informed of the topic before deploying it.

From my perspective, I find the reliability of M-M OpenLDAP on 2.4.16
brittle. I am not the only one having problems. Your comment seems to
suggest I did not read enough. I would upgrade to 2.4.19, but someone
else on this list is having problems with that version, so that does
not seem like a safe option.

Since I installed OpenLDAP on two lightly trafficked nodes:
1) One node locked up
2) After the lockup/restart the nodes did not re-establish the two-way
replication connection
3) I have out-of-sync data (which I do not believe was added during
the downtime caused by 1)
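
For anyone who wants to check point 3 on their own servers, one
approach is to compare the contextCSN values each server holds on the
suffix entry. The sketch below is only an illustration, not a tool
from the OpenLDAP distribution. It assumes the values have already
been collected by hand (for example with a base-scope search for
contextCSN on each server), and the host names and CSN values are made
up. It relies on the CSN layout timestamp#count#serverID#mod.

```python
def server_id(csn: str) -> str:
    """Return the server ID field of a CSN (timestamp#count#sid#mod)."""
    _timestamp, _count, sid, _mod = csn.split("#")
    return sid


def csn_differences(servers: dict[str, list[str]]) -> dict[str, dict[str, str]]:
    """Map each server ID to the CSN every server holds, where they disagree."""
    per_sid: dict[str, dict[str, str]] = {}
    for host, csns in servers.items():
        for csn in csns:
            per_sid.setdefault(server_id(csn), {})[host] = csn

    diffs: dict[str, dict[str, str]] = {}
    for sid, seen in per_sid.items():
        # A missing value counts as a disagreement as well.
        if len({seen.get(host) for host in servers}) > 1:
            diffs[sid] = {host: seen.get(host, "<missing>") for host in servers}
    return diffs


# Hypothetical values copied from two servers.
collected = {
    "ldap1": ["20091111150100.000000Z#000000#001#000000",
              "20091111150230.000000Z#000000#002#000000"],
    "ldap2": ["20091111150100.000000Z#000000#001#000000",
              "20091111145900.000000Z#000000#002#000000"],
}

diffs = csn_differences(collected)
for sid, by_host in diffs.items():
    print(f"server ID {sid} differs: {by_host}")
```

A difference can be transient while a change is still propagating, so
it only says something if it persists after the writes have stopped.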

Linking to an RFC and implying that I "Don't read enough" is wrong. If
my light usage is bringing obvious bugs to light, and I am not the
only one having these issues, then not enough testing is being done
on the software development side.

As an administrator I ran 'make test' and watched
test050-syncrepl-multimaster complete. That, coupled with the fact that
multi-master is no longer labeled as an "experimental" feature,
led me to believe it worked reasonably well.

The RFC makes no mention of my #2 problem, 'After lockup/restart the
nodes did not re-establish two way replication connection'. Is that
supposed to be the fault of the user? This is obviously a bug or an
edge case, not the fault of a user not reading enough. That, I think,
is where the frustration lies. People are willing to accept the
failure cases covered in the RFC, but the RFC is not a blanket
statement of "WE told you not to run this" for every bug that appears.