[Date Prev][Date Next] [Chronological] [Thread] [Top]

Re: multi / standby master: incomplete replication after downtime (?)



On 18/08/2010 16:28, Elmar Marschke wrote:
On 18.08.2010 15:10, Oliver Liebel wrote:
the mmr config was tested many times and should work as expected,
but your clockskew below may be to great for mmr to work as intended

are your servers are physical/paravirt vms or full emulated vms?
if they are full emulated (eg vmware server), you will always run into
major clock skews.

check out / set some of your ntp settings, e.g.
- tinker panic 0
- server <....> minpoll <value> maxpoll <value>

yes, the time thing looks suspect to me also. I will try to get that
better working; thanks for your hints.

But now, here's some log stuff; as announced ;) :

logfile excerpts from MASTER (hostname: ldapmaster) and STANDBY
(hostname: ldapslave). I tried to make it as clearly arranged as
possible by inserting self-explaining (hopefully ;)) tags into the
logfiles. Here's what i did; and where:

- started both machines in a well working state (replication fine in
every detail; freshly converted slapd.conf, freshly replicated). Both
machines up.
- insert tag "===_BEGIN_CHANGES_WHILE_BOTH_UP_===" into logfiles on both
machines (script based)

- MASTER: deleted user hcallahan (using jxplorer)
- MASTER: added user mmouse (using ldapadd)
- MASTER: changed attribute telephone number user vcorleone to
"0800-mafia" (using jxplorer)
- MASTER: insert tag "===_END_CHANGES_WHILE_BOTH_UP_==="

- insert tag "===_END_CHANGES_WHILE_BOTH_UP_===" into logfiles on both
machines

- STANDBY: rcldap stop

- insert tag "===_BEGIN_CHANGES_WHILE_STANDBY_DOWN_==="

- MASTER: delete user jmcclane
- MASTER: add user ggoofy
- MASTER: change telephone number user ckent to "0800-superman"

- insert tag "===_END_CHANGES_WHILE_STANDBY_DOWN_==="
- insert tag "===_BEGIN_STANDBY_COMING_BACK_==="

- STANDBY: rcldap start

- insert tag "===_END_STANDBY_COMING_BACK_==="

The tags are in MASTER's and STANDBY's logfile, so one can compare
what's going on at the same point in time.

The result on STANDBY at the end again is as i wrote before:
STANDBY:
- every operation while it was up was successfully replicated
- the operations while it was down are partially replicated after coming
up again:
-- deletion user jmcclane: OK
-- adding user ggoofy: obviously did not happen; not available on STANDBY
-- changing phone number ckent: did not happen; on STANDBY it's the same
as before (means it differs now from the value it has on MASTER).

Here's the logfile of MASTER:
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
===_BEGIN_CHANGES_WHILE_BOTH_UP_===
Aug 18 15:30:04 ldapmaster slapd[8017]: slap_queue_csn: queing
0x7f00f317b580 20100818133004.663851Z#000000#000#000000
Aug 18 15:30:04 ldapmaster slapd[8017]: slap_graduate_commit_csn:
removing 0x7f00fb4ee230 20100818133004.663851Z#000000#000#000000
Aug 18 15:30:04 ldapmaster slapd[8017]: syncprov_sendresp:
cookie=rid=001,csn=20100818133004.663851Z#000000#000#000000
Aug 18 15:30:04 ldapmaster slapd[8017]: syncprov_sendresp:
cookie=rid=001,csn=20100818133004.663851Z#000000#000#000000
Aug 18 15:30:04 ldapmaster slapd[8017]: do_syncrep2: rid=001
cookie=rid=001,csn=20100818133004.663851Z#000000#000#000000
Aug 18 15:30:04 ldapmaster slapd[8017]: do_syncrep2: rid=001 CSN too
old, ignoring 20100818133004.663851Z#000000#000#000000
Aug 18 15:30:04 ldapmaster slapd[8017]: do_syncrep2: rid=002
cookie=rid=002,csn=20100818133004.663851Z#000000#000#000000
Aug 18 15:30:04 ldapmaster slapd[8017]: do_syncrep2: rid=002 CSN too
old, ignoring 20100818133004.663851Z#000000#000#000000
Aug 18 15:30:25 ldapmaster slapd[8017]: slap_queue_csn: queing
0x7f00f397d060 20100818133025.570081Z#000000#000#000000
Aug 18 15:30:25 ldapmaster slapd[8017]: syncprov_sendresp:
cookie=rid=001,csn=20100818133025.570081Z#000000#000#000000
Aug 18 15:30:25 ldapmaster slapd[8017]: slap_graduate_commit_csn:
removing 0x7f00fb4ed9d0 20100818133025.570081Z#000000#000#000000
Aug 18 15:30:25 ldapmaster slapd[8017]: syncprov_sendresp:
cookie=rid=001,csn=20100818133025.570081Z#000000#000#000000
Aug 18 15:30:25 ldapmaster slapd[8017]: do_syncrep2: rid=001
cookie=rid=001,csn=20100818133025.570081Z#000000#000#000000
Aug 18 15:30:25 ldapmaster slapd[8017]: do_syncrep2: rid=001 CSN too
old, ignoring 20100818133025.570081Z#000000#000#000000
Aug 18 15:30:25 ldapmaster slapd[8017]: do_syncrep2: rid=002
cookie=rid=002,csn=20100818133025.570081Z#000000#000#000000
Aug 18 15:30:25 ldapmaster slapd[8017]: do_syncrep2: rid=002 CSN too
old, ignoring 20100818133025.570081Z#000000#000#000000
Aug 18 15:30:45 ldapmaster slapd[8017]: slap_queue_csn: queing
0x7f00f497e0f0 20100818133045.686887Z#000000#000#000000
Aug 18 15:30:45 ldapmaster slapd[8017]: syncprov_sendresp:
cookie=rid=001,csn=20100818133045.686887Z#000000#000#000000
Aug 18 15:30:45 ldapmaster slapd[8017]: slap_graduate_commit_csn:
removing 0x7f00ec114bd0 20100818133045.686887Z#000000#000#000000
Aug 18 15:30:45 ldapmaster slapd[8017]: syncprov_sendresp:
cookie=rid=001,csn=20100818133045.686887Z#000000#000#000000
Aug 18 15:30:45 ldapmaster slapd[8017]: do_syncrep2: rid=001
cookie=rid=001,csn=20100818133045.686887Z#000000#000#000000
Aug 18 15:30:45 ldapmaster slapd[8017]: do_syncrep2: rid=001 CSN too
old, ignoring 20100818133045.686887Z#000000#000#000000
Aug 18 15:30:45 ldapmaster slapd[8017]: do_syncrep2: rid=002
cookie=rid=002,csn=20100818133045.686887Z#000000#000#000000
Aug 18 15:30:45 ldapmaster slapd[8017]: do_syncrep2: rid=002 CSN too
old, ignoring 20100818133045.686887Z#000000#000#000000
===_END_CHANGES_WHILE_BOTH_UP_===

Clearly, slapd has not taken into account the SID parameters. You can see this in these CSN, the 3rd fields (the 3 digits before the last #) is 000, which is the default SID.

The reason for this is most likely that slapd cannot recognize that "ldapmaster.local.site" refers to the machine it's running on. I can see a few options to solve this:
- configure /etc/hosts so that host --fqdn returns this name
- run slapd with -h ldap://ldapmaster.local.site

After changing it, I recommend a full re-sync of your backup instance.

This should help!

Jonathan
--
==========================================
Jonathan CLARKE
------------------------------------------
Normation
44 rue Cauchy, 94110 Arcueil, France
------------------------------------------
Telephone:  +33 (0)1 83 62 26 96
------------------------------------------
Web:        http://www.normation.com/
==========================================