
Re: mmr pair stops replicating: "consumer state is newer than provider"

To: openldap-technical@openldap.org
Subject: Re: mmr pair stops replicating: "consumer state is newer than provider"
From: btb <btb@bitrate.net>
Date: Wed, 5 Jul 2017 00:39:56 -0400

wow, that's a mess.
So #000# is serverID 0, which would be for any entries prior to moving to MMR. The fact that you have different values for #000# on the dsa1 accesslog vs the other 3 databases is disturbing.
It would appear DSA1 is serverID 1, and its CSNs make sense:

20170530214415.204052Z#000000#001#000000
20170530214415.204052Z#000000#001#000000
However, there's something seriously wrong with dsa2 (assuming it is serverID 2):
20170521175113.974560Z#000000#002#000000
20170619014933.531051Z#000000#002#000000
This implies the primary DB received a write on 2017/06/19 @ 01:49:33, but the accesslog has not recorded this change: it says the last write op to the accesslog DB on #002# was 2017/05/21 @ 17:51:13, nearly a month earlier. So it doesn't seem to think you've done a write op directly against serverID 002.
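
The comparison described above can be sketched in shell. This is only an illustration, assuming the CSN layout `timestamp#count#sid#mod` seen in the values quoted here; the function names `csn_field` and `accesslog_is_behind` are hypothetical helpers, not part of any OpenLDAP tool.

```shell
# Print one '#'-separated field of a CSN.
# $1 = CSN, $2 = field number (1=timestamp, 2=count, 3=serverID, 4=mod)
csn_field() {
    printf '%s\n' "$1" | cut -d'#' -f"$2"
}

# Return 0 if the accesslog CSN is older than the primary DB CSN,
# 1 if it is not, and 2 if the two CSNs carry different serverIDs.
# $1 = CSN from the primary DB, $2 = CSN from the accesslog DB
accesslog_is_behind() {
    primary=$1
    accesslog=$2
    # Only CSNs from the same serverID are compared here.
    [ "$(csn_field "$primary" 3)" = "$(csn_field "$accesslog" 3)" ] || return 2
    [ "$primary" != "$accesslog" ] || return 1
    # CSNs are fixed-width, so a byte-wise sort orders them by time.
    oldest=$(printf '%s\n%s\n' "$primary" "$accesslog" | LC_ALL=C sort | head -n 1)
    [ "$oldest" = "$accesslog" ]
}

# The two #002# values quoted above.
if accesslog_is_behind '20170619014933.531051Z#000000#002#000000' \
                       '20170521175113.974560Z#000000#002#000000'; then
    echo "accesslog is behind the primary DB for this serverID"
fi
```

Sorting the raw strings avoids parsing the timestamp at all, which works only because every field in these CSNs has a fixed width.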

thanks. i think i've managed to clean up the mess, and replication is flowing again. i've exorcized the old serverid 000 references, and verified each server's accesslog is getting updated as local modifications occur.


contextcsns seem to be a bit more sane now, hopefully?

>ldapsearch -ZZxWLLLH 'ldap://dsa1.example.org/' -D'uid=dit_admin,ou=role_accounts,ou=accounts,dc=example,dc=org' -b'cn=config' -s base 'olcserverid'

Enter LDAP Password:
dn: cn=config
olcServerID: 1

>ldapsearch -ZZxWLLLH 'ldap://dsa2.example.org/' -D'uid=dit_admin,ou=role_accounts,ou=accounts,dc=example,dc=org' -b'cn=config' -s base 'olcserverid'

Enter LDAP Password:
dn: cn=config
olcServerID: 2

>ldapsearch -ZZxWLLLH 'ldap://dsa1.example.org/' -D'uid=dit_admin,ou=role_accounts,ou=accounts,dc=example,dc=org' -b'dc=example,dc=org' -s base 'contextcsn'

Enter LDAP Password:
dn: dc=example,dc=org
contextCSN: 20170705042207.590054Z#000000#001#000000
contextCSN: 20170704183515.872465Z#000000#002#000000

>ldapsearch -ZZxWLLLH 'ldap://dsa2.example.org/' -D'uid=dit_admin,ou=role_accounts,ou=accounts,dc=example,dc=org' -b'dc=example,dc=org' -s base 'contextcsn'

Enter LDAP Password:
dn: dc=example,dc=org
contextCSN: 20170705042207.590054Z#000000#001#000000
contextCSN: 20170704183515.872465Z#000000#002#000000

>ldapsearch -ZZxWLLLH 'ldap://dsa1.example.org/' -D'uid=dit_admin,ou=role_accounts,ou=accounts,dc=example,dc=org' -b'cn=accesslog' -s base 'contextcsn'

Enter LDAP Password:
dn: cn=accesslog
contextCSN: 20170705042145.957972Z#000000#001#000000
contextCSN: 20170704183515.872465Z#000000#002#000000

>ldapsearch -ZZxWLLLH 'ldap://dsa2.example.org/' -D'uid=dit_admin,ou=role_accounts,ou=accounts,dc=example,dc=org' -b'cn=accesslog' -s base 'contextcsn'

Enter LDAP Password:
dn: cn=accesslog
contextCSN: 20170705042145.957972Z#000000#001#000000
contextCSN: 20170704183515.872465Z#000000#002#000000
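
The side-by-side check above could be scripted roughly as follows. This is a sketch, assuming each ldapsearch result has been saved to a file first; `extract_csns`, `csns_match`, and the file names are hypothetical.

```shell
# Read ldapsearch LDIF output on stdin and print its contextCSN values, sorted.
extract_csns() {
    sed -n 's/^contextCSN: //p' | LC_ALL=C sort
}

# Succeed only if both LDIF files carry the same set of contextCSN values.
# $1, $2 = files holding saved ldapsearch output for the same database
csns_match() {
    [ "$(extract_csns < "$1")" = "$(extract_csns < "$2")" ]
}
```

For example, after redirecting the two 'dc=example,dc=org' searches to dsa1.ldif and dsa2.ldif, `csns_match dsa1.ldif dsa2.ldif && echo "in sync"` reports whether both servers hold the same values for that database.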

i've also increased accesslog data retention from 7 days to 14 days, as a bit of compensation for the infrequent writes, and i'll implement a "no-op" cron job as well, as a fail-safe. are there any pitfalls i may not be considering with a 14 day accesslog retention period? is that too long according to "typical" consensus?
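
One possible shape for such a "no-op" job, as a sketch only: it builds an LDIF modify that rewrites an attribute on a dedicated entry so that each master records a fresh local write. The entry `cn=heartbeat,dc=example,dc=org`, the use of `description`, the password file path, and the function name `noop_ldif` are all assumptions, not something from this thread.

```shell
# Print an LDIF modify that replaces 'description' on the given DN with a
# timestamped value, so every run is a real write.
# $1 = DN of a dedicated entry (hypothetical; its object class must allow
#      the description attribute)
noop_ldif() {
    printf 'dn: %s\nchangetype: modify\nreplace: description\ndescription: replication heartbeat %s\n' \
        "$1" "$(date -u +%Y%m%d%H%M%SZ)"
}

# Intended use from cron, once per master (left commented out because it
# needs a live server and a password file, both assumed here):
# noop_ldif 'cn=heartbeat,dc=example,dc=org' |
#     ldapmodify -ZZ -x -H 'ldap://dsa1.example.org/' \
#         -D 'uid=dit_admin,ou=role_accounts,ou=accounts,dc=example,dc=org' \
#         -y /path/to/password_file
```

Running it against each master directly, rather than relying on replication to carry one write to both, is what keeps every serverID's CSN moving.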

for posterity's sake: after the mess was cleaned up, once a proper write occurred on each master and the accesslog db was updated and the csns brought in line, replication began flowing again, without the need for a restart on either side [at least in this particular case, anyway].


-ben
