Issue 8373 - replica with multiple providers hits errors
Summary: replica with multiple providers hits errors
Status: VERIFIED FIXED
Alias: None
Product: OpenLDAP
Classification: Unclassified
Component: slapd (show other issues)
Version: 2.4.44
Hardware: All All
: --- normal
Target Milestone: ---
Assignee: OpenLDAP project
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-02-18 23:01 UTC by Quanah Gibson-Mount
Modified: 2018-10-18 20:32 UTC (History)
0 users

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this issue.
Description Quanah Gibson-Mount 2016-02-18 23:01:41 UTC
Full_Name: Quanah Gibson-Mount
Version: 2.4.44
OS: Linux 3.13
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (75.111.52.177)


I have a MMR setup, with two providers, zre-ldap003 and zre-ldap005.  I have a
single replica, zre-ldap002, which is configured to pull updates from both
providers.  However, it continually gets an error 53 from zre-ldap005 after
writing an update to zre-ldap003.  As far as I can tell, all contextCSN data is
correct on all nodes.

replica:
zimbra@zre-ldap002:~$ ldapsearch -x -LLL -H ldapi:// -D cn=config -w zimbra -s
base -b "" contextCSN
dn:
contextCSN: 20160218220547.523250Z#000000#001#000000
contextCSN: 20160218220604.321502Z#000000#002#000000

Master SID1:
[zimbra@zre-ldap005 ~]$ ldapsearch -x -LLL -H ldapi:// -D cn=config -w zimbra -s
base -b "" contextCSN
dn:
contextCSN: 20160218220547.523250Z#000000#001#000000
contextCSN: 20160218220604.321502Z#000000#002#000000

[zimbra@zre-ldap005 ~]$ ldapsearch -x -LLL -H ldapi:// -D cn=config -w zimbra -s
base -b "cn=accesslog" contextCSN
dn: cn=accesslog
contextCSN: 20160217223749.648918Z#000000#001#000000
contextCSN: 20160218220604.321502Z#000000#002#000000

Master SID2:
[zimbra@zre-ldap003 ~]$ ldapsearch -x -LLL -H ldapi:// -D cn=config -w zimbra -s
base -b "" contextCSN
dn:
contextCSN: 20160218220547.523250Z#000000#001#000000
contextCSN: 20160218220604.321502Z#000000#002#000000

[zimbra@zre-ldap003 ~]$ ldapsearch -x -LLL -H ldapi:// -D cn=config -w zimbra -s
base -b "cn=accesslog" contextCSN
dn: cn=accesslog
contextCSN: 20160218220547.523250Z#000000#001#000000
contextCSN: 20160218220604.321502Z#000000#002#000000
%%0

Note all contextCSNs have the same value everywhere.

However, when the replica attempts to replicate off of zre-ldap005, we get the
following error:

Feb 18 16:50:34 zre-ldap005 slapd[10555]: conn=1191 fd=18 ACCEPT from
IP=10.137.242.52:38979 (IP=10.137.242.55:389)
Feb 18 16:50:34 zre-ldap005 slapd[10555]: conn=1191 op=0 EXT
oid=1.3.6.1.4.1.1466.20037
Feb 18 16:50:34 zre-ldap005 slapd[10555]: conn=1191 op=0 STARTTLS
Feb 18 16:50:34 zre-ldap005 slapd[10555]: conn=1191 op=0 RESULT oid= err=0
etime=0.000071 text=
Feb 18 16:50:34 zre-ldap005 slapd[10555]: conn=1191 fd=18 TLS established
tls_ssf=128 ssf=128 tls_proto=TLSv1.2 tls_cipher=SEED-SHA
Feb 18 16:50:34 zre-ldap005 slapd[10555]: conn=1191 op=1 BIND
dn="uid=zmreplica,cn=admins,cn=zimbra" metho3D3D128
Feb 18 16:50:34 zre-ldap005 slapd[10555]: conn=1191 op=1 BIND
dn="uid=zmreplica,cn=admins,cn=zimbra" mech=SIMPLE ssf=0
Feb 18 16:50:34 zre-ldap005 slapd[10555]: conn=1191 op=1 RESULT tag=97 err=0
etime=0.000154 text=
Feb 18 16:50:34 zre-ldap005 slapd[10555]: conn=1191 op=2 SRCH
base="cn=accesslog" scope=2 deref=0
filter="(&(objectClass=auditWriteObject)(reqResult=0))"
Feb 18 16:50:34 zre-ldap005 slapd[10555]: conn=1191 op=2 SRCH attr=reqDN reqType
reqMod reqNewRDN reqDeleteOldRDN reqNewSuperior entryCSN
Feb 18 16:50:34 zre-ldap005 slapd[10555]: conn=1191 op=2 SEARCH RESULT tag=101
err=53 etime=0.000177 nentries=0 text=consumer state is newer than provider!
Feb 18 16:50:34 zre-ldap005 slapd[10555]: conn=1191 op=3 UNBIND
Feb 18 16:50:34 zre-ldap5 5 slapd[10555]: conn=1191 fd=18 closed


This started happening after I made a modification on zre-ldap005 followed by a
modification on zre-ldap003.  In that case, the following was logged on
zre-ldap002:

Feb 18 16:06:04 zre-ldap002 slapd[11266]: do_syncrep2: rid=101
cookie=rid=101,sid=002,csn=20160218220604.321502Z#000000#002#000000
Feb 18 16:06:04 zre-ldap002 slapd[11266]: slap_queue_csn: queueing 0x3589880
20160218220604.321502Z#000000#002#000000
Feb 18 16:06:04 zre-ldap002 slapd[11266]: do_syncrep2: rid=100
cookie=r%3=100,sid=001,csn=20160218220604.321502Z#000000#002#000000
Feb 18 16:06:04 zre-ldap002 slapd[11266]: slap_graduate_commit_csn: removing
0x3589880 20160218220604.321502Z#000000#002#000000
Feb 18 16:06:04 zre-ldap002 slapd[11266]: syncrepl_message_to_op: rid=101
be_modify cn=admins,cn=zimbra (0)
Feb 18 16:06:04 zre-ldap002 slapd[11266]: slap_queue_csn: queueing 0x3589740
20160218220604.321502Z#000000#002#000000
Feb 18 16:06:04 zre-ldap002 slapd[11266]: slap_graduate_commit_csn: removing
0x3589740 20160218220604.321502Z#000000#002#000000
Feb 18 16:06:04 zre-ldap002 slapd[11266]: do_syncrep2: rid=100 CSN pending,
ignoring 20160218220604.321502Z#000000#002#000000
(reqStart=20160218220604.326527Z,cn=accesslog)


So we see the write accepted from one master, and correctly rejected (not
replayed) from the other.

After this, there were no more modifications made to either master.  However, I
*did* make a modification to cn=config on the replica.  After that was done, is
when the error(53) started getting triggered:

Feb 18 16:27:12 zre-ldap002 slapd[11266]: slap_queue_csn: queueing 0x3587a40
20160218222712.155976Z#000000#000#000000
Feb 18 16:27:12 zre-ldap002 slapd[11266]: slap_graduate_commit_csn: removing
0x3587a40 20160218222712.155976Z#000000#000#000000
Feb 18 16:27:12 zre-ldap002 slapd[11266]: do_syncrep2: rid=100
LDAP_RES_SEARCH_RESULT
Feb 18 16:27:12 zre-ldap002 slapd[11266]: do_syncrep2: rid=100
LDAP_RES_SEARCH_RESULT (53) Server is unwilling to perform
Feb 18 16:27:12 zre-ldap002 slapd[11266]: do_syncrep2: rid=100 (53) Server is
unwilling to perform
Feb 18 16:27:12 zre-ldap002 slapd[11266]: do_syncrepl: rid=100 rc -2 retrying
Comment 1 Quanah Gibson-Mount 2016-02-18 23:38:51 UTC
--On Thursday, February 18, 2016 11:01 PM +0000 quanah@openldap.org wrote:


After stopping all servers, and restarting the masters, I see the same 
error (53) being returned when zre-ldap003 tries to connect to zre-ldap005. 
So this is not specific to the replica or the change to cn=config I made on 
the replica.

--Quanah

--

Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra ::  the leader in open source messaging and collaboration
A division of Synacor, Inc

Comment 2 Quanah Gibson-Mount 2016-02-19 00:37:06 UTC
--On Thursday, February 18, 2016 11:01 PM +0000 quanah@openldap.org wrote:


> Master SID1:
> [zimbra@zre-ldap005 ~]$ ldapsearch -x -LLL -H ldapi:// -D cn=config -w
> zimbra -s base -b "" contextCSN
> dn:
> contextCSN: 20160218220547.523250Z#000000#001#000000
> contextCSN: 20160218220604.321502Z#000000#002#000000
>
> [zimbra@zre-ldap005 ~]$ ldapsearch -x -LLL -H ldapi:// -D cn=config -w
> zimbra -s base -b "cn=accesslog" contextCSN
> dn: cn=accesslog
> contextCSN: 20160217223749.648918Z#000000#001#000000
> contextCSN: 20160218220604.321502Z#000000#002#000000

Ooops, clearly a difference here between the main db and the accesslog DB.

However, nothing was logged, error wise.  Here is this modification:

Feb 18 16:05:47 zre-ldap005 slapd[10555]: slap_queue_csn: queueing 
0x3ce84c0 20160218220547.523250Z#000000#001#000000
Feb 18 16:05:47 zre-ldap005 slapd[10555]: slap_graduate_commit_csn: 
removing 0x3ce84c0 20160218220547.523250Z#000000#001#000000
Feb 18 16:05:47 zre-ldap005 slapd[10555]: syncprov_sendresp: 
cookie=rid=100,sid=001
Feb 18 16:05:47 zre-ldap005 slapd[10555]: syncprov_sendresp: to=002, 
cookie=rid=100,sid=001


On the other master, we have:


Feb 18 16:05:47 zre-ldap003 slapd[6947]: do_syncrep2: rid=100 
cookie=rid=100,sid=001
Feb 18 16:05:47 zre-ldap003 slapd[6947]: slap_queue_csn: queueing 0x36d58c0 
20160218220547.523250Z#000000#001#000000
Feb 18 16:05:47 zre-ldap003 slapd[6947]: slap_queue_csn: queueing 0x36d5800 
20160218220547.523250Z#000000#001#000000
Feb 18 16:05:47 zre-ldap003 slapd[6947]: syncprov_matchops: skipping 
original sid 001
Feb 18 16:05:47 zre-ldap003 slapd[6947]: slap_graduate_commit_csn: removing 
0x36d5800 20160218220547.523250Z#000000#001#000000
Feb 18 16:05:47 zre-ldap003 slapd[6947]: slap_graduate_commit_csn: removing 
0x36d58c0 20160218220547.523250Z#000000#001#000000
Feb 18 16:05:47 zre-ldap003 slapd[6947]: syncrepl_message_to_op: rid=100 
be_modify cn=admins,cn=zimbra (0)
Feb 18 16:05:47 zre-ldap003 slapd[6947]: syncprov_sendresp: 
cookie=rid=101,sid=002,csn=20160218220547.523250Z#000000#001#000000



In the accesslog, we have:

dn: reqStart=20160218220547.523064Z,cn=accesslog
objectClass: auditModify
structuralObjectClass: auditModify
reqStart: 20160218220547.523064Z
reqEnd: 20160218220547.523558Z
reqType: modify
reqSession: 1027
reqAuthzID: cn=config
reqDN: cn=admins,cn=zimbra
reqResult: 0
reqMod: description:= admin accounts
reqMod: entryCSN:= 20160218220547.523250Z#000000#001#000000
reqMod: modifiersName:= cn=config
reqMod: modifyTimestamp:= 20160218220547Z
reqEntryUUID: c3fbf53a-6a0c-1035-9563-ef445d2cc3b6
entryCSN: 20160218220547.523250Z#000000#001#000000
entryUUID: 822e1680-6ad7-1035-8bac-2751986a6fb8
creatorsName: cn=config
createTimestamp: 20160218220547Z
modifiersName: cn=config
modifyTimestamp: 20160218220547Z

--Quanah


--

Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra ::  the leader in open source messaging and collaboration
A division of Synacor, Inc

Comment 3 Quanah Gibson-Mount 2016-02-19 01:33:45 UTC
--On Friday, February 19, 2016 12:37 AM +0000 quanah@zimbra.com wrote:

> --On Thursday, February 18, 2016 11:01 PM +0000 quanah@openldap.org wrote:

So I can trivially reproduce this.  I make a change on each master, and 
boom, they go out of sync and everything's broken.

--Quanah

--

Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra ::  the leader in open source messaging and collaboration
A division of Synacor, Inc

Comment 4 Quanah Gibson-Mount 2017-04-12 16:56:40 UTC
moved from Incoming to Software Bugs
Comment 5 OpenLDAP project 2018-10-18 20:29:03 UTC
Fixed by some other fix since the original report
Comment 6 Quanah Gibson-Mount 2018-10-18 20:29:03 UTC
changed notes
changed state Open to Closed
Comment 7 Quanah Gibson-Mount 2018-10-18 20:32:57 UTC
--On Friday, February 19, 2016 1:33 AM +0000 quanah@zimbra.com wrote:



Played around with this again today, and the issue no longer occurs.  It 
seems one of the many replication related fixes since I reported this ITS 
has solved the problem.  Closing the ITS.

--Quanah



--

Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
<http://www.symas.com>