Issue 7274 - delta-syncrepl MMR infinite loop
Summary: delta-syncrepl MMR infinite loop
Status: VERIFIED FIXED
Alias: None
Product: OpenLDAP
Classification: Unclassified
Component: documentation (show other issues)
Version: 2.4.31
Hardware: All All
: --- normal
Target Milestone: ---
Assignee: OpenLDAP project
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-05-16 22:27 UTC by Quanah Gibson-Mount
Modified: 2012-05-30 02:41 UTC (History)
0 users

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this issue.
Description Quanah Gibson-Mount 2012-05-16 22:27:43 UTC
Full_Name: Quanah Gibson-Mount
Version: 2.4.31
OS: Linux 2.6
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (75.108.184.39)


The following problem did not occur for me in previous OpenLDAP releases. 
However, I can reproduce it 100% of the time with OpenLDAP 2.4.31.

Basic configuration scenario is as follows:

Set up a stand alone OpenLDAP server -- Just cn=config and primary DB.

Basic database for the server is created (note all entries now have SID value
0)

Update the server to be delta-sync MMR -- Set sid=1, Mirror Mode, and add
accesslog DB.  Add syncrepl stanza for a second server

Install second server as a delta-sycnrepl MMR pair -- cn=config, accesslog DB,
primary DB, SID=2

Servers now go into infinite loops due to the SID=0 entries.

Server SID=1 logs:

May 16 14:33:24 zre-ldap002 slapd[17422]: do_syncrep2: rid=100
cookie=rid=100,sid=002,csn=20120516212017.543213Z#000000#000#000000
May 16 14:33:24 zre-ldap002 slapd[17422]: slap_queue_csn: queing 0x58d2ced
20120516212017.543213Z#000000#000#000000
May 16 14:33:24 zre-ldap002 slapd[17422]: slap_graduate_commit_csn: removing
0x4699f50 20120516212017.543213Z#000000#000#000000
May 16 14:33:24 zre-ldap002 slapd[17422]: syncrepl_message_to_op: rid=100 be_add
cn=zimbra (68)
May 16 14:33:24 zre-ldap002 slapd[17422]: do_syncrep2: rid=100 delta-sync lost
sync on (reqStart=20120516213237.000019Z,cn=accesslog), switching to REFRESH
May 16 14:33:25 zre-ldap002 slapd[17422]: syncrepl_message_to_entry: rid=100 DN:
cn=zimbra, UUID: add167b2-33e8-1031-8f62-f1c6fd57f8b3
May 16 14:33:25 zre-ldap002 slapd[17422]: syncrepl_entry: rid=100
LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD)
May 16 14:33:25 zre-ldap002 slapd[17422]: syncrepl_entry: rid=100 inserted UUID
add167b2-33e8-1031-8f62-f1c6fd57f8b3
May 16 14:33:25 zre-ldap002 slapd[17422]: dn_callback : entries have identical
CSN cn=zimbra 20120516212017.543213Z#000000#000#000000
May 16 14:33:25 zre-ldap002 slapd[17422]: syncrepl_entry: rid=100 be_search (0)
May 16 14:33:25 zre-ldap002 slapd[17422]: syncrepl_entry: rid=100 cn=zimbra
May 16 14:33:25 zre-ldap002 slapd[17422]: syncrepl_entry: rid=100 entry
unchanged, ignored (cn=zimbra)
May 16 14:33:25 zre-ldap002 slapd[17422]: syncrepl_message_to_entry: rid=100 DN:
cn=admins,cn=zimbra, UUID: add18ab2-33e8-1031-8f63-f1c6fd57f8b3
May 16 14:33:25 zre-ldap002 slapd[17422]: syncrepl_entry: rid=100
LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD)
May 16 14:33:25 zre-ldap002 slapd[17422]: syncrepl_entry: rid=100 inserted UUID
add18ab2-33e8-1031-8f63-f1c6fd57f8b3
May 16 14:33:25 zre-ldap002 slapd[17422]: dn_callback : entries have identical
CSN cn=admins,cn=zimbra 20120516212017.544170Z#000000#000#000000
May 16 14:33:25 zre-ldap002 slapd[17422]: syncrepl_entry: rid=100 be_search (0)
May 16 14:33:25 zre-ldap002 slapd[17422]: syncrepl_entry: rid=100
cn=admins,cn=zimbra
May 16 14:33:25 zre-ldap002 slapd[17422]: syncrepl_entry: rid=100 entry
unchanged, ignored (cn=admins,cn=zimbra)

(etc, going through all 000 SID entries).

Once it finishes, it starts over:


May 16 14:33:25 zre-ldap002 slapd[17422]: dn_callback : entries have identical
CSN cn=com_zimbra_click2call_mitel,cn=zimlets,cn=zimbra
20120516212615.543144Z#000000#000#000000
May 16 14:33:25 zre-ldap002 slapd[17422]: syncrepl_entry: rid=100 be_search (0)
May 16 14:33:25 zre-ldap002 slapd[17422]: syncrepl_entry: rid=100
cn=com_zimbra_click2call_mitel,cn=zimlets,cn=zimbra
May 16 14:33:25 zre-ldap002 slapd[17422]: syncrepl_entry: rid=100 entry
unchanged, ignored (cn=com_zimbra_click2call_mitel,cn=zimlets,cn=zimbra)
May 16 14:33:25 zre-ldap002 slapd[17422]: syncrepl_message_to_entry: rid=100 DN:
cn=zre-ldap003.eng.vmware.com,cn=servers,cn=zimbra, UUID:
5a64830a-33ea-1031-8fbb-f1c6fd57f8b3
May 16 14:33:25 zre-ldap002 slapd[17422]: syncrepl_entry: rid=100
LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD)
May 16 14:33:25 zre-ldap002 slapd[17422]: syncrepl_entry: rid=100 inserted UUID
5a64830a-33ea-1031-8fbb-f1c6fd57f8b3
May 16 14:33:25 zre-ldap002 slapd[17422]: dn_callback : entries have identical
CSN cn=zre-ldap003.eng.vmware.com,cn=servers,cn=zimbra
20120516213323.905859Z#000000#001#000000
May 16 14:33:25 zre-ldap002 slapd[17422]: syncrepl_entry: rid=100 be_search (0)
May 16 14:33:25 zre-ldap002 slapd[17422]: syncrepl_entry: rid=100
cn=zre-ldap003.eng.vmware.com,cn=servers,cn=zimbra
May 16 14:33:25 zre-ldap002 slapd[17422]: syncrepl_entry: rid=100 entry
unchanged, ignored (cn=zre-ldap003.eng.vmware.com,cn=servers,cn=zimbra)
May 16 14:33:25 zre-ldap002 slapd[17422]: do_syncrep2: rid=100
LDAP_RES_SEARCH_RESULT
May 16 14:33:25 zre-ldap002 slapd[17422]: do_syncrep2: rid=100
cookie=rid=100,sid=002,csn=20120516213323.905859Z#000000#001#000000;20120516213237.680889Z#000000#002#000000
May 16 14:33:25 zre-ldap002 slapd[17422]: slap_queue_csn: queing 0x5805da0
20120516213237.680889Z#000000#002#000000
May 16 14:33:25 zre-ldap002 slapd[17422]: slap_graduate_commit_csn: removing
0x5805620 20120516213237.680889Z#000000#002#000000
May 16 14:33:25 zre-ldap002 slapd[17422]: do_syncrep2: rid=100
cookie=rid=000,sid=002,csn=20120516212017.543213Z#000000#000#000000
May 16 14:33:25 zre-ldap002 slapd[17422]: slap_queue_csn: queing 0x58d2ced
20120516212017.543213Z#000000#000#000000
May 16 14:33:25 zre-ldap002 slapd[17422]: slap_graduate_commit_csn: removing
0x5805bc0 20120516212017.543213Z#000000#000#000000
May 16 14:33:25 zre-ldap002 slapd[17422]: syncrepl_message_to_op: rid=100 be_add
cn=zimbra (68)
May 16 14:33:25 zre-ldap002 slapd[17422]: do_syncrep2: rid=100 delta-sync lost
sync on (reqStart=20120516213237.000019Z,cn=accesslog), switching to REFRESH
May 16 14:33:25 zre-ldap002 slapd[17422]: syncrepl_message_to_entry: rid=100 DN:
cn=zimbra, UUID: add167b2-33e8-1031-8f62-f1c6fd57f8b3
May 16 14:33:25 zre-ldap002 slapd[17422]: syncrepl_entry: rid=100
LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD)
May 16 14:33:25 zre-ldap002 slapd[17422]: syncrepl_entry: rid=100 inserted UUID
add167b2-33e8-1031-8f62-f1c6fd57f8b3
May 16 14:33:25 zre-ldap002 slapd[17422]: dn_callback : entries have identical
CSN cn=zimbra 20120516212017.543213Z#000000#000#000000
May 16 14:33:25 zre-ldap002 slapd[17422]: syncrepl_entry: rid=100 be_search (0)
May 16 14:33:25 zre-ldap002 slapd[17422]: syncrepl_entry: rid=100 cn=zimbra
May 16 14:33:25 zre-ldap002 slapd[17422]: syncrepl_entry: rid=100 entry
unchanged, ignored (cn=zimbra)


On the second MMR master server, it logs:

May 16 14:36:15 zre-ldap003 slapd[26137]: conn=1002 op=138856 SRCH attr=* +
May 16 14:36:15 zre-ldap003 slapd[26137]: conn=1002 op=138856 SEARCH RESULT
tag=101 err=0 nentries=0 text=
May 16 14:36:15 zre-ldap003 slapd[26137]: conn=1002 op=138857 SRCH
base="cn=accesslog" scope=2 deref=0
filter="(&(objectClass=auditWriteObject)(reqResult=0))"
May 16 14:36:15 zre-ldap003 slapd[26137]: conn=1002 op=138857 SRCH attr=reqDN
reqType reqMod reqNewRDN reqDeleteOldRDN reqNewSuperior entryCSN
May 16 14:36:15 zre-ldap003 slapd[26137]: syncprov_search_response:
cookie=rid=100,sid=002,csn=20120516212644.426510Z#000000#000#000000;20120516213448.200519Z#000000#001#000000;20120516213237.680085Z#000000#002#000000
May 16 14:36:15 zre-ldap003 slapd[26137]: conn=1002 op=138858 ABANDON
msg=138858
May 16 14:36:15 zre-ldap003 slapd[26137]: conn=1002 op=138859 SRCH base=""
scope=2 deref=0 filter="(objectClass=*)"
May 16 14:36:15 zre-ldap003 slapd[26137]: conn=1002 op=138859 SRCH attr=* +
May 16 14:36:15 zre-ldap003 slapd[26137]: conn=1002 op=138859 SEARCH RESULT
tag=101 err=0 nentries=0 text=
May 16 14:36:15 zre-ldap003 slapd[26137]: conn=1002 op=138860 SRCH
base="cn=accesslog" scope=2 deref=0
filter="(&(objectClass=auditWriteObject)(reqResult=0))"
May 16 14:36:15 zre-ldap003 slapd[26137]: conn=1002 op=138860 SRCH attr=reqDN
reqType reqMod reqNewRDN reqDeleteOldRDN reqNewSuperior entryCSN
May 16 14:36:15 zre-ldap003 slapd[26137]: conn=1002 op=138861 ABANDON
msg=138861
May 16 14:36:15 zre-ldap003 slapd[26137]: conn=1002 op=138862 SRCH base=""
scope=2 deref=0 filter="(objectClass=*)"
May 16 14:36:15 zre-ldap003 slapd[26137]: conn=1002 op=138862 SRCH attr=* +
May 16 14:36:15 zre-ldap003 slapd[26137]: conn=1002 op=138862 SEARCH RESULT
tag=101 err=0 nentries=0 text=
May 16 14:36:15 zre-ldap003 slapd[26137]: conn=1002 op=138863 SRCH
base="cn=accesslog" scope=2 deref=0
filter="(&(objectClass=auditWriteObject)(reqResult=0))"
May 16 14:36:15 zre-ldap003 slapd[26137]: conn=1002 op=138863 SRCH attr=reqDN
reqType reqMod reqNewRDN reqDeleteOldRDN reqNewSuperior entryCSN
May 16 14:36:15 zre-ldap003 slapd[26137]: conn=1002 op=138864 ABANDON
msg=138864
May 16 14:36:15 zre-ldap003 slapd[26137]: conn=1002 op=138865 SRCH base=""
scope=2 deref=0 filter="(objectClass=*)"
May 16 14:36:15 zre-ldap003 slapd[26137]: conn=1002 op=138865 SRCH attr=* +
May 16 14:36:15 zre-ldap003 slapd[26137]: conn=1002 op=138865 SEARCH RESULT
tag=101 err=0 nentries=0 text=
May 16 14:36:15 zre-ldap003 slapd[26137]: conn=1002 op=138866 SRCH
base="cn=accesslog" scope=2 deref=0
filter="(&(objectClass=auditWriteObject)(reqResult=0))"
May 16 14:36:15 zre-ldap003 slapd[26137]: conn=1002 op=138866 SRCH attr=reqDN
reqType reqMod reqNewRDN reqDeleteOldRDN reqNewSuperior entryCSN
May 16 14:36:15 zre-ldap003 slapd[26137]: syncprov_search_response:
cookie=rid=100,sid=002,csn=20120516212644.426510Z#000000#000#000000;20120516213448.200519Z#000000#001#000000;20120516213237.680085Z#000000#002#000000
May 16 14:36:15 zre-ldap003 slapd[26137]: conn=1002 op=138867 ABANDON
msg=138867



However, I can get a working delta-sync MMR pair if I do the following:


Set up a stand alone OpenLDAP server -- Just cn=config and primary DB.

Basic database for the server is created (note all entries now have SID value
0)

Update the server to be delta-sync master: add accesslog DB.

Install second server as a replica.

Update first server to be delta-sync MMR: Set sid=1, mirrormode=true on primary
DB, add syncrepl stanza for second server

Update second server to be delta-sync MMR: Set sid=2, mirrormode=true on primary
DB, add accesslog DB, add syncprel stanza for first server


In this scenario everything works as expected.

The issue seems to be caused in the first scenario by server 2 replicating the
DB with SID=0 bits from the primary server, and that looping back to the first
server endlessly, instead of being permanently ignored.



--Quanah
Comment 1 Quanah Gibson-Mount 2012-05-18 00:42:41 UTC
--On Wednesday, May 16, 2012 10:27 PM +0000 quanah@OpenLDAP.org wrote:

> Full_Name: Quanah Gibson-Mount
> Version: 2.4.31
> OS: Linux 2.6
> URL: ftp://ftp.openldap.org/incoming/
> Submission from: (NULL) (75.108.184.39)

We can see that the script turning it into a master ran here:

Thu May 17 16:05:46 2012 *** Running as zimbra user: 
/opt/zimbra/libexec/zmldapenable-mmr -s 2 -m 
ldap://zre-ldap002.eng.vmware.com:389/

so 16:05:46

In the accesslog, we see:

dn: cn=accesslog
objectClass: auditContainer
cn: accesslog
structuralObjectClass: auditContainer
contextCSN: 20120517225152.913667Z#000000#000#000000
contextCSN: 20120517230823.615364Z#000000#001#000000
contextCSN: 20120517230546.409118Z#000000#002#000000

dn: reqStart=20120517230546.000019Z,cn=accesslog
objectClass: auditAdd
structuralObjectClass: auditAdd
reqStart: 20120517230546.000019Z
reqEnd: 20120517230546.000020Z
reqType: add
reqSession: 100
reqAuthzID: cn=config
reqDN: cn=zimbra
reqResult: 0
reqMod: objectClass:+ organizationalRole
reqMod: description:+ Zimbra Systems Application Data
reqMod: cn:+ zimbra
reqMod: structuralObjectClass:+ organizationalRole
reqMod: entryUUID:+ 40f78bea-34be-1031-8a5d-e1466f667e19
reqMod: creatorsName:+ cn=config
reqMod: createTimestamp:+ 20120517224907Z
reqMod: entryCSN:+ 20120517224907.221672Z#000000#000#000000
reqMod: modifiersName:+ cn=config
reqMod: modifyTimestamp:+ 20120517224907Z
reqEntryUUID: 40f78bea-34be-1031-8a5d-e1466f667e19
entryUUID: 948929e2-34c0-1031-9a14-c93bd10ff0f2
creatorsName: cn=config
createTimestamp: 20120517224907Z
entryCSN: 20120517224907.221672Z#000000#000#000000
modifiersName: cn=config
modifyTimestamp: 20120517224907Z

so it is tracking "000" as a third master?  This seems to be why the 
original server (which was 000 before being promoted to 001) replicates 
these entries back to itself.

--Quanah

--

Quanah Gibson-Mount
Sr. Member of Technical Staff
Zimbra, Inc
A Division of VMware, Inc.
--------------------
Zimbra ::  the leader in open source messaging and collaboration

Comment 2 Howard Chu 2012-05-30 01:25:12 UTC
quanah@zimbra.com wrote:
> --On Wednesday, May 16, 2012 10:27 PM +0000 quanah@OpenLDAP.org wrote:
>
>> Full_Name: Quanah Gibson-Mount
>> Version: 2.4.31
>> OS: Linux 2.6
>> URL: ftp://ftp.openldap.org/incoming/
>> Submission from: (NULL) (75.108.184.39)
>
> We can see that the script turning it into a master ran here:
>
> Thu May 17 16:05:46 2012 *** Running as zimbra user:
> /opt/zimbra/libexec/zmldapenable-mmr -s 2 -m
> ldap://zre-ldap002.eng.vmware.com:389/
>
> so 16:05:46
>
> In the accesslog, we see:
>
> dn: cn=accesslog
> objectClass: auditContainer
> cn: accesslog
> structuralObjectClass: auditContainer
> contextCSN: 20120517225152.913667Z#000000#000#000000
> contextCSN: 20120517230823.615364Z#000000#001#000000
> contextCSN: 20120517230546.409118Z#000000#002#000000
>
> dn: reqStart=20120517230546.000019Z,cn=accesslog
> objectClass: auditAdd
> structuralObjectClass: auditAdd
> reqStart: 20120517230546.000019Z
> reqEnd: 20120517230546.000020Z
> reqType: add
> reqSession: 100
> reqAuthzID: cn=config
> reqDN: cn=zimbra
> reqResult: 0
> reqMod: objectClass:+ organizationalRole
> reqMod: description:+ Zimbra Systems Application Data
> reqMod: cn:+ zimbra
> reqMod: structuralObjectClass:+ organizationalRole
> reqMod: entryUUID:+ 40f78bea-34be-1031-8a5d-e1466f667e19
> reqMod: creatorsName:+ cn=config
> reqMod: createTimestamp:+ 20120517224907Z
> reqMod: entryCSN:+ 20120517224907.221672Z#000000#000#000000
> reqMod: modifiersName:+ cn=config
> reqMod: modifyTimestamp:+ 20120517224907Z
> reqEntryUUID: 40f78bea-34be-1031-8a5d-e1466f667e19
> entryUUID: 948929e2-34c0-1031-9a14-c93bd10ff0f2
> creatorsName: cn=config
> createTimestamp: 20120517224907Z
> entryCSN: 20120517224907.221672Z#000000#000#000000
> modifiersName: cn=config
> modifyTimestamp: 20120517224907Z
>
> so it is tracking "000" as a third master?  This seems to be why the
> original server (which was 000 before being promoted to 001) replicates
> these entries back to itself.

The loop is caused by the patch to ITS#6872, which considers a consumer out of 
date whenever the number of CSNs in its sync request doesn't match the number 
known to the provider.

The data here is basically invalid: server1 has entries generated using SID=0 
but it has no contextCSN value with SID=0. It only sent SID=1 and SID=2 in its 
sync request. Server2, which just updated from server1, has a contextCSN for 
SID=0 in addition to 1 and 2 (and that's all correct).

Server1 should have always had a contextCSN value for SID=0 but doesn't. This 
problem would not occur if server1 was converted first from standalone into a 
single-master. I.e., load syncprov on it, let it scan the DB and generate the 
first sid=0 contextCSN, before turning it intu a MMR node.


-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Comment 3 Quanah Gibson-Mount 2012-05-30 02:41:22 UTC
--On Tuesday, May 29, 2012 6:25 PM -0700 Howard Chu <hyc@openldap.org> 
wrote:

>> so it is tracking "000" as a third master?  This seems to be why the
>> original server (which was 000 before being promoted to 001) replicates
>> these entries back to itself.
>
> The loop is caused by the patch to ITS#6872, which considers a consumer
> out of date whenever the number of CSNs in its sync request doesn't match
> the number known to the provider.
>
> The data here is basically invalid: server1 has entries generated using
> SID=0 but it has no contextCSN value with SID=0. It only sent SID=1 and
> SID=2 in its sync request. Server2, which just updated from server1, has
> a contextCSN for SID=0 in addition to 1 and 2 (and that's all correct).
>
> Server1 should have always had a contextCSN value for SID=0 but doesn't.
> This problem would not occur if server1 was converted first from
> standalone into a single-master. I.e., load syncprov on it, let it scan
> the DB and generate the first sid=0 contextCSN, before turning it intu a
> MMR node.


For documentation purposes -- The issue occurred because I set olcServerID 
before I loaded syncprov.  Re-ordering my script to load syncprov (which 
then caused a contextCSN value to be correctly generated on Server1 for 
SID=0) fixed this loop.

--Quanah


--

Quanah Gibson-Mount
Sr. Member of Technical Staff
Zimbra, Inc
A Division of VMware, Inc.
--------------------
Zimbra ::  the leader in open source messaging and collaboration

Comment 4 Quanah Gibson-Mount 2012-05-30 02:41:28 UTC
changed state Open to Closed
moved from Incoming to Documentation