
Re: (ITS#8396) syncprov hourly fails to answer syncrepl



--On Thursday, April 07, 2016 11:57 PM +0000 quanah@zimbra.com wrote:


Full summary:

The syncprov checkpoint operation causes the CSN to be lost for the first 
write operation that occurs after the checkpoint.  It is important to note 
that no data is lost; all changes replicate as they should.

However, the replica CSN is not updated in this scenario, making it appear 
that the replica is out of sync with the master.  Adding the syncprov 
overlay to a replica database works around this issue by forcing the 
replica to track its internal CSNs, rather than relying on broadcasts from 
the master.

It is trivial to reproduce this issue by setting a short checkpoint 
interval with the syncprov-checkpoint parameter.
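
For illustration, a provider-side slapd.conf fragment along these lines 
should be enough to trigger frequent checkpoints.  This is only a sketch: 
the 5 minute value matches the example below, but the operation count of 
100 is an assumed placeholder, not the value from the affected server.

```
# slapd.conf fragment for the provider database (illustrative values)
overlay syncprov
# syncprov-checkpoint <ops> <minutes>: checkpoint once 100 writes or
# 5 minutes have passed since the last checkpoint (100 is an assumption)
syncprov-checkpoint 100 5
```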

Example of the problem:

We have a script modifying the userPassword attribute of an entry every 45 
seconds, and syncprov-checkpoint is set to trigger every 5 minutes. 
From the log we can see:

Apr  7 18:00:38 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr  7 18:05:53 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr  7 18:11:09 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr  7 18:16:25 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr  7 18:17:55 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr  7 18:21:41 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr  7 18:26:57 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr  7 18:32:13 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100

Stopping the script after the 18:32:13 operation, and examining the CSN 
values on each server, we see the following.

master:
[zimbra@zre-ldap003 scripts]$ ldapsearch -x -LLL -H ldap://zre-ldap002.eng.zimbra.com -s base -b "dc=uvm,dc=edu" contextCSN
dn: dc=uvm,dc=edu
contextCSN: 20160407233212.979013Z#000000#000#000000

replica:
[zimbra@zre-ldap003 scripts]$ ldapsearch -x -LLL -H ldapi:// -s base -b "dc=uvm,dc=edu" contextCSN
dn: dc=uvm,dc=edu
contextCSN: 20160407233127.886702Z#000000#000#000000

Note that the CSNs are 45 seconds apart, which is exactly the interval 
between our writes.  So the CSN left on the replica is the one from the 
write op /prior/ to the checkpoint: the replica ignores the syncprov send 
response that carries an empty CSN, and thus does not update its own CSN.
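
The gap between the two contextCSN values can be checked with a small 
shell helper such as the one below.  This is a rough sketch rather than 
part of the original test setup: the function names csn_epoch and csn_lag 
are made up for illustration, GNU date is assumed (for the -d option), and 
only the whole-second part of each CSN is compared.

```shell
#!/bin/sh
# Hypothetical helpers for comparing two contextCSN values.
# Assumes GNU date (the -d option) is available.

# Convert the timestamp part of a CSN (YYYYmmddHHMMSS.uuuuuuZ#...) to
# epoch seconds, ignoring the microseconds.
csn_epoch() {
    ts=${1%%.*}    # keep only YYYYmmddHHMMSS
    d=$(printf '%s' "$ts" | \
        sed 's/^\(....\)\(..\)\(..\)\(..\)\(..\)\(..\)$/\1-\2-\3 \4:\5:\6/')
    date -u -d "$d" +%s
}

# Print how many whole seconds the second CSN trails the first.
csn_lag() {
    echo $(( $(csn_epoch "$1") - $(csn_epoch "$2") ))
}

# The two values reported above (master first, replica second).
master_csn='20160407233212.979013Z#000000#000#000000'
replica_csn='20160407233127.886702Z#000000#000#000000'
csn_lag "$master_csn" "$replica_csn"
```

A lag equal to the write interval, with no further writes pending, points 
at this issue rather than at a real replication failure.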

While it is of course best practice to run the syncprov overlay on the 
replica to enforce internal CSN cohesion, it still should not be required, 
and this is clearly a bug that can cause admins to incorrectly believe that 
their servers are having replication issues.

--Quanah


--

Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra ::  the leader in open source messaging and collaboration
A division of Synacor, Inc