[Date Prev][Date Next] [Chronological] [Thread] [Top]

Odd MMR behaviour with delta-syncrepl and refreshAndPersist



Hi,

I'd discussed this in the IRC channel a while back and we were all a bit
stumped so I thought it would be worth sending it to the mailing list.
If I use delta-syncrepl in an MMR cluster with >2 nodes I get the
following behaviour when adding and deleting objects:

Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: do_syncrep2:
rid=033 cookie=rid=033,sid=006,csn=20160603133734.289843Z#000000#006#000000
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: slap_queue_csn:
queueing 0x4659f80 20160603133734.289843Z#000000#006#000000
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]:
slap_graduate_commit_csn: removing 0x4659bc0
20160603133551.231411Z#000000#006#000000
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: slap_queue_csn:
queueing 0x4659e80 20160603133734.289843Z#000000#006#000000
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]:
syncprov_matchops: skipping original sid 006
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]:
slap_graduate_commit_csn: removing 0x4659e80
20160603133734.289843Z#000000#006#000000
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]:
slap_graduate_commit_csn: removing 0x4659f80
20160603133734.289843Z#000000#006#000000
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]:
syncrepl_message_to_op: rid=033 be_add
cn=marksgroup2,ou=ug,ou=iti,ou=is,dc=authorise,dc=ed,dc=ac,dc=uk (0)
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: slap_queue_csn:
queueing 0x4659fc0 20160603133734.289843Z#000000#006#000000
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]:
syncprov_sendresp: to=004,
cookie=rid=032,sid=005,csn=20160603133734.289843Z#000000#006#000000
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: do_syncrep2:
rid=031 cookie=rid=031,sid=004
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: slap_queue_csn:
queueing 0x46599c0 20160603133734.289843Z#000000#006#000000
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: slap_queue_csn:
queueing 0xb58a140 20160603133734.289843Z#000000#006#000000
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]:
slap_graduate_commit_csn: removing 0x46599c0
20160603133734.289843Z#000000#006#000000
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]:
slap_graduate_commit_csn: removing 0x4659fc0
20160603133734.289843Z#000000#006#000000
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]:
slap_graduate_commit_csn: removing 0xb58a140
20160603133734.289843Z#000000#006#000000
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]:
syncrepl_message_to_op: rid=031 be_add
cn=marksgroup2,ou=ug,ou=iti,ou=is,dc=authorise,dc=ed,dc=ac,dc=uk (68)
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: do_syncrep2:
rid=031 delta-sync lost sync on
(reqStart=20160603133734.000001Z,cn=accesslog), switching to REFRESH
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: do_syncrep2:
rid=031 LDAP_RES_INTERMEDIATE - SYNC_ID_SET

It looks very much like a time sync issue but I've confirmed that my
servers are all synced against a single common NTP server and are as
closely synced time-wise as could be reasonably expected:

[root@alder ~]# ntpstat
synchronised to NTP server (129.215.205.191) at stratum 3
   time correct to within 39 ms
   polling server every 64 s
[root@beech ~]# ntpstat
synchronised to NTP server (129.215.205.191) at stratum 3
   time correct to within 38 ms
   polling server every 64 s
[root@oak ~]# ntpstat
synchronised to NTP server (129.215.205.191) at stratum 3
   time correct to within 38 ms
   polling server every 64 s
[root@rowan ~]# ntpstat
synchronised to NTP server (129.215.205.191) at stratum 3
   time correct to within 39 ms
   polling server every 64 s

In terms of server config the olcSyncrepl directives are of the form:

olcSyncrepl: {0}rid=31 provider=ldaps://beech.authorise.is.ed.ac.uk/
realm=EASE.ED.AC.UK bindmethod=sasl saslmech=gssapi
authcid="replicator.authorise.is.ed.ac.uk@EASE.ED.AC.UK"
searchbase="dc=authorise,dc=ed,dc=ac,dc=uk" logbase="cn=accesslog"
logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"
schemachecking=on type=refreshAndPersist retry="60 +" tls_reqcert=never
timeout=1 keepalive=300:12:5 syncdata=accesslog

with a line for each of the 4 servers.

and my syncprov config on the database being replicated is:
dn: olcOverlay={2}syncprov,olcDatabase={1}mdb,cn=config
objectClass: olcOverlayConfig
objectClass: olcConfig
objectClass: top
objectClass: olcSyncProvConfig
olcOverlay: {2}syncprov
structuralObjectClass: olcSyncProvConfig

If I have 2 servers in the cluster it works fine, if I use pull-based
replication (refreshOnly) it also works fine and if I use traditional
syncrepl with refreshAndPersist it works fine.

I'm using OpenLDAP 2.4.44 on Centos 6 and 7 x86_64 with the same behaviour.

Any ideas? At the moment I've proceeded to go-live with pull-based
replication but it would be nice to get push-based replication working.
Also while I have the "old" servers sitting there it makes sense to use
these for testing/debugging before retiring them.
-- 
/****************************

Mark Cairney
ITI Enterprise Services
Information Services
University of Edinburgh

Tel: 0131 650 6565
Email: Mark.Cairney@ed.ac.uk
PGP: 0x435A9621

*******************************/

The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

Attachment: signature.asc
Description: OpenPGP digital signature