[Date Prev][Date Next] [Chronological] [Thread] [Top]

Failing syncrepl connections because of an older contextCSN on the master



OpenLDAP version(s): 2.4.32 and recent git master (54a111907b48900e1ce5de12a9d87818a334423d)
OS: CentOS 5, 32-bit

In a master/slave delta-syncrepl configuration, syncrepl connections from the fully
synchronized and restarted slave to the freshly loaded master will fail because the master
uses an earlier contextCSN. This state persists until the first modification on the master,
when the contextCSN will become newer than the consumer's, whose next syncrepl retry will
then succeed.

Details (master and slave configurations are attached to this message):

The master is freshly loaded with the following LDIF:

dn: dc=example,dc=com
objectClass: locality
objectClass: dcObject
dc: example

dn: ou=People,dc=example,dc=com
objectClass: organizationalUnit
ou: People

When the master is started, there is only one entry in cn=accesslog:

dn: cn=accesslog
objectClass: auditContainer
cn: accesslog
structuralObjectClass: auditContainer
entryDN: cn=accesslog
subschemaSubentry: cn=Subschema
contextCSN: 20120910091205.088888Z#000000#000#000000
hasSubordinates: FALSE

The root entry of the main database looks like this:

dn: dc=example,dc=com
objectClass: locality
objectClass: dcObject
dc: example
structuralObjectClass: locality
entryUUID: 921f3734-24fc-4efa-ac52-10aa4d293332
creatorsName: cn=Manager,dc=example,dc=com
createTimestamp: 20120910090757Z
entryCSN: 20120910090757.483217Z#000000#000#000000
modifiersName: cn=Manager,dc=example,dc=com
modifyTimestamp: 20120910090757Z
entryDN: dc=example,dc=com
subschemaSubentry: cn=Subschema
auditContext: cn=accesslog
contextCSN: 20120910091205.089003Z#000000#000#000000
hasSubordinates: TRUE

When the slave server (with an empty database) is started, it connects to the master and
synchronizes. Their synchronization dialog, captured with -d sync, looks like this:

(master)
504db377 syncprov_search_response: cookie=rid=000,csn=20120910091205.089003Z#000000#000#000000
504db377 syncprov_search_response: cookie=rid=000,csn=20120910091205.088888Z#000000#000#000000

(slave)
504db370 syncrepl_message_to_entry: rid=000 DN: dc=example,dc=com, UUID: 921f3734-24fc-4efa-ac52-10aa4d293332
504db370 syncrepl_entry: rid=000 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD)
504db370 syncrepl_entry: rid=000 inserted UUID 921f3734-24fc-4efa-ac52-10aa4d293332
504db370 syncrepl_entry: rid=000 be_search (32)
504db370 syncrepl_entry: rid=000 dc=example,dc=com
504db370 syncrepl_entry: rid=000 be_add dc=example,dc=com (0)
504db370 syncrepl_message_to_entry: rid=000 DN: ou=People,dc=example,dc=com, UUID: 08fe53fb-724a-4eca-a57e-3c271d99dd7e
504db370 syncrepl_entry: rid=000 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD)
504db370 syncrepl_entry: rid=000 inserted UUID 08fe53fb-724a-4eca-a57e-3c271d99dd7e
504db370 syncrepl_entry: rid=000 be_search (0)
504db370 syncrepl_entry: rid=000 ou=People,dc=example,dc=com
504db370 syncrepl_entry: rid=000 be_add ou=People,dc=example,dc=com (0)
504db370 do_syncrep2: rid=000 LDAP_RES_SEARCH_RESULT
504db370 do_syncrep2: rid=000 cookie=rid=000,csn=20120910091205.089003Z#000000#000#000000
504db370 slap_queue_csn: queing 0x836dcf8 20120910091205.089003Z#000000#000#000000
504db370 slap_graduate_commit_csn: removing 0x8371328 20120910091205.089003Z#000000#000#000000
504db370 do_syncrep2: rid=000 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
504db370 do_syncrep2: rid=000 cookie=rid=000,csn=20120910091205.088888Z#000000#000#000000

The master sends the cookie with the contextCSN belonging to cn=accesslog. If the slave is
now restarted, what happens is this:

504db38d do_syncrep2: rid=000 LDAP_RES_SEARCH_RESULT
504db38d do_syncrep2: rid=000 LDAP_RES_SEARCH_RESULT (53) Server is unwilling to perform
504db38d do_syncrep2: rid=000 (53) Server is unwilling to perform
504db38d do_syncrepl: rid=000 rc -2 retrying

The slave sends csn=20120910091205.089003Z#000000#000#000000, but the master uses
csn=20120910091205.088888Z#000000#000#000000. When the master is subsequently modified,
it will use a newer contextCSN, and the slave's next retry will succeed.

Is this worthy of an ITS, since the situation is unlikely to occur in practice, and will
correct itself, only leaving the slave out of sync for the duration of the retry interval?

i.
include ../servers/slapd/schema/core.schema

argsfile master/run/slapd.args
pidfile master/run/slapd.pid

database mdb
suffix cn=accesslog
directory master/accesslog
index entryCSN,objectClass,reqEnd,reqResult,reqStart eq

overlay syncprov
syncprov-nopresent TRUE
syncprov-reloadhint TRUE

database mdb
suffix "dc=example,dc=com"
rootdn "cn=Manager,dc=example,dc=com"
rootpw secret
maxsize 256000000
directory master/main
index objectClass eq
index entryCSN,entryUUID eq

overlay syncprov
syncprov-checkpoint 1000 60

overlay accesslog
logdb cn=accesslog
logops writes
logsuccess TRUE
logpurge 07+00:00 01+00:00
include ../servers/slapd/schema/core.schema

argsfile replica/run/slapd.args
pidfile replica/run/slapd.pid

database mdb
suffix "dc=example,dc=com"
rootdn "cn=Manager,dc=example,dc=com"
rootpw secret
maxsize 256000000
directory replica/main
index objectClass eq
index entryUUID eq
syncrepl rid=0
	provider=ldap://localhost:6389/
	bindmethod=simple
	binddn="cn=Manager,dc=example,dc=com"
	credentials=secret
	searchbase="dc=example,dc=com"
	logbase="cn=accesslog"
	logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"
	schemachecking=on
	type=refreshAndPersist
	retry="60 +"
	syncdata=accesslog
updateref "ldap://localhost:6389";