Syncrepl cookie and "reset by peer" message

Sorry, I meant to put a different title on this submission.

-----Original Message-----
From: openldap-technical-bounces+robert.hanson=calabrio.com@OpenLDAP.org [mailto:openldap-technical-bounces+robert.hanson=calabrio.com@OpenLDAP.org] On Behalf Of Robert Hanson
Sent: Monday, August 31, 2009 6:15 PM
To: openldap-technical@openldap.org
Subject: RE: Slow LDAP

Hi all, a question about the syncrepl cookie.
This is OpenLDAP 2.4.17 on Linux, in multimaster mode with 2 nodes.

We're running a test where we remove data from and add data to the database through a C++ application (calling ldap_delete_ext_s() and ldap_add_ext_s()).  The test begins with a freshly created slapd database that has an empty tree (i.e., no nodes exist below a starting node, lcc=lcc1, ou=Company, o=CMP).  The test proceeds in these steps:
1) add a small number of nodes and subnodes to (A);
2) wait for the tree to replicate to (B);
3) remove the tree from (A);
4) wait for the tree to be removed from (B).
This repeats several times.

We always run this test against one node (A) and only check on the other node (B) that replication is working.

What we're seeing is the following behavior on A:  If we run this test through many cycles without restarting slapd on either node, everything works as expected.  

If we shut down slapd and restart it on (A) after some cycle (at the end of step 4), then we begin to see many messages in the log file similar to the following, for multiple nodes:

syncprov_search_response: Entry ou=Wrapup,lcc=lcc1,ou=Company,o=CMP changed by peer, ignored

And on (B), the nodes that produce this message on (A) are present, but their organizationalUnit=glue.

The difference, as far as I can tell, comes down to this check in syncprov.c (about line 2160):
	if ( sid == srs->sr_state.sid && srs->sr_state.numcsns ) {
		Debug( LDAP_DEBUG_SYNC,
			"Entry %s changed by peer, ignored\n",
			rs->sr_entry->e_name.bv_val, 0, 0 );

I log the value of srs->sr_state.numcsns.  When I don't restart slapd, the value of numcsns is 0 (since there is no cookie in the db); after I restart slapd, the value is 1 (this is the value read from the cookie in the db).  I've also logged the value of numcsns at the point where the cookie is written during a checkpoint; it is 1.
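To make the effect of numcsns concrete, here is a small stand-alone model of that condition. It is only a sketch, not the slapd code: struct cookie_state, csn_sid() and skipped_as_peer_change() are names I made up for illustration, and the only thing taken from syncprov.c is the shape of the test itself. It assumes the usual 2.4 CSN layout, time#count#sid#mod, with the sid field in hexadecimal.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the per-search cookie state; only the two
 * fields used by the check are modeled. */
struct cookie_state {
    int sid;      /* server ID carried in the cookie */
    int numcsns;  /* number of CSNs carried in the cookie */
};

/* Extract the server ID from a CSN of the form time#count#sid#mod.
 * The sid field is hexadecimal. Returns -1 if the CSN has no sid field. */
static int csn_sid(const char *csn)
{
    const char *p;

    assert(csn != NULL);
    p = strchr(csn, '#');            /* end of the time field */
    if (p == NULL)
        return -1;
    p = strchr(p + 1, '#');          /* end of the count field */
    if (p == NULL)
        return -1;
    return (int)strtol(p + 1, NULL, 16);  /* stops at the next '#' */
}

/* Same shape as the test quoted above: the entry is skipped only when its
 * CSN carries the cookie's sid AND the cookie holds at least one CSN. */
static int skipped_as_peer_change(const struct cookie_state *st,
                                  const char *entry_csn)
{
    int sid = csn_sid(entry_csn);
    return sid == st->sid && st->numcsns != 0;
}
```

With a made-up entry CSN of 20090831231500.000000Z#000000#001#000000 and a cookie sid of 1, skipped_as_peer_change() returns 0 when numcsns is 0 (what I log without a restart) and 1 when numcsns is 1 (what I log after the restart). So with the sids matching, numcsns alone decides whether the entry is sent or ignored.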

I'm thinking this is a bug in the code.  The only difference in the two types of runs is that slapd is restarted on (A), and only restarted after sync is complete and both systems are quiescent.