
Syncrepl refreshOnly replication failures



OpenLDAP,

We recently upgraded our OpenLDAP master from 2.2.30 to 2.3.30. Most of our replicas (about 5 production and 3 or 4 test replicas) still run 2.2.x. Before the upgrade we tested syncrepl refreshOnly replication across these versions, and it seemed to work fine.

After upgrading our production master we ran into BDB lock exhaustion, as I've noticed others have. On the new master this showed up as the master seeming to "loop", consuming CPU while trying to serve replication. It still handled direct reads and writes, however, unlike the 2.2.x master, which simply hung in that situation. I'm not sure that resilience in the face of its replication failures was a good thing.

Regardless, we increased the number of available BDB locks and ran a BDB recovery, and the restarted master has been stable since.
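For anyone hitting the same lock exhaustion: with back-bdb, the Berkeley DB lock limits are normally raised in the DB_CONFIG file in the database directory. The fragment below is a minimal sketch; the numbers are placeholders chosen for illustration, not the values used on this master and not a recommendation. Size them from your own lock statistics (Berkeley DB's `db_stat -c` reports lock usage).

```text
# DB_CONFIG in the back-bdb database directory.
# Illustrative values only (assumptions); tune from db_stat -c output.
set_lk_max_locks   3000
set_lk_max_objects 1500
set_lk_max_lockers 1500
```

Lock-table sizes are fixed when the Berkeley DB environment is created, so new limits generally take effect only after the environment is recreated, for example by running `db_recover` against the database directory while slapd is stopped.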

Since then, however, we have noticed that parts of some replicas' directories were not being replicated from the master. To deal with this we reinitialized our replicas (wiped their databases and reloaded them from the master).

So far, so good since then.

My question to the list: does anybody know whether this reinitialization is sufficient? That is, do we (for example):

1. Also need to reinitialize our master (e.g. rebuild it from an LDIF or from another replica)?

2. Also need to upgrade all our replicas (e.g. to 2.3.x)?

3. Need to do something else to ensure that our replicas stay faithful to our new master?

#1 and #2 impose significant costs in our environment, where different organizational entities administer their OpenLDAP servers independently. #1 requires restarting all the replicas; #2, of course, carries the much larger cost of upgrading them all. #3?

Any thoughts are appreciated. Our understanding of how the syncrepl implementation has evolved across releases isn't good enough for us to choose among the options above.

--Jed http://www.nersc.gov/~jed/