Re: Different contextCSN on multi-master replication servers

Thank you both Howard and Leonid.

Yes, you're right, it happened the other way around; the modification was made on the second server and propagated back to the first one. However, I don't know why the change was in turn returned to the second server - SIDs are ok as far as I know.

However, as Leonid mentioned, both servers weren't synchronized correctly anyway. Turns out that yesterday we upgraded to 2.4.39-6 (as I stated in my first mail). Previously, we were using 2.4.39-3 and it seemed to work fine. We also noted that 2.4.39-6 produced some additional issues (like client and syncrepl sockets dying without any apparent reason), so today we downgraded back to 2.4.39-3 and everything seems to work just fine again.

We had a look at the changelog from 2.4.39-3 to 2.4.39-6 and no change seems to be explicitly syncrepl related; the changes look related to LDAPS instead (strange, as we use the LDAP protocol for syncrepl rather than LDAPS). Anyway, we'll keep version 2.4.39-3 as long as it works well.

Thanks.

Regards.

2015-04-21 23:53 GMT+01:00 Леонид Юрьев <leo@yuriev.ru>:

Hi Nicolás,

1) If the contextCSN values differ between servers, then the servers are still not synchronized (or replication has glitches).
http://www.openldap.org/lists/openldap-technical/201108/threads.html#00001

2) Replication takes some time. Therefore the contextCSN values can be equal only after a period with no changes.

3) Make sure that the time is synchronized on servers (e.g. by using ntpdate).

4) Unfortunately, all current releases (including 2.4.39 and 2.4.40) have quite a few bugs in the replication code.
For example, because of ITS#8081 (http://www.openldap.org/its/index.cgi/Software%20Bugs?id=8081) you could get a segfault, and you could also lose some changes during replication (as if they had been undone).

5) We made a fork of the OpenLDAP project for our use case (high-load, TELCO-aware multi-master); it is called ReOpenLDAP.
If you decide to build slapd from source, I recommend using our ReOpenLDAP ;)

The new features are not yet documented in the English man pages, but you can translate the notes with Google:
https://github.com/ReOpen/ReOpenLDAP/releases/tag/ReOpenLDAP-2.4.41-rc
https://github.com/ReOpen/ReOpenLDAP/commit/4fc4bc18dd4bd80909aa80700c5c19b0816ca120
https://github.com/ReOpen/ReOpenLDAP/commit/95808b156ee36a886523b7096a75d5099e9b44fc
https://github.com/ReOpen/ReOpenLDAP/commit/1c94bc17ec285388e8a8299399ed537754fc3028

Leonid.

2015-04-21 16:01 GMT+03:00 Nicolás Kovac Neumann <nkovacne@ull.edu.es>:

Hi,

We're currently using N-way multimaster replication on two servers for our LDAP directory, both for the config and the hdb databases. It mostly works fine, but we've run into a possible issue with the syncrepl engine which we would like to shed light on. We're using CentOS 7 with openldap-servers version 2.4.39-6.

We made an update on one of the entries (server1, in this case), so server2 replicated that change (as shown below in the log):

     ==> server1/ldap.log <==
     Apr 21 13:38:55 server1 slapd[1835]: do_syncrep2: rid=002 cookie=rid=002,sid=002,csn=20150421123855.643239Z#000000#002#000000
     Apr 21 13:38:55 server1 slapd[1835]: syncrepl_message_to_entry: rid=002 DN: uid=user1,cn=subtree,dc=example,dc=org, UUID: 18a2436c-73ce-1030-95dd-b52dc05ced13
     Apr 21 13:38:55 server1 slapd[1835]: syncrepl_entry: rid=002 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY)
     Apr 21 13:38:55 server1 slapd[1835]: syncrepl_entry: rid=002 be_search (0)
     Apr 21 13:38:55 server1 slapd[1835]: syncrepl_entry: rid=002 uid=user1,cn=subtree,dc=example,dc=org
     Apr 21 13:38:55 server1 slapd[1835]: slap_queue_csn: queing 0x7ff8f42789f0 20150421123855.643239Z#000000#002#000000
     Apr 21 13:38:55 server1 slapd[1835]: slap_graduate_commit_csn: removing 0x7ff8f435e770 20150421123855.643239Z#000000#002#000000
     Apr 21 13:38:55 server1 slapd[1835]: syncrepl_entry: rid=002 be_modify uid=user1,cn=subtree,dc=example,dc=org (0)
     Apr 21 13:38:55 server1 slapd[1835]: syncprov_sendresp: cookie=rid=001,sid=001,csn=20150421123855.643239Z#000000#002#000000
     Apr 21 13:38:55 server1 slapd[1835]: slap_queue_csn: queing 0x7ff8f42789f0 20150421123855.643239Z#000000#002#000000
     Apr 21 13:38:55 server1 slapd[1835]: slap_graduate_commit_csn: removing 0x7ff8f41b7b90 20150421123855.643239Z#000000#002#000000

     ==> server2/ldap.log <==
     Apr 21 13:38:55 server2 slapd[1948]: slap_queue_csn: queing 0x7f897affb220 20150421123855.643239Z#000000#002#000000
     Apr 21 13:38:55 server2 slapd[1948]: syncprov_sendresp: to=001, cookie=rid=002,sid=002,csn=20150421123855.643239Z#000000#002#000000
     Apr 21 13:38:55 server2 slapd[1948]: slap_graduate_commit_csn: removing 0x7f89307f42a0 20150421123855.643239Z#000000#002#000000
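
To make the cookies in the log above easier to read, here is a minimal Python sketch that pulls the cookie out of a log line and splits its CSN. It assumes the CSN fields are, in order, timestamp, change count, server ID and modifier, separated by `#`. The function name `parse_cookie` and the dictionary keys are hypothetical, chosen only for this illustration.

```python
import re

# Matches the "cookie=rid=...,sid=...,csn=..." part of a slapd sync log line.
COOKIE_RE = re.compile(
    r"cookie=rid=(?P<rid>\d+),sid=(?P<sid>[0-9a-fA-F]+),csn=(?P<csn>\S+)"
)

def parse_cookie(log_line):
    """Return the cookie fields found in log_line, or None if there is no cookie."""
    match = COOKIE_RE.search(log_line)
    if match is None:
        return None
    # Assumed CSN layout: timestamp#change-count#server-ID#modifier
    timestamp, _count, csn_sid, _mod = match.group("csn").split("#")
    return {
        "rid": match.group("rid"),
        "cookie_sid": match.group("sid"),  # sid field of the cookie itself
        "csn_sid": csn_sid,                # server ID embedded in the CSN
        "timestamp": timestamp,
    }

line = (
    "Apr 21 13:38:55 server1 slapd[1835]: do_syncrep2: rid=002 "
    "cookie=rid=002,sid=002,csn=20150421123855.643239Z#000000#002#000000"
)
info = parse_cookie(line)
print(info["csn_sid"], info["timestamp"])
```

Under that assumption, every CSN in the excerpt carries server ID 002, including the one in the cookie that server1 sends out with sid=001.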

Nothing strange so far. However, if we query the contextCSN, the values differ between the two servers.

For server1, we have:

     contextCSN: 20150421123523.281736Z#000000#001#000000
     contextCSN: 20150421123417.889477Z#000000#002#000000

For server2, the value for server ID 001 differs:

     contextCSN: 20150421115324.003502Z#000000#001#000000
     contextCSN: 20150421123417.889477Z#000000#002#000000

However, the entry seems to replicate the entryCSN correctly on both servers:

     entryCSN: 20150421123417.889477Z#000000#002#000000

Is this the expected behavior? Shouldn't both contextCSN values match on both servers?
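
For reference, the comparison we did by eye can be sketched in Python. This is only an illustration using the values quoted above; it assumes the same `#`-separated CSN layout (timestamp, change count, server ID, modifier), and the helper names `parse_context_csns` and `compare` are hypothetical.

```python
from datetime import datetime

def parse_context_csns(values):
    """Map server ID -> timestamp for a list of contextCSN values."""
    by_sid = {}
    for value in values:
        stamp, _count, sid, _mod = value.split("#")
        by_sid[sid] = datetime.strptime(stamp, "%Y%m%d%H%M%S.%fZ")
    return by_sid

def compare(a, b):
    """Return {sid: seconds by which a is ahead of b} for every SID that differs.

    The value is None when a SID is present on only one side.
    """
    diffs = {}
    for sid in sorted(set(a) | set(b)):
        if a.get(sid) == b.get(sid):
            continue
        if sid in a and sid in b:
            diffs[sid] = (a[sid] - b[sid]).total_seconds()
        else:
            diffs[sid] = None
    return diffs

server1 = parse_context_csns([
    "20150421123523.281736Z#000000#001#000000",
    "20150421123417.889477Z#000000#002#000000",
])
server2 = parse_context_csns([
    "20150421115324.003502Z#000000#001#000000",
    "20150421123417.889477Z#000000#002#000000",
])

for sid, lag in compare(server1, server2).items():
    print(f"SID {sid}: server1 is ahead of server2 by {lag} seconds")
```

With the values above, only SID 001 is reported, and server2's value for it is about 42 minutes older than server1's.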

Thanks!

Regards,

Nicolás