Full_Name: Francis Swasey
Version: 2.4.44
OS: Red Hat Enterprise Linux Server release 7.2 (Maipo)
URL: http://www.uvm.edu/~fcs/openldap_its/syncprov_files.tar.gz
Submission from: (NULL) (132.198.104.198)

Once an hour, the primary server logs a syncprov_sendresp entry that does not include a csn. When that happens, the replica misses that there were changes, does not update, and therefore falls behind.

The URL provides files for the primary and replica:

replica/slapd_db.ldif -- starting database
replica/slapd.conf -- replica's slapd configuration file (you'll need to create your own SSL certs)
replica/syncrepl_audit.sh -- runs on the replica (edit the email address and the path to check_syncrepl)
replica/check_syncrepl -- used by Nagios to report whether the csn is in sync
primary/slapd_db.ldif -- starting database (almost exactly like the replica's version)
primary/slapd.conf -- primary's slapd configuration file (you'll need to create your own SSL certs)
primary/syncprov_audit.sh -- greps the system log (edit the path) looking for the syncprov error
primary/makechanges.sh -- edit to change the name of the LDAP server, then run this to make constant changes

The bad log entries that coincide with check_syncrepl finding the replica has fallen behind look like:

Apr 4 13:18:27 ldap7p slapd[27800]: syncprov_sendresp: cookie=rid=100
Apr 4 13:18:27 ldap7p slapd[27800]: syncprov_sendresp: cookie=rid=100
Apr 4 14:18:53 ldap7p slapd[27800]: syncprov_sendresp: cookie=rid=100
Apr 4 14:18:53 ldap7p slapd[27800]: syncprov_sendresp: cookie=rid=100

They appear once an hour.
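The kind of check syncprov_audit.sh performs can be sketched as follows (the actual script lives in the tarball above and may differ; the first log line is copied from the report, the second "healthy" line with a csn= component is invented for contrast):

```shell
# Flag syncprov_sendresp log entries whose cookie carries no csn= component.
# A healthy broadcast (second line, made up for illustration) includes csn=.
log='Apr 4 13:18:27 ldap7p slapd[27800]: syncprov_sendresp: cookie=rid=100
Apr 4 13:20:00 ldap7p slapd[27800]: syncprov_sendresp: cookie=rid=100,csn=20160404132000.000000Z#000000#000#000000'
bad=$(printf '%s\n' "$log" | grep 'syncprov_sendresp' | grep -v 'csn=')
printf '%s\n' "$bad"
```

Only the entries missing the csn= component are printed, which is exactly what coincides with the replica falling behind.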
Today, I have built and tested 2.4.44, 2.4.43, and 2.4.42 without the normal modifications that I make to the code base. I have found that both 2.4.44 and 2.4.43 exhibit this failure within an hour of starting the primary server, and once an hour after it first happens. The 2.4.42 code has been running for over two hours now (just as in my production environment) and has not failed at all.

The options I am using with configure are:

export CFLAGS="-DNDEBUG -D_REENTRANT -fPIC"
export LIBS="-lpthread"
./configure \
    -q \
    --enable-dynamic \
    --enable-static \
    --with-threads=posix \
    --with-tls=openssl \
    --with-cyrus-sasl \
    --enable-wrappers \
    --enable-passwd \
    --enable-cleartext \
    --enable-crypt \
    --enable-spasswd \
    --enable-lmpasswd \
    --enable-modules \
    --enable-ipv6 \
    \
    --enable-slapd \
    --enable-backends=mod \
    --disable-ndb \
    --disable-sql \
    --disable-perl \
    --disable-shell \
    --enable-overlays=mod

— Frank Swasey
Systems Architecture & Administration
Second attempt - something about my first attempt seems to have tripped the spam filter at gauss.

— Frank Swasey
Systems Architecture & Administration
--On Tuesday, April 05, 2016 11:35 PM +0000 Frank.Swasey@uvm.edu wrote:

It's ok, it is trivial to run through a MIME decoder. So the regression was added in 2.4.43.

--Quanah

--
Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra :: the leader in open source messaging and collaboration
A division of Synacor, Inc
Yesterday, I built and tested 2.4.44, 2.4.43, and 2.4.42 without the normal modifications that I make to the code base. I have found that both 2.4.44 and 2.4.43 exhibit this failure within an hour of starting the primary server, and once an hour after it first happens. The 2.4.42 code has been running for over fifteen hours now and has not failed at all.

The script I used to build these is available here:

http://www.uvm.edu/~fcs/openldap_its/build-openldap.sh

— Frank Swasey
Systems Architecture & Administration
I am building a version of 2.4.43 with the ITS#8281 patch removed. Will see if that resolves the issue.

— Frank Swasey
Systems Architecture & Administration

-----Original Message-----
From: openldap-bugs <openldap-bugs-bounces@openldap.org> on behalf of "Frank.Swasey@uvm.edu" <Frank.Swasey@uvm.edu>
Date: Wednesday, April 6, 2016 at 7:45 AM
To: "openldap-its@openldap.org" <openldap-its@openldap.org>
Subject: Re: (ITS#8396) syncprov hourly fails to answer syncrepl
I have backed out the ITS#8281 patch from my local builds of 2.4.43 and 2.4.44. They have both run without issue for two hours each. Therefore, I’m pretty sure that the cause of the syncprov failure is that patch. It may work well in an MMR environment, but it is failing in a single master setup. - Frank
--On Wednesday, April 06, 2016 7:58 PM +0000 Frank.Swasey@uvm.edu wrote:

> I have backed out the ITS#8281 patch from my local builds of 2.4.43 and
> 2.4.44. They have both run without issue for two hours each. Therefore,
> I'm pretty sure that the cause of the syncprov failure is that patch.
> It may work well in an MMR environment, but it is failing in a single
> master setup.
>
> - Frank

Hi Frank,

I'm going to see if I can reproduce this in my setup with your configurations. I did notice that you don't have the syncprov overlay loaded on your replica, which is generally recommended.

--Quanah
On 4/6/16, 4:01 PM, "openldap-bugs on behalf of quanah@zimbra.com" <openldap-bugs-bounces@openldap.org on behalf of quanah@zimbra.com> wrote:

> Hi Frank,
>
> I'm going to see if I can reproduce this in my setup with your
> configurations. I did notice that you don't have the syncprov overlay
> loaded on your replica, which is generally recommended.

Quanah,

I've never heard that recommendation. Nor do I find it in the slapo-syncprov man page or in the Administrator's Guide discussion of the syncprov overlay and delta-syncrepl configuration. Please provide a reference to that recommendation.

- Frank
--On Thursday, April 07, 2016 12:04 AM +0000 Frank Swasey <Frank.Swasey@uvm.edu> wrote:

> On 4/6/16, 4:01 PM, "openldap-bugs on behalf of quanah@zimbra.com"
> <openldap-bugs-bounces@openldap.org on behalf of quanah@zimbra.com> wrote:
>
>> Hi Frank,
>>
>> I'm going to see if I can reproduce this in my setup with your
>> configurations. I did notice that you don't have the syncprov overlay
>> loaded on your replica, which is generally recommended.
>
> Quanah,
>
> I've never heard that recommendation. Nor do I find it in the
> slapo-syncprov man page or the Administrator's Guide discussion of the
> syncprov overlay and delta-syncrepl configuration. Please provide a
> reference to that recommendation.

You just read it. ;)

--Quanah
--On Thursday, April 07, 2016 12:05 AM +0000 Frank.Swasey@uvm.edu wrote:

Hi Frank,

Actually, I see this affects all forms of replication. On my ldap servers it is much more frequent than hourly:

Apr 6 20:57:51 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 20:58:33 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:00:04 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:00:28 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:01:24 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:03:29 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:04:32 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:06:02 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:07:02 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:07:51 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:09:05 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:10:05 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:11:03 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:11:26 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:13:03 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:17:04 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:18:27 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:21:04 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:23:55 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:25:00 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:25:36 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:26:57 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:27:55 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:29:05 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:30:38 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:31:04 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:31:10 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:32:20 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:33:27 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:35:16 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:37:07 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:38:21 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:39:05 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:40:16 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:41:11 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:42:04 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:42:59 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:43:32 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:44:31 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:45:50 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:48:01 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:48:36 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:51:00 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:51:04 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:53:05 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:54:28 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:55:21 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:56:04 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:56:47 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:58:04 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 21:59:05 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 22:01:01 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 22:01:33 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 22:02:30 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003
Apr 6 22:03:41 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003

--Quanah
--On Thursday, April 07, 2016 4:05 AM +0000 quanah@zimbra.com wrote:

> --On Thursday, April 07, 2016 12:05 AM +0000 Frank.Swasey@uvm.edu wrote:
>
> Hi Frank,
>
> Actually, I see this affects all forms of replication. On my ldap
> servers it is much more frequent than hourly:
>
> Apr 6 21:56:47 ldap01 slapd[34514]: syncprov_sendresp: to=004,
> cookie=rid=102,sid=003

Interestingly, for MMR, this does not result in any data loss, as the changes are replicated w/o problem:

MMR node taking writes:

Apr 6 21:56:47 ldap01 slapd[34514]: slap_queue_csn: queueing 0x10bfcc80 20160407025647.107497Z#000000#003#000000
Apr 6 21:56:47 ldap01 slapd[34514]: slap_graduate_commit_csn: removing 0x10bfcc80 20160407025647.107497Z#000000#003#000000
Apr 6 21:56:47 ldap01 slapd[34514]: conn=2909 op=286509 RESULT tag=103 err=0 etime=0.000935 text=
Apr 6 21:56:47 ldap01 slapd[34514]: syncprov_sendresp: to=004, cookie=rid=102,sid=003

MMR node receiving writes:

Apr 6 21:56:47 ldap02 slapd[61570]: do_syncrep2: rid=102 cookie=rid=102,sid=003
Apr 6 21:56:47 ldap02 slapd[61570]: slap_queue_csn: queueing 0x80ee240 20160407025647.107497Z#000000#003#000000
Apr 6 21:56:47 ldap02 slapd[61570]: slap_queue_csn: queueing 0x420c2c0 20160407025647.107497Z#000000#003#000000
Apr 6 21:56:47 ldap02 slapd[61570]: syncprov_matchops: skipping original sid 003
Apr 6 21:56:47 ldap02 slapd[61570]: slap_graduate_commit_csn: removing 0x420c2c0 20160407025647.107497Z#000000#003#000000
Apr 6 21:56:47 ldap02 slapd[61570]: slap_graduate_commit_csn: removing 0x80ee240 20160407025647.107497Z#000000#003#000000

So the change is still correctly replicated to the replicating MMR node.

--Quanah
--On Thursday, April 07, 2016 4:22 AM +0000 quanah@zimbra.com wrote:

> So the change is still correctly replicated to the replicating MMR node.

I see the same behavior with the example configuration that was provided as well. While the CSN bit may not be logged, the change is replicated as it should be.

Provider:

Apr 6 22:59:28 zre-ldap002 slapd[22212]: slap_queue_csn: queueing 0x4ab3600 20160407035928.515652Z#000000#000#000000
Apr 6 22:59:28 zre-ldap002 slapd[22212]: slap_graduate_commit_csn: removing 0x4ab3600 20160407035928.515652Z#000000#000#000000
Apr 6 22:59:28 zre-ldap002 slapd[22212]: syncprov_sendresp: cookie=rid=100
Apr 6 22:59:28 zre-ldap002 slapd[22212]: conn=1003 op=1 RESULT tag=103 err=0 etime=0.084274 text=

Replica:

Apr 6 22:59:28 zre-ldap003 slapd[7947]: do_syncrep2: rid=100 cookie=rid=100
Apr 6 22:59:28 zre-ldap003 slapd[7947]: slap_queue_csn: queueing 0x37403c0 20160407035928.515652Z#000000#000#000000
Apr 6 22:59:28 zre-ldap003 slapd[7947]: slap_graduate_commit_csn: removing 0x37403c0 20160407035928.515652Z#000000#000#000000
Apr 6 22:59:28 zre-ldap003 slapd[7947]: syncrepl_message_to_op: rid=100 be_modify uid=fcs,ou=People,dc=uvm,dc=edu (0)

So while the lack of the ,csn bit is annoying... I see no actual data loss, etc.

--Quanah
On 4/7/16, 12:04 AM, "openldap-bugs on behalf of quanah@zimbra.com" <openldap-bugs-bounces@openldap.org on behalf of quanah@zimbra.com> wrote:

> --On Thursday, April 07, 2016 4:22 AM +0000 quanah@zimbra.com wrote:
>
>> So the change is still correctly replicated to the replicating MMR node.
>
> I see the same behavior with the example configuration that was provided
> as well. While the CSN bit may not be logged, the change is replicated
> as it should be.

<snip>

> So while the lack of the ,csn bit is annoying... I see no actual data
> loss, etc.

Quanah,

I'll go peer deeper into the logs on the consumer side.

If I find the same here, that indicates the provider is saying "there are changes" but not sending the CSN, and syncrepl is retrieving and applying the changes but not updating the CSN in the DSA of the consumer. That suggests either that syncrepl needs a patch, or that I should not be trusting the CSN value in the consumer's DSA in the check_syncrepl script (although I can't think of another quick way to assess whether the provider and consumer are in sync, off the top of my head). Eventually, syncprov does send the CSN and the consumer's CSN gets updated, so things report as having caught up.

- Frank
Quanah,

I do see in the consumer log that it is retrieving and applying the updates. The do_syncrep2 log entries also show the missing ,csn=... condition. Therefore, yes, the actual replication is being performed without delay - the delay, as evidenced by comparing the CSN in the DSA of the provider and the consumer, is in updating the CSN attribute of the DSA.

Clearly, syncrepl sees the CSNs coming across, so it should be able to handle the lack of a CSN in the response from the provider and still get the DSA's CSN updated. I'll see if I can figure out a patch that will work.

(Sending from my personal address because I'm tired of gauss marking everything I send as probably spam - I'm not really that suspicious a character.)

- Frank

On Thu, Apr 7, 2016 at 8:20 AM, Frank Swasey <Frank.Swasey@uvm.edu> wrote:

> On 4/7/16, 12:04 AM, "openldap-bugs on behalf of quanah@zimbra.com"
> <openldap-bugs-bounces@openldap.org on behalf of quanah@zimbra.com> wrote:
>
>> --On Thursday, April 07, 2016 4:22 AM +0000 quanah@zimbra.com wrote:
>>
>>> So the change is still correctly replicated to the replicating MMR node.
>>
>> I see the same behavior with the example configuration that was provided
>> as well. While the CSN bit may not be logged, the change is replicated
>> as it should be.
>
> <snip>
>
>> So while the lack of the ,csn bit is annoying... I see no actual data
>> loss, etc.
>
> Quanah,
>
> I'll go peer deeper into the logs on the consumer side.
>
> If I find the same here, that indicates the provider is saying "there
> are changes" but not sending the CSN and syncrepl is retrieving and
> applying the changes, but not updating the CSN in the DSA of the
> consumer. Which suggests that syncrepl needs a patch or that I should
> not be trusting the CSN value in the consumer's DSA in the
> check_syncrepl script (although, I can't think of another quick way to
> assess whether the provider and consumer are in sync - off the top of
> my head). Eventually, syncprov does send the CSN and the consumer's CSN
> gets updated, so things report as having caught up.
>
> - Frank

--
I am not young enough to know everything. - Oscar Wilde (1854-1900)
Quanah,

Before I put any time into putting together a patch for syncrepl.c, I listened to what you said was generally recommended and added the syncprov overlay to the consumer, even though it will never be a provider.

I found that with the syncprov overlay loaded after the syncrepl directive in the config file, the DSA's CSN will still get out of sync. However, with the syncprov overlay before the syncrepl directive, I have been running for the last three hours without any reports of the DSA's CSN being wrong on the replica (with one exception that was a check in the middle of an update and did not coincide with the missing csn value in the logs).

Therefore, I'm going to develop a doc patch to fix the guide and man pages rather than a code patch.

--
Frank Swasey | http://www.uvm.edu/~fcs
Sr Systems Administrator | Always remember: You are UNIQUE,
University of Vermont | just like everyone else.
"I am not young enough to know everything." - Oscar Wilde (1854-1900)
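The ordering that worked can be sketched as a slapd.conf fragment like the following (a hypothetical, incomplete sketch: the suffix, provider URL, and rid are invented examples, bind parameters are omitted, and only the relative ordering of the directives comes from the observation above):

```
# Consumer database section (sketch)
database    mdb
suffix      "dc=example,dc=edu"

# Load syncprov BEFORE the syncrepl directive so the replica maintains
# its own contextCSN instead of relying on the provider's broadcasts.
overlay     syncprov

syncrepl    rid=100
            provider=ldap://primary.example.edu
            type=refreshAndPersist
            searchbase="dc=example,dc=edu"
```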
--On Thursday, April 07, 2016 9:22 PM +0000 Frank.Swasey@uvm.edu wrote:

> Quanah,
>
> Before I put any time into putting together a patch for syncrepl.c, I
> listened to what you said was generally recommended. I added the
> syncprov overlay to the consumer, even though it will never be a
> provider. I found that with the syncprov overlay after the syncrepl
> overlay in the config file, the DSA's CSN will still get out of sync.
> However, with the syncprov overlay before the syncrepl overlay - I have
> been running for the last three hours without any reports of the DSA's
> CSN being wrong on the replica (with one exception that was a check in
> the middle of an update and did not coincide with the missing csn value
> in the logs).
>
> Therefore, I'm going to develop a doc patch to fix the guide and man
> pages rather than a code patch.

Hi Frank,

Great to hear!!! I knew there was a reason behind doing so... I've done so for years, but the exact reasons behind it have drifted into the fog of time. I did recall it was related to CSNs though, so thought that this might be it.

--Quanah
--On Thursday, April 07, 2016 2:53 PM -0700 Quanah Gibson-Mount <quanah@zimbra.com> wrote:

> Hi Frank,
>
> Great to hear!!! I knew there was a reason behind doing so... I've done
> so for years, but the exact reasons behind it have drifted into the fog
> of time. I did recall it was related to CSNs though, so thought that
> this might be it.

Hi Frank,

This is being caused by the syncprov checkpoint, which is why you see it every hour:

syncprov-checkpoint 1000 60

I had one happen in between once... I think it was because it had hit 1000 operations, but I'm not 100% sure on that. It would imply that the operations counter is not being reset when the time-based checkpoint fires.

--Quanah
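Since the checkpoint interval drives how often the missing-CSN sendresp appears, a shorter interval makes the problem faster to observe. A provider-side sketch (the values here are illustrative; the setting quoted above was `syncprov-checkpoint 1000 60`, i.e. every 1000 write operations or 60 minutes):

```
# Provider database section (sketch): checkpoint the contextCSN after
# 1000 write operations or 5 minutes, so the missing-CSN
# syncprov_sendresp can be observed within minutes rather than hourly.
overlay syncprov
syncprov-checkpoint 1000 5
```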
--On Thursday, April 07, 2016 11:57 PM +0000 quanah@zimbra.com wrote:

Full summary: the syncprov checkpoint operation causes the CSN to be lost for the first write operation that occurs after the checkpoint. It is important to note that no data is lost; all changes replicate as they should. However, the replica CSN is not updated in this scenario, making it appear that the replica is out of sync with the master. Adding the syncprov overlay to a replica database works around this issue by forcing the replica to track its internal CSNs, rather than relying on broadcasts from the master.

It is trivial to reproduce this issue by setting a short checkpoint interval with the syncprov-checkpoint parameter.

Example of the problem: we have a script modifying the userPassword attribute of an entry every 45 seconds, and syncprov-checkpoint set to happen every 5 minutes.

From the log we can see:

Apr 7 18:00:38 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr 7 18:05:53 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr 7 18:11:09 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr 7 18:16:25 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr 7 18:17:55 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr 7 18:21:41 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr 7 18:26:57 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr 7 18:32:13 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100

Stopping the script after the 18:32:13 operation and examining the CSN values on each server, we see the following.

master:

[zimbra@zre-ldap003 scripts]$ ldapsearch -x -LLL -H ldap://zre-ldap002.eng.zimbra.com -s base -b "dc=uvm,dc=edu" contextCSN
dn: dc=uvm,dc=edu
contextCSN: 20160407233212.979013Z#000000#000#000000

replica:

[zimbra@zre-ldap003 scripts]$ ldapsearch -x -LLL -H ldapi:// -s base -b "dc=uvm,dc=edu" contextCSN
dn: dc=uvm,dc=edu
contextCSN: 20160407233127.886702Z#000000#000#000000

Note that the CSNs are 45 seconds apart - the interval at which our writes occur. So the write op /prior/ to the checkpoint is the CSN value that is left on the replica in this case, as it ignores the empty-CSN syncprov send response (thus not updating its CSN).

While it is of course best practice to run the syncprov overlay on the replica to enforce internal CSN cohesion, it still should not be required, and this is clearly a bug that can cause admins to incorrectly believe that their servers are having replication issues.

--Quanah
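The comparison a check like check_syncrepl performs can be sketched like this (the CSN values are copied from the searches above; the real script is in the referenced tarball and may differ):

```shell
# contextCSN values begin with a generalizedTime stamp, so plain string
# equality is enough to tell whether the replica has caught up.
master='20160407233212.979013Z#000000#000#000000'
replica='20160407233127.886702Z#000000#000#000000'
if [ "$master" = "$replica" ]; then
  status="IN SYNC"
else
  status="OUT OF SYNC"
fi
echo "$status: master=$master replica=$replica"
```

With the values above, this reports OUT OF SYNC even though, as noted, the data itself replicated correctly; only the replica's contextCSN is stale.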
Thank you, Quanah, for the clear description.

I'm not sure if the problem is in syncprov or accesslog (the two overlays touched by the patch for ITS#8281). I do believe that the patch for ITS#8281 is the cause, since the problem goes away if that patch is removed.

How can I help at this point?

-- Frank Swasey
--On Friday, April 08, 2016 11:17 AM -0400 Frank Swasey <Frank.Swasey@uvm.edu> wrote:

> Thank you, Quanah, for the clear description.
>
> I'm not sure if the problem is in syncprov or accesslog (the two overlays
> touched by the patch for ITS#8281). I do believe that the patch for
> ITS#8281 is the cause, since the problem goes away if that patch is
> removed.
>
> How can I help at this point?

Hi Frank,

I would continue with the documentation update noting that it is recommended to run the syncprov overlay on replicas, so that they manage their CSN status based off updates received, rather than syncprov broadcasts. It's really the correct way to configure a replica, regardless of this bug. ;)

I definitely wouldn't revert anything related to ITS#8281, given that it was fixing some serious issues, whereas this issue, while annoying, doesn't actually cause harm, and has a workaround that lines up with best practices anyway.

--Quanah
Today at 11:03am, quanah@zimbra.com wrote:

> I would continue with the documentation update noting that it is
> recommended to run the syncprov overlay on replicas, so that they manage
> their CSN status based off updates received, rather than syncprov
> broadcasts. It's really the correct way to configure a replica,
> regardless of this bug. ;)

Yeah -- so you keep saying ;) ... Patch attached.

> I definitely wouldn't revert anything related to ITS#8281, given that it
> was fixing some serious issues, whereas this issue, while annoying,
> doesn't actually cause harm, and has a workaround that lines up with
> best practices anyway.

No, I was not saying that ITS#8281 should be reverted - just that I have built without the 8281 patch and the "problem" does not happen in that case.

Since I think we both agree that, strictly speaking, syncprov should not be required in a consumer-only configuration, syncrepl probably needs to be updated to calculate the CSN if it is not present in the response (that's really all that is being added by having syncprov between syncrepl and the database - right?). I'm not familiar with the code, so I don't know how difficult that would be.

-- Frank Swasey
--On Friday, April 08, 2016 1:12 PM -0400 Frank Swasey <Frank.Swasey@uvm.edu> wrote:

> Today at 11:03am, quanah@zimbra.com wrote:
>
>> I would continue with the documentation update noting that it is
>> recommended to run the syncprov overlay on replicas, so that they manage
>> their CSN status based off updates received, rather than syncprov
>> broadcasts. It's really the correct way to configure a replica,
>> regardless of this bug. ;)
>
> Yeah -- so you keep saying ;) ... Patch attached.

Thanks! As per <http://www.openldap.org/devel/contributing.html>, can you regenerate the patch as a git formatted patch, so it is properly attributed? Also, we will need the IPR statement as noted in <http://www.openldap.org/devel/contributing.html#notice>.

>> I definitely wouldn't revert anything related to ITS#8281, given that it
>> was fixing some serious issues, whereas this issue, while annoying,
>> doesn't actually cause harm, and has a workaround that lines up with
>> best practices anyway.
>
> No, I was not saying that 8281 should be reverted - just that I have
> built without the 8281 patch and the "problem" does not happen in that
> case.
>
> Since I think we both agree that, strictly speaking, syncprov should not
> be required in a consumer-only configuration, syncrepl probably needs to
> be updated to calculate the CSN if it is not present in the response
> (that's really all that is being added by having syncprov between
> syncrepl and the database - right?). I'm not familiar with the code, so
> I don't know how difficult that would be.

Right, Howard's aware of the issue and will be looking into what's necessary to fix it. ;)

--Quanah
git format-patch version attached.

-- Frank Swasey
--On Friday, April 08, 2016 6:02 PM +0000 Frank.Swasey@uvm.edu wrote:

> git format-patch version attached.

Perfect, thank you!

--Quanah
has patch
has IPR
changed notes
moved from Incoming to Documentation
--On Monday, April 04, 2016 7:57 PM +0000 Frank.Swasey@uvm.edu wrote:

> The bad log entries that coincide with check_syncrepl finding the replica
> has fallen behind are like:
>
> Apr 4 13:18:27 ldap7p slapd[27800]: syncprov_sendresp: cookie=rid=100
> Apr 4 13:18:27 ldap7p slapd[27800]: syncprov_sendresp: cookie=rid=100
> Apr 4 14:18:53 ldap7p slapd[27800]: syncprov_sendresp: cookie=rid=100
> Apr 4 14:18:53 ldap7p slapd[27800]: syncprov_sendresp: cookie=rid=100
>
> They appear once an hour.

Hi Frank,

This looks like ITS#8444, which now has a fix checked into OpenLDAP master. You probably want to apply it to your deployment.

--Quanah

--
Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
<http://www.symas.com>
*** This issue has been marked as a duplicate of issue 8444 ***