Issue 8396 - syncprov hourly fails to answer syncrepl
Summary: syncprov hourly fails to answer syncrepl
Status: VERIFIED DUPLICATE of issue 8444
Alias: None
Product: OpenLDAP
Classification: Unclassified
Component: documentation
Version: 2.4.44
Hardware: All
OS: All
Importance: --- normal
Target Milestone: ---
Assignee: OpenLDAP project
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-04-04 18:57 UTC by frank.swasey@uvm.edu
Modified: 2021-04-01 16:27 UTC

See Also:


Attachments
0001-Documentation-change-for-ITS-8396.patch (2.70 KB, text/plain), 2016-04-08 17:01 UTC, frank.swasey@uvm.edu
ITS8396.patch (1.81 KB, text/plain), 2016-04-08 16:12 UTC, frank.swasey@uvm.edu

Description frank.swasey@uvm.edu 2016-04-04 18:57:35 UTC
Full_Name: Francis Swasey
Version: 2.4.44
OS: Red Hat Enterprise Linux Server release 7.2 (Maipo)
URL: http://www.uvm.edu/~fcs/openldap_its/syncprov_files.tar.gz
Submission from: (NULL) (132.198.104.198)


Once an hour, the primary server logs a syncprov_sendresp that does not include
a csn.  When that happens, the replica misses that there were changes and does
not update and therefore falls behind.

The URI provides files for the primary and replica.  

replica/slapd_db.ldif -- starting database
replica/slapd.conf -- replica's slapd configuration file (you'll need to create
your own ssl certs)
replica/syncrepl_audit.sh -- Runs on the replica (edit the email address and the
path to check_syncrepl)
replica/check_syncrepl -- Used by nagios to report csn being in sync or not

primary/slapd_db.ldif -- starting database (almost exactly like the replica's
version)
primary/slapd.conf -- primary's slapd configuration file (you'll need to create
your own ssl certs)
primary/syncprov_audit.sh -- greps the system log (edit the path) looking for
the syncprov error
primary/makechanges.sh -- edit to change the name of the ldap server and run
this to make constant changes
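
As an illustration only, a minimal sketch of the kind of contextCSN
comparison check_syncrepl performs (hostnames are placeholders; the actual
script is in the tarball):

    #!/bin/sh
    # compare the contextCSN of the provider and the replica
    BASE="dc=uvm,dc=edu"
    PROV=$(ldapsearch -x -LLL -H ldap://provider.example.edu -s base -b "$BASE" contextCSN | awk '/^contextCSN:/ {print $2}')
    REPL=$(ldapsearch -x -LLL -H ldap://replica.example.edu -s base -b "$BASE" contextCSN | awk '/^contextCSN:/ {print $2}')
    if [ "$PROV" = "$REPL" ]; then
        echo "OK: contextCSN in sync ($PROV)"
    else
        echo "WARNING: contextCSN mismatch (provider=$PROV replica=$REPL)"
        exit 1
    fi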

The bad log entries that coincide with check_syncrepl finding that the
replica has fallen behind look like:

Apr  4 13:18:27 ldap7p slapd[27800]: syncprov_sendresp: cookie=rid=100
Apr  4 13:18:27 ldap7p slapd[27800]: syncprov_sendresp: cookie=rid=100
Apr  4 14:18:53 ldap7p slapd[27800]: syncprov_sendresp: cookie=rid=100
Apr  4 14:18:53 ldap7p slapd[27800]: syncprov_sendresp: cookie=rid=100

They appear once an hour.
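
For context, these entries can be spotted the same way syncprov_audit.sh
does it, by grepping the system log for syncprov_sendresp lines whose
cookie carries no csn= component (the log path is a placeholder):

    # syncprov_sendresp lines that lack a csn= in the cookie are the bad ones
    grep 'syncprov_sendresp' /var/log/messages | grep -v 'csn='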
Comment 1 frank.swasey@uvm.edu 2016-04-05 22:22:53 UTC
Today, I have built and tested 2.4.44, 2.4.43, and 2.4.42 without the normal modifications that I make to the code base.  I have found that both 2.4.44 and 2.4.43 have this failure within an hour of starting the primary server and once an hour after it first happens.  The 2.4.42 code has been running for over two hours now and, just as in my production environment, has not failed at all.

The options I am using with configure are:

export CFLAGS="-DNDEBUG -D_REENTRANT -fPIC"
export LIBS="-lpthread"
./configure \
        -q \
        --enable-dynamic \
        --enable-static \
        --with-threads=posix \
        --with-tls=openssl \
        --with-cyrus-sasl \
        --enable-wrappers \
        --enable-passwd \
        --enable-cleartext \
        --enable-crypt \
        --enable-spasswd \
        --enable-lmpasswd \
        --enable-modules \
        --enable-ipv6 \
        \
        --enable-slapd \
        --enable-backends=mod \
        --disable-ndb \
        --disable-sql \
        --disable-perl \
        --disable-shell \
        --enable-overlays=mod


—
Frank Swasey
Systems Architecture & Administration





Comment 2 frank.swasey@uvm.edu 2016-04-05 22:34:42 UTC
Second attempt - something about my first attempt seems to have tripped the spam filter at gauss.

Comment 3 Quanah Gibson-Mount 2016-04-05 23:02:38 UTC
--On Tuesday, April 05, 2016 11:35 PM +0000 Frank.Swasey@uvm.edu wrote:


It's ok, it is trivial to run through a mime decoder.  So the regression 
was added in 2.4.43.

--Quanah



--

Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra ::  the leader in open source messaging and collaboration
A division of Synacor, Inc

Comment 4 frank.swasey@uvm.edu 2016-04-06 11:43:59 UTC
Yesterday, I built and tested 2.4.44, 2.4.43, and 2.4.42 without the normal modifications that I make to the code base.  

I have found that both 2.4.44 and 2.4.43 have this failure within an hour of starting the primary server and once an hour after it first happens.  The 2.4.42 code has been running for over fifteen hours now and has not failed at all.

The script I used to build these is available here:

http://www.uvm.edu/~fcs/openldap_its/build-openldap.sh

—
Frank Swasey
Systems Architecture & Administration

Comment 5 frank.swasey@uvm.edu 2016-04-06 14:08:28 UTC
I am building a version of 2.4.43 with ITS8281 removed.  Will see if that resolves the issue.

—
Frank Swasey
Systems Architecture & Administration








Comment 6 frank.swasey@uvm.edu 2016-04-06 18:57:38 UTC
I have backed out the ITS#8281 patch from my local builds of 2.4.43 and 2.4.44. They have both run without issue for two hours each.  Therefore, I’m pretty sure that the cause of the syncprov failure is that patch.  It may work well in an MMR environment, but it is failing in a single master setup.

- Frank


Comment 7 Quanah Gibson-Mount 2016-04-06 20:01:11 UTC
--On Wednesday, April 06, 2016 7:58 PM +0000 Frank.Swasey@uvm.edu wrote:

> I have backed out the ITS#8281 patch from my local builds of 2.4.43 and
> 2.4.44. They have both run without issue for two hours each.  Therefore,
> I'm pretty sure that the cause of the syncprov failure is that patch.  It
> may work well in an MMR environment, but it is failing in a single master
> setup.
>
> - Frank

Hi Frank,

I'm going to see if I can reproduce this in my setup with your 
configurations.  I did notice that you don't have the syncprov overlay 
loaded on your replica, which is generally recommended.

--Quanah


--

Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra ::  the leader in open source messaging and collaboration
A division of Synacor, Inc

Comment 8 frank.swasey@uvm.edu 2016-04-06 23:04:00 UTC
On 4/6/16, 4:01 PM, "openldap-bugs on behalf of quanah@zimbra.com" <openldap-bugs-bounces@openldap.org on behalf of quanah@zimbra.com> wrote:

>Hi Frank,
>
>I'm going to see if I can reproduce this in my setup with your 
>configurations.  I did notice that you don't have the syncprov overlay 
>loaded on your replica, which is generally recommended.


Quanah,

  I’ve never heard that recommendation.  Nor, do I find it in the slapo-syncprov man page or the Administrator’s guide discussion of the syncprov overlay and delta-syncrepl configuration.  Please provide a reference to that recommendation.

- Frank

Comment 9 Quanah Gibson-Mount 2016-04-06 23:20:56 UTC
--On Thursday, April 07, 2016 12:04 AM +0000 Frank Swasey 
<Frank.Swasey@uvm.edu> wrote:

> On 4/6/16, 4:01 PM, "openldap-bugs on behalf of quanah@zimbra.com"
> <openldap-bugs-bounces@openldap.org on behalf of quanah@zimbra.com> wrote:
>
>> Hi Frank,
>>
>> I'm going to see if I can reproduce this in my setup with your
>> configurations.  I did notice that you don't have the syncprov overlay
>> loaded on your replica, which is generally recommended.
>
> Quanah,
>
>   I've never heard that recommendation.  Nor, do I find it in the
> slapo-syncprov man page or the Administrator's guide discussion of the
> syncprov overlay and delta-syncrepl configuration.  Please provide a
> reference to that recommendation.

You just read it. ;)

--Quanah



--

Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra ::  the leader in open source messaging and collaboration
A division of Synacor, Inc

Comment 10 Quanah Gibson-Mount 2016-04-07 03:05:31 UTC
--On Thursday, April 07, 2016 12:05 AM +0000 Frank.Swasey@uvm.edu wrote:

Hi Frank,

Actually, I see this affects all forms of replication.  On my ldap servers 
it is much more frequent than hourly:

Apr  6 20:57:51 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 20:58:33 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:00:04 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:00:28 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:01:24 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:03:29 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:04:32 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:06:02 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:07:02 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:07:51 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:09:05 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:10:05 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:11:03 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:11:26 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:13:03 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:17:04 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:18:27 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:21:04 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:23:55 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:25:00 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:25:36 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:26:57 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:27:55 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:29:05 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:30:38 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:31:04 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:31:10 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:32:20 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:33:27 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:35:16 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:37:07 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:38:21 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:39:05 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:40:16 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:41:11 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:42:04 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:42:59 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:43:32 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:44:31 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:45:50 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:48:01 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:48:36 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:51:00 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:51:04 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:53:05 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:54:28 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:55:21 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:56:04 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:56:47 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:58:04 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 21:59:05 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 22:01:01 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 22:01:33 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 22:02:30 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003
Apr  6 22:03:41 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003

--Quanah



--

Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra ::  the leader in open source messaging and collaboration
A division of Synacor, Inc

Comment 11 Quanah Gibson-Mount 2016-04-07 03:22:01 UTC
--On Thursday, April 07, 2016 4:05 AM +0000 quanah@zimbra.com wrote:

> --On Thursday, April 07, 2016 12:05 AM +0000 Frank.Swasey@uvm.edu wrote:
>
> Hi Frank,
>
> Actually, I see this affects all forms of replication.  On my ldap
> servers  it is much more frequent than hourly:

> Apr  6 21:56:47 ldap01 slapd[34514]: syncprov_sendresp: to=004,
> cookie=rid=102,sid=003

Interestingly, for MMR, this does not result in any data loss, as the 
changes are replicated w/o problem:

MMR node taking writes:

Apr  6 21:56:47 ldap01 slapd[34514]: slap_queue_csn: queueing 0x10bfcc80 
20160407025647.107497Z#000000#003#000000
Apr  6 21:56:47 ldap01 slapd[34514]: slap_graduate_commit_csn: removing 
0x10bfcc80 20160407025647.107497Z#000000#003#000000
Apr  6 21:56:47 ldap01 slapd[34514]: conn=2909 op=286509 RESULT tag=103 
err=0 etime=0.000935 text=
Apr  6 21:56:47 ldap01 slapd[34514]: syncprov_sendresp: to=004, 
cookie=rid=102,sid=003

MMR node receiving writes:

Apr  6 21:56:47 ldap02 slapd[61570]: do_syncrep2: rid=102 
cookie=rid=102,sid=003
Apr  6 21:56:47 ldap02 slapd[61570]: slap_queue_csn: queueing 0x80ee240 
20160407025647.107497Z#000000#003#000000
Apr  6 21:56:47 ldap02 slapd[61570]: slap_queue_csn: queueing 0x420c2c0 
20160407025647.107497Z#000000#003#000000
Apr  6 21:56:47 ldap02 slapd[61570]: syncprov_matchops: skipping original 
sid 003
Apr  6 21:56:47 ldap02 slapd[61570]: slap_graduate_commit_csn: removing 
0x420c2c0 20160407025647.107497Z#000000#003#000000
Apr  6 21:56:47 ldap02 slapd[61570]: slap_graduate_commit_csn: removing 
0x80ee240 20160407025647.107497Z#000000#003#000000

So the change is still correctly replicated to the replicating MMR node.

--Quanah

--

Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra ::  the leader in open source messaging and collaboration
A division of Synacor, Inc

Comment 12 Quanah Gibson-Mount 2016-04-07 04:04:33 UTC
--On Thursday, April 07, 2016 4:22 AM +0000 quanah@zimbra.com wrote:

> So the change is still correctly replicated to the replicating MMR node.

I see the same behavior with the example configuration that was provided as 
well.  While the CSN bit may not be logged, the change is replicated as it 
should be.

Provider:

Apr  6 22:59:28 zre-ldap002 slapd[22212]: slap_queue_csn: queueing 
0x4ab3600 20160407035928.515652Z#000000#000#000000
Apr  6 22:59:28 zre-ldap002 slapd[22212]: slap_graduate_commit_csn: 
removing 0x4ab3600 20160407035928.515652Z#000000#000#000000
Apr  6 22:59:28 zre-ldap002 slapd[22212]: syncprov_sendresp: cookie=rid=100
Apr  6 22:59:28 zre-ldap002 slapd[22212]: conn=1003 op=1 RESULT tag=103 
err=0 etime=0.084274 text=


Replica:

Apr  6 22:59:28 zre-ldap003 slapd[7947]: do_syncrep2: rid=100 cookie=rid=100
Apr  6 22:59:28 zre-ldap003 slapd[7947]: slap_queue_csn: queueing 0x37403c0 
20160407035928.515652Z#000000#000#000000
Apr  6 22:59:28 zre-ldap003 slapd[7947]: slap_graduate_commit_csn: removing 
0x37403c0 20160407035928.515652Z#000000#000#000000
Apr  6 22:59:28 zre-ldap003 slapd[7947]: syncrepl_message_to_op: rid=100 
be_modify uid=fcs,ou=People,dc=uvm,dc=edu (0)


So while the lack of the ,csn bit is annoying... I see no actual data loss, 
etc.

--Quanah



--

Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra ::  the leader in open source messaging and collaboration
A division of Synacor, Inc

Comment 13 frank.swasey@uvm.edu 2016-04-07 12:05:20 UTC
On 4/7/16, 12:04 AM, "openldap-bugs on behalf of quanah@zimbra.com" <openldap-bugs-bounces@openldap.org on behalf of quanah@zimbra.com> wrote:

>--On Thursday, April 07, 2016 4:22 AM +0000 quanah@zimbra.com wrote:
>
>> So the change is still correctly replicated to the replicating MMR node.
>
>I see the same behavior with the example configuration that was provided as 
>well.  While the CSN bit may not be logged, the change is replicated as it 
>should be.


<snip>
>
>So while the lack of the ,csn bit is annoying... I see no actual data loss, 
>etc.



Quanah,

  I’ll go peer deeper into the logs on the consumer side.  

  If I find the same here, that indicates the provider is saying “there are changes” but not sending the CSN and syncrepl is retrieving and applying the changes, but not updating the CSN in the DSA of the consumer.  Which suggests that syncrepl needs a patch or that I should not be trusting the CSN value in the consumer’s DSA in the check_syncrepl script (although, I can’t think of another quick way to assess whether the provider and consumer are in sync - off the top of my head).  Eventually, syncprov does send the CSN and the consumer’s CSN gets updated, so things report as having caught up.

- Frank
Comment 14 frank.swasey@uvm.edu 2016-04-07 12:26:20 UTC
Quanah,

  I do see in the consumer log that it is retrieving and applying the
updates.  So the do_syncrep2 log entries also show the missing ,csn=...
condition.

  Therefore, yes, there is no delay in the actual data replication being
performed - there is a delay (as evidenced by comparing the CSN in the DSA
of the provider and of the consumer) in the CSN attribute of the DSA being
updated.  Clearly, syncrepl sees the CSNs coming across, so it should be
able to handle the CSN being absent from the provider's response and still
get the DSA's CSN updated.  I'll see if I can figure out a patch that will
work.

(sending from my personal address because I'm tired of gauss marking
everything I send as probably spam - I'm not really that suspicious a
character)

- Frank

On Thu, Apr 7, 2016 at 8:20 AM, Frank Swasey <Frank.Swasey@uvm.edu> wrote:

> On 4/7/16, 12:04 AM, "openldap-bugs on behalf of quanah@zimbra.com" <
> openldap-bugs-bounces@openldap.org on behalf of quanah@zimbra.com> wrote:
>
> >--On Thursday, April 07, 2016 4:22 AM +0000 quanah@zimbra.com wrote:
> >
> >> So the change is still correctly replicated to the replicating MMR node.
> >
> >I see the same behavior with the example configuration that was provided
> as
> >well.  While the CSN bit may not be logged, the change is replicated as it
> >should be.
>
> <snip>
> >
> >So while the lack of the ,csn bit is annoying... I see no actual data
> loss,
> >etc.
>
>
> Quanah,
>
>   I’ll go peer deeper into the logs on the consumer side.
>
>   If I find the same here, that indicates the provider is saying “there
> are changes” but not sending the CSN and syncrepl is retrieving and
> applying the changes, but not updating the CSN in the DSA of the consumer.
> Which suggests that syncrepl needs a patch or that I should not be trusting
> the CSN value in the consumer’s DSA in the check_syncrepl script (although,
> I can’t think of another quick way to assess whether the provider and
> consumer are in sync - off the top of my head).  Eventually, syncprov does
> send the CSN and the consumer’s CSN gets updated, so things report as
> having caught up.
>
> - Frank
>
>


-- 
I am not young enough to know everything. - Oscar Wilde (1854-1900)
Comment 15 frank.swasey@uvm.edu 2016-04-07 20:21:44 UTC
Quanah,

   Before I put any time into putting together a patch for syncrepl.c, I 
listened to what you said was generally recommended.  I added the 
syncprov overlay to the consumer, even though it will never be a 
provider.  I found that with the syncprov overlay after the syncrepl 
overlay in the config file, the DSA's CSN will still get out of sync. 
However, with the syncprov overlay before the syncrepl overlay - I have 
been running for the last three hours without any reports of the DSA's 
CSN being wrong on the replica (with one exception that was a check in 
the middle of an update and did not coincide with the missing csn value 
in the logs).
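
   For reference, a minimal sketch of the consumer configuration just
described, in slapd.conf style (suffix, URI, and credentials are
placeholders, not the contents of the tarball's slapd.conf):

    # with overlays built as modules, a matching "moduleload syncprov"
    # line is also needed in the global section
    database        mdb
    suffix          "dc=uvm,dc=edu"
    rootdn          "cn=admin,dc=uvm,dc=edu"

    # syncprov declared ahead of the syncrepl directive, matching the
    # ordering that kept the replica's contextCSN current in this test;
    # the replica never acts as a provider, the overlay only maintains
    # its contextCSN locally
    overlay         syncprov

    syncrepl        rid=100
                    provider=ldap://provider.example.edu
                    type=refreshAndPersist
                    retry="60 +"
                    searchbase="dc=uvm,dc=edu"
                    bindmethod=simple
                    binddn="cn=replicator,dc=uvm,dc=edu"
                    credentials=secret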

   Therefore, I'm going to develop a doc patch to fix the guide and man 
pages rather than a code patch.

-- 
Frank Swasey                    | http://www.uvm.edu/~fcs
Sr Systems Administrator        | Always remember: You are UNIQUE,
University of Vermont           |    just like everyone else.
   "I am not young enough to know everything." - Oscar Wilde (1854-1900)

Comment 16 Quanah Gibson-Mount 2016-04-07 20:53:23 UTC
--On Thursday, April 07, 2016 9:22 PM +0000 Frank.Swasey@uvm.edu wrote:

> Quanah,
>
>    Before I put any time into putting together a patch for syncrepl.c, I
> listened to what you said was generally recommended.  I added the
> syncprov overlay to the consumer, even though it will never be a
> provider.  I found that with the syncprov overlay after the syncrepl
> overlay in the config file, the DSA's CSN will still get out of sync.
> However, with the syncprov overlay before the syncrepl overlay - I have
> been running for the last three hours without any reports of the DSA's
> CSN being wrong on the replica (with one exception that was a check in
> the middle of an update and did not coincide with the missing csn value
> in the logs).
>
>    Therefore, I'm going to develop a doc patch to fix the guide and man
> pages rather than a code patch.

Hi Frank,

Great to hear!!!  I knew there was a reason behind doing so... I've done so 
for years, but the exact reasons behind it have drifted into the fog of 
time.  I did recall it was related to CSNs though, so thought that this 
might be it.

--Quanah


--

Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra ::  the leader in open source messaging and collaboration
A division of Synacor, Inc

Comment 17 Quanah Gibson-Mount 2016-04-07 22:56:54 UTC
--On Thursday, April 07, 2016 2:53 PM -0700 Quanah Gibson-Mount 
<quanah@zimbra.com> wrote:

> Hi Frank,
>
> Great to hear!!!  I knew there was a reason behind doing so... I've done
> so for years, but the exact reasons behind it have drifted into the fog
> of time.  I did recall it was related to CSNs though, so thought that
> this might be it.

Hi Frank,

This is being caused by the syncprov checkpoint, which is why you see it 
every hour:

syncprov-checkpoint 1000 60

I had one happen in between once... I think it was because it had hit 1000 
operations, but I'm not 100% sure of that.  It would imply that the 
operations counter is not being reset when the time-based checkpoint fires.

--Quanah


--

Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra ::  the leader in open source messaging and collaboration
A division of Synacor, Inc

Comment 18 Quanah Gibson-Mount 2016-04-07 23:39:45 UTC
--On Thursday, April 07, 2016 11:57 PM +0000 quanah@zimbra.com wrote:


Full summary:

the syncprov checkpoint operation causes the CSN to be lost for the first 
write operation to occur after the checkpoint.  It is important to note 
that no data is lost, all changes replicate as they should.

However, the replica CSN is not updated in this scenario, making it appear 
that the replica is out of sync with the master.  Adding the syncprov 
overlay to a replica database works around this issue by forcing the 
replica to track its internal CSNs, rather than relying on broadcasts from 
the master.

It is trivial to reproduce this issue by setting a short checkpoint 
interval with the syncprov-checkpoint parameter.
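
For instance, a provider-side sketch with a short checkpoint (database
details are placeholders; syncprov-checkpoint takes <ops> <minutes>, which
is why the 1000 60 setting from the previous comment fires at most hourly):

    overlay             syncprov
    # checkpoint the contextCSN after 1000 writes or 5 minutes,
    # whichever comes first -- a short interval makes the
    # missing-CSN response easy to observe
    syncprov-checkpoint 1000 5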

Example of the problem:

We have a script modifying the userPassword attribute of an entry every 45 
seconds.  We have a syncprov-checkpoint set to happen every 5 minutes. 
From the log we can see:

Apr  7 18:00:38 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr  7 18:05:53 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr  7 18:11:09 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr  7 18:16:25 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr  7 18:17:55 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr  7 18:21:41 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr  7 18:26:57 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr  7 18:32:13 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100

Stopping the script after the 18:32:13 operation, and examining the CSN 
values on each server, we see the following.

master:
[zimbra@zre-ldap003 scripts]$ ldapsearch -x -LLL -H 
ldap://zre-ldap002.eng.zimbra.com -s base -b "dc=uvm,dc=edu" contextCSN
dn: dc=uvm,dc=edu
contextCSN: 20160407233212.979013Z#000000#000#000000

replica:
[zimbra@zre-ldap003 scripts]$ ldapsearch -x -LLL -H ldapi:// -s base -b 
"dc=uvm,dc=edu" contextCSN
dn: dc=uvm,dc=edu
contextCSN: 20160407233127.886702Z#000000#000#000000

Note that the CSNs are 45 seconds apart -- the interval at which our 
writes occur.  So the write op /prior/ to the checkpoint is the CSN value 
that is left on the replica in this case, as the replica ignores the 
syncprov response that arrives without a CSN (and thus does not update its 
own).
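
For completeness, a minimal sketch of the kind of write-generating loop
used here (server, bind DN, and credentials are placeholders; the actual
makechanges.sh is in the reporter's tarball):

    # modify one entry every 45 seconds to generate replication traffic
    while true; do
        printf 'dn: uid=fcs,ou=People,dc=uvm,dc=edu\nchangetype: modify\nreplace: userPassword\nuserPassword: test-%s\n' "$(date +%s)" \
            | ldapmodify -x -H ldap://provider.example.edu \
                  -D "cn=admin,dc=uvm,dc=edu" -w secret
        sleep 45
    done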

While it is of course best practice to run the syncprov overlay on the 
replica to enforce internal CSN cohesion, it still should not be required, 
and this is clearly a bug that can cause admins to incorrectly believe that 
their servers are having replication issues.

--Quanah


--

Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra ::  the leader in open source messaging and collaboration
A division of Synacor, Inc

Comment 19 frank.swasey@uvm.edu 2016-04-08 14:17:44 UTC
Thank you, Quanah for the clear description.

I'm not sure if the problem is in syncprov or accesslog (the two 
overlays touched by the patch for ITS 8281).  I do believe that the 
patch for ITS 8281 is the cause (since the problem goes away if that 
patch is removed).

How can I help at this point?

-- 
Frank Swasey                    | http://www.uvm.edu/~fcs
Sr Systems Administrator        | Always remember: You are UNIQUE,
University of Vermont           |    just like everyone else.
   "I am not young enough to know everything." - Oscar Wilde (1854-1900)

Comment 20 Quanah Gibson-Mount 2016-04-08 15:03:13 UTC
--On Friday, April 08, 2016 11:17 AM -0400 Frank Swasey 
<Frank.Swasey@uvm.edu> wrote:

> Thank you, Quanah for the clear description.
>
> I'm not sure if the problem is in syncprov or accesslog (the two overlays
> touched by the patch for ITS 8281).  I do believe that the patch for ITS
> 8281 is the cause (since the problem goes away if that patch is removed).
>
> How can I help at this point?

Hi Frank,

I would continue with the documentation update noting that it is 
recommended to run the syncprov overlay on replicas, so that they manage 
their CSN status based off updates received, rather than syncprov 
broadcasts.  It's really the correct way to configure a replica, regardless 
of this bug. ;)

I definitely wouldn't revert anything related to ITS8281, given that it was 
fixing some serious issues, whereas this issue, while annoying, doesn't 
actually cause harm, and has a workaround that lines up with best practices 
anyway.

--Quanah

--

Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra ::  the leader in open source messaging and collaboration
A division of Synacor, Inc

Comment 21 frank.swasey@uvm.edu 2016-04-08 16:12:03 UTC
Today at 11:03am, quanah@zimbra.com wrote:

> I would continue with the documentation update noting that it is
> recommended to run the syncprov overlay on replicas, so that they manage
> their CSN status based off updates received, rather than syncprov
> broadcasts.  It's really the correct way to configure a replica, regardless
> of this bug. ;)

Yeah -- so you keep saying ;)  ...  Patch attached.

> I definitely wouldn't revert anything related to ITS8281, given that it was
> fixing some serious issues, whereas this issue, while annoying, doesn't
> actually cause harm, and has a workaround that lines up with best practices
> anyway.

No, I was not saying that 8281 should be removed - just that I have 
built without the 8281 patch and the "problem" does not happen in that 
case.

Since I think we both agree that, strictly speaking, syncprov should not 
be required on a consumer-only configuration, syncrepl probably needs to 
be updated to calculate the CSN if it is not present in the 
response (that's really all that is being added by having syncprov 
between syncrepl and the database - right?).  I'm not familiar with the 
code, so I don't know how difficult that would be.

-- 
Frank Swasey                    | http://www.uvm.edu/~fcs
Sr Systems Administrator        | Always remember: You are UNIQUE,
University of Vermont           |    just like everyone else.
   "I am not young enough to know everything." - Oscar Wilde (1854-1900)
Comment 22 Quanah Gibson-Mount 2016-04-08 16:47:23 UTC
--On Friday, April 08, 2016 1:12 PM -0400 Frank Swasey 
<Frank.Swasey@uvm.edu> wrote:

> Today at 11:03am, quanah@zimbra.com wrote:
>
>> I would continue with the documentation update noting that it is
>> recommended to run the syncprov overlay on replicas, so that they manage
>> their CSN status based off updates received, rather than syncprov
>> broadcasts.  It's really the correct way to configure a replica,
>> regardless of this bug. ;)
>
> Yeah -- so you keep saying ;)  ...  Patch attached.

Thanks!  As per <http://www.openldap.org/devel/contributing.html>, can you 
regenerate the patch as a git formatted patch, so it is properly 
attributed?  Also we will need the IPR statement as noted in 
<http://www.openldap.org/devel/contributing.html#notice>


>> I definitely wouldn't revert anything related to ITS8281, given that it
>> was fixing some serious issues, whereas this issue, while annoying,
>> doesn't actually cause harm, and has a workaround that lines up with
>> best practices anyway.
>
> No, I was not saying that 8281 should be removed - just that I have built
> without the 8281 patch and the "problem" does not happen in that case.
>
> Since I think we both agree that, strictly speaking, syncprov should not
> be required on a consumer-only configuration, syncrepl probably needs to
> be updated to calculate the CSN if it is not present in the response
> (that's really all that is being added by having syncprov between
> syncrepl and the database - right?).  I'm not familiar with the code, so
> I don't know how difficult that would be.

Right, Howard's aware of the issue and will be looking into what's 
necessary to fix it. ;)

--Quanah


--

Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra ::  the leader in open source messaging and collaboration
A division of Synacor, Inc

Comment 23 frank.swasey@uvm.edu 2016-04-08 17:01:15 UTC
git format-patch version attached.

-- 
Frank Swasey                    | http://www.uvm.edu/~fcs
Sr Systems Administrator        | Always remember: You are UNIQUE,
University of Vermont           |    just like everyone else.
   "I am not young enough to know everything." - Oscar Wilde (1854-1900)
Comment 24 Quanah Gibson-Mount 2016-04-08 17:17:48 UTC
--On Friday, April 08, 2016 6:02 PM +0000 Frank.Swasey@uvm.edu wrote:

> git format-patch version attached.

Perfect, thank you!

--Quanah


--

Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra ::  the leader in open source messaging and collaboration
A division of Synacor, Inc

Comment 25 OpenLDAP project 2017-09-11 16:34:42 UTC
has patch
has IPR
Comment 26 Quanah Gibson-Mount 2017-09-11 16:34:42 UTC
changed notes
moved from Incoming to Documentation
Comment 27 Quanah Gibson-Mount 2017-09-11 16:35:58 UTC
--On Monday, April 04, 2016 7:57 PM +0000 Frank.Swasey@uvm.edu wrote:

> The bad log entries that coincide with check_syncrepl finding the replica
> has fallen behind are like:
>
> Apr  4 13:18:27 ldap7p slapd[27800]: syncprov_sendresp: cookie=rid=100
> Apr  4 13:18:27 ldap7p slapd[27800]: syncprov_sendresp: cookie=rid=100
> Apr  4 14:18:53 ldap7p slapd[27800]: syncprov_sendresp: cookie=rid=100
> Apr  4 14:18:53 ldap7p slapd[27800]: syncprov_sendresp: cookie=rid=100
>
> They appear once an hour.

Hi Frank,

This looks like ITS#8444, which now has a fix checked into OpenLDAP master. 
You probably want to apply it to your deployment.

--Quanah


--

Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
<http://www.symas.com>


Comment 28 Quanah Gibson-Mount 2020-03-22 08:16:04 UTC

*** This issue has been marked as a duplicate of issue 8444 ***