Issue 8607 - Second master in MMR config logging contextCSN changes to accesslog
Summary: Second master in MMR config logging contextCSN changes to accesslog
Status: VERIFIED FIXED
Alias: None
Product: OpenLDAP
Classification: Unclassified
Component: slapd
Version: unspecified
Hardware: All
Importance: --- normal
Target Milestone: ---
Assignee: OpenLDAP project
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-03-02 18:46 UTC by mberg@synacor.com
Modified: 2018-03-22 19:25 UTC

See Also:


Attachments

Description mberg@synacor.com 2017-03-02 18:46:32 UTC
Full_Name: Matthew Berg
Version: 2.4.43+
OS: CentOS 6
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (38.97.88.240)


While debugging replication issues in a Zimbra installation, I discovered a
significant number of accesslog entries with an empty reqDN and specifying
values for contextCSN; e.g.

dn: reqStart=20170302180943.186147Z,cn=accesslog
objectClass: auditModify
structuralObjectClass: auditModify
reqStart: 20170302180943.186147Z
reqEnd: 20170302180943.190880Z
reqType: modify
reqSession: 100
reqAuthzID: cn=config
reqDN:
reqResult: 0
reqMod: contextCSN:= 20170227211132.933390Z#000000#000#000000
reqMod: contextCSN:= 20170302180943.182601Z#000000#001#000000
reqMod: contextCSN:= 20170302175436.282737Z#000000#002#000000
entryUUID: 27b9f34c-93bf-1036-9636-3fa0e78444ec
creatorsName: cn=config
createTimestamp: 20170302180943Z
entryCSN: 20170302180943.182601Z#000000#001#000000
modifiersName: cn=config
modifyTimestamp: 20170302180943Z
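Entries like the one above can be tallied from an LDIF dump of the accesslog with a short filter. This is an illustrative sketch of my own (the helper name is invented, and the naive line-based parsing ignores LDIF line folding and base64-encoded values):

```python
# Count accesslog entries that have an empty reqDN and modify contextCSN,
# given an LDIF dump (e.g. from an ldapsearch against cn=accesslog).
# Naive line-based parsing: no LDIF folding or base64 handling.

def count_empty_reqdn_csn_mods(ldif_text):
    count = 0
    for entry in ldif_text.strip().split("\n\n"):
        lines = entry.splitlines()
        has_empty_reqdn = any(l.rstrip() == "reqDN:" for l in lines)
        touches_csn = any(l.startswith("reqMod: contextCSN") for l in lines)
        if has_empty_reqdn and touches_csn:
            count += 1
    return count
```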

I was able to reproduce this behavior in a test lab with two freshly installed
servers and no existing data.  Given two servers, A and B, a write to A will
result in the expected audit object being written to the access log on A, but
when B applies the change it logs an additional auditModify (likewise, a
write to B will cause the same additional entry on A).

This particular environment was running a patched build of 2.4.44.  I audited
our other production environments and did not see the same behavior with prior
builds (primarily 2.4.39).

In order to rule out the possibility that this was introduced by any patches
Zimbra applied, I built a new package from stock OpenLDAP sources and was
able to reproduce with both 2.4.44 and 2.4.43, but not with 2.4.42 and earlier.
Comment 1 Quanah Gibson-Mount 2017-03-02 19:15:23 UTC
--On Thursday, March 02, 2017 6:46 PM +0000 mberg@synacor.com wrote:

> In order to rule out the possibility that this was introduced by any
> patches Zimbra applied, I built a new package from stock OpenLDAP
> sources and was able to reproduce with both 2.4.44 and 2.4.43, but not
> with 2.4.42 and earlier.

Hi Matt,

I don't see any indication of a bug here.  Zimbra uses "" (empty DN) as its 
root, which shares other responsibilities with the overall DIT.  So it's 
simply logging an update of the contextCSN to its root entry.  This change 
in behavior was made to resolve the issue I reported in ITS#8281.

Closing this report.

--Quanah


--

Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
<http://www.symas.com>


Comment 2 Quanah Gibson-Mount 2017-03-02 19:15:53 UTC
changed notes
changed state Open to Closed
Comment 3 mberg@synacor.com 2017-03-02 22:36:24 UTC
Storing the update to the contextCSN to the root object makes sense,
but propagating that update through the accesslog?  Even if it were
intended behavior, shouldn't the server receiving the update directly
generate a similar accesslog entry, rather than just the server
applying it from the accesslog?

But practically speaking, the contextCSN on a given server should
presumably reflect the most recent entryCSN that particular server has
applied, and this would seem to break that.

To illustrate I added a third contextCSN to my lab servers, with a
stale CSN on ldap01-a:

  [zimbra@ldap01-a ~]$ ldapsearch -LLL -D uid=zimbra,cn=admins,cn=zimbra -H ldapi:/// -w ******** -s base contextCSN
  dn:
  contextCSN: 20170302203943.550330Z#000000#001#000000
  contextCSN: 20170302175436.282737Z#000000#002#000000
  contextCSN: 20160302175436.282737Z#000000#003#000000
 
  [zimbra@ldap01-b ~]$ ldapsearch -LLL -D uid=zimbra,cn=admins,cn=zimbra -H ldapi:/// -w ******** -s base contextCSN
  dn:
  contextCSN: 20170302203943.550330Z#000000#001#000000
  contextCSN: 20170302175436.282737Z#000000#002#000000
  contextCSN: 20170302175436.282737Z#000000#003#000000

I wrote a new entry on ldap01-a (which is SID 001).  The change
propagated to ldap01-b, which wrote an auditModify to the accesslog with
its new contextCSNs:

  dn: reqStart=20170302214742.843377Z,cn=accesslog
  objectClass: auditModify
  structuralObjectClass: auditModify
  reqStart: 20170302214742.843377Z
  reqEnd: 20170302214742.844297Z
  reqType: modify
  reqSession: 100
  reqAuthzID: cn=config
  reqDN:
  reqResult: 0
  reqMod: contextCSN:= 20170302214737.049475Z#000000#001#000000
  reqMod: contextCSN:= 20170302175436.282737Z#000000#002#000000
  reqMod: contextCSN:= 20170302175436.282737Z#000000#003#000000
  reqEntryUUID: e1475404-93d7-1036-989f-315f78f5905f
  entryUUID: 9bceae9e-93dd-1036-8a51-85a9c542cb5f
  creatorsName: cn=config
  createTimestamp: 20170302214737Z
  entryCSN: 20170302214737.049475Z#000000#001#000000
  modifiersName: cn=config
  modifyTimestamp: 20170302214737Z

Which was subsequently read by ldap01-a:

  Mar  2 16:47:42 ldap01-a slapd[9577]: syncprov_search_response: cookie=rid=100,sid=001,csn=20170302214737.049475Z#000000#001#000000

And now it has an updated contextCSN for 003 despite receiving no update for that SID:

  [zimbra@ldap01-a ~]$ ldapsearch -LLL -D uid=zimbra,cn=admins,cn=zimbra -H ldapi:/// -w ******** -s base contextCSN
  dn:
  contextCSN: 20170302214737.049475Z#000000#001#000000
  contextCSN: 20170302175436.282737Z#000000#002#000000
  contextCSN: 20170302175436.282737Z#000000#003#000000

I'm not sure if there is any guard against writing an older contextCSN
than is currently stored, but this could theoretically rewind the
stored contextCSN on A if there is any latency in replication from B.
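For illustration, the per-SID "newest wins" rule the consumer is expected to follow (per the later reply, "older contextCSNs will be ignored") can be sketched in a few lines. This is my own illustrative Python, not slapd's implementation; it relies on CSNs comparing correctly as plain strings because the fixed-width timestamp leads:

```python
# Per-SID "newest wins" merge of contextCSN values.
# CSN format: <timestamp>#<count>#<sid>#<mod>, e.g.
#   20170302203943.550330Z#000000#001#000000

def csn_sid(csn):
    # The serverID is the third #-separated field.
    return csn.split("#")[2]

def merge_context_csns(current, incoming):
    """Return the contextCSN set after applying incoming values,
    keeping only the newest CSN per SID (older CSNs are ignored)."""
    newest = {csn_sid(c): c for c in current}
    for c in incoming:
        sid = csn_sid(c)
        # Lexicographic comparison works: fixed-width timestamp first.
        if sid not in newest or c > newest[sid]:
            newest[sid] = c
    return sorted(newest.values(), key=csn_sid)
```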

This message and any attachment may contain information that is confidential and/or proprietary. Any use, disclosure, copying, storing, or distribution of this e-mail or any attached file by anyone other than the intended recipient is strictly prohibited. If you have received this message in error, please notify the sender by reply email and delete the message and any attachments. Thank you.

Comment 4 Quanah Gibson-Mount 2017-03-03 00:49:21 UTC
--On Thursday, March 02, 2017 10:36 PM +0000 mberg@synacor.com wrote:

> I'm not sure if there is any guard against writing an older contextCSN
> than is currently stored, but this could theoretically rewind the
> stored contextCSN on A if there is any latency in replication from B.

Older contextCSNs will be ignored.  And it is accurate to then have a 
contextCSN for each server in the system.  That was part of the initial 
problem in 8281.  I still see no evidence of incorrect behavior. ;)

--Quanah





Comment 5 mberg@synacor.com 2017-03-03 21:36:25 UTC
On Thu, 2017-03-02 at 16:49 -0800, Quanah Gibson-Mount wrote:
> --On Thursday, March 02, 2017 10:36 PM +0000 mberg@synacor.com wrote:
> 
> > I'm not sure if there is any guard against writing an older
> > contextCSN
> > than is currently stored, but this could theoretically rewind the
> > stored contextCSN on A if there is any latency in replication from
> > B.
> 
> Older contextCSNs will be ignored.  And it is accurate to then have
> a contextCSN for each server in the system.  That was part of the
> initial problem in 8281.  I still see no evidence of incorrect
> behavior. ;)

I'm well aware that there will be a contextCSN for each SID in the
system.  The questionable part was the apparent overwriting of the
contextCSN on another server.  I seem to be unable to reproduce that
today, so I'm not sure if I fat-fingered something.

However there would seem to be a couple possibilities here:

  1) The contextCSN written as an accesslog entry is read by the other
     server and applied, which would be wrong since it should represent
     the most recent entryCSN seen for that SID, or

  2) The contextCSN written as an accesslog entry is read by the other
     server and ignored, which would waste IOPS and increase latency, 
     or

  3) The contextCSN written as an accesslog entry has some non-obvious
     function, and replicas reading from the active master would be at
     a disadvantage since these entries are only being written on the
     passive master.

I'm not precisely sure how this relates to 8281, since that seemed to
be fixed by not generating an initial contextCSN for newly initialized
databases and telling consumers to smeg off if they tried to connect
before a contextCSN was written (presumably by a successful REFRESH in
a newly initialized MMR provider, or an initial write in a single-
master provider).

Comment 6 Quanah Gibson-Mount 2017-03-03 21:57:25 UTC
--On Friday, March 03, 2017 9:36 PM +0000 mberg@synacor.com wrote:


> I'm not precisely sure how this relates to 8281, since that seemed to
> be fixed by not generating an initial contextCSN for newly initialized
> databases and telling consumers to smeg off if they tried to connect
> before a contextCSN was written (presumably by a successful REFRESH in
> a newly initialized MMR provider, or an initial write in a single-
> master provider).

See 
<http://www.openldap.org/devel/gitweb.cgi?p=openldap.git;a=commitdiff;h=cd8ff37629012c1676ef79de164a159da9b2ae89>, 
where the code was clearly modified to push the contextCSN's through where 
before they were not.  The entire issue around ITS#8281 was REFRESH mode in 
MMR failing if the server was stopped and restarted midstream while doing 
the refresh, with the wrong CSN value being stored.  The changes in 
ITS#8281 ensure it maintains the correct CSN value internally.

Your context for points 1-3 is invalid, as the entire issue was around 
N-way MMR (and their individual serverID contextCSNs) so non-master 
replicas are beside the point.  In addition, a correct setup via the most 
recent tools would include replicas pulling from all masters, vs a single 
master.  So there is no replica only pulling from an "active" or "passive" 
master in a correct setup.  Replicas pull from *all* masters when deployed 
correctly.  See also <https://bugzilla.zimbra.com/show_bug.cgi?id=95200>

--Quanah



Comment 7 Quanah Gibson-Mount 2017-03-03 22:38:30 UTC
--On Friday, March 03, 2017 9:57 PM +0000 quanah@symas.com wrote:

> Your context for points 1-3 are invalid as the entire issue was around
> N-way MMR (and their individual serverID contextCSNs) so non-master
> replicas are beside the point.  In addition, a correct setup via the most
> recent tools would include replicas pulling from all masters, vs a single
> master.  So there is no replica only pulling from an "active" or
> "passive"  master in a correct setup.  Replicas pull from *all* masters
> when deployed  correctly.  See also
> <https://bugzilla.zimbra.com/show_bug.cgi?id=95200>

For background on ITS#8281, the context was:

2 (or more) MMR nodes fully populated with X number of entries.

Add a new MMR node (with no database populated)

While the MMR node is in REFRESH mode, stop/start slapd on that node.  The 
interruption in the REFRESH phase results in the DB not fully replicating 
(stopped at the point at which slapd was stopped).  This is clearly 
described in the initial report, where only 625/15000 users got replicated 
due to stopping slapd.  The discussion about what to do when there's no 
contextCSN generated yet on a new system, where it is single 
master/consumer was an orthogonal but related discussion, in relation to 
the value of generating a contextCSN at startup or not.

--Quanah



Comment 8 mberg@synacor.com 2017-03-06 17:24:04 UTC
Reading back over the report on ITS#8281, the reported behavior would
have been a collision of generating a new contextCSN as well as
duplicating the producer's SID.  So its generated contextCSN would
have masked the one it never committed from the original master.

However, the point still stands that the contextCSN entry is *not*
being written to the access log by the first server (A) here.  It is
only being written by the second server (B).

So in the scenario that is supposed to be addressed by the fix (a
second MMR node with an uninitialized database in its initial refresh),
it wouldn't be receiving contextCSN updates through the access
log because they simply aren't there.

To me the logical behaviour would be that a new node would remain
without a recorded contextCSN until it receives the sync cookie along
with the refreshDone message.  Sending an explicit contextCSN update
after every single accesslog entry seems an awfully expensive and
unintuitive approach to fixing a refresh issue.

TL;DR: The objection raised in scenario #3 stands if you are
initializing a new MMR node, or in a single master + replica
scenario.

Or to rephrase, if these entries are intended and necessary, shouldn't
they be generated on the original server as well?

Comment 9 Quanah Gibson-Mount 2017-03-06 18:03:39 UTC
--On Monday, March 06, 2017 5:24 PM +0000 mberg@synacor.com wrote:

> Reading back over the report on ITS#8281, the reported behavior would
> have been a collision of generating a new contextCSN as well as
> duplicating the producer's SID.  So its generated contextCSN would
> have masked the one it never committed from the original master.

I'm not sure what you mean by duplicating the producer's SID.  Both nodes 
had unique SID values.  There were no duplicated SIDs.

--Quanah

--

Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
<http://www.symas.com>


Comment 10 mberg@synacor.com 2017-03-13 15:55:10 UTC
"After investigating the server setup, there are a few problems here.
The new server was being configured with sid=001 which was already
assigned to the original master. That's clearly going to screw things
up."


On Mon, 2017-03-06 at 10:03 -0800, Quanah Gibson-Mount wrote:
> --On Monday, March 06, 2017 5:24 PM +0000 mberg@synacor.com wrote:
> 
> > Reading back over the report on ITS-8281, the reported behavior
> > would have been a collision of generating a new contextCSN as well
> > as duplicating the producer's SID.  So its generated contextCSN
> > would have masked the one it never committed from the original
> > master.
> 
> I'm not sure what you mean by duplicating the producer's SID.  Both
> nodes 
> had unique SID values.  There were no duplicated SIDs.
> 
> --Quanah
> 
> --
> 
> Quanah Gibson-Mount
> Product Architect
> Symas Corporation
> Packaged, certified, and supported LDAP solutions powered by
> OpenLDAP:
> <http://www.symas.com>
> 
-- 
Matthew Berg <mberg@synacor.com>

Comment 11 Quanah Gibson-Mount 2017-03-13 16:15:54 UTC
--On Monday, March 13, 2017 12:55 PM -0400 Matthew Berg <mberg@synacor.com> 
wrote:

> "After investigating the server setup, there are a few problems here.
> The new server was being configured with sid=001 which was already
> assigned to the original master. That's clearly going to screw things
> up."

I'm aware of what Howard's original response was.  That has nothing to do 
with what the problem was, or the later 100% reproducible test I provided 
him that did not have that problem.

--Quanah



Comment 12 Quanah Gibson-Mount 2017-03-27 18:05:05 UTC
changed notes
changed state Closed to Open
moved from Incoming to Software Bugs
Comment 13 Quanah Gibson-Mount 2017-03-27 22:44:51 UTC
--On Monday, March 13, 2017 5:16 PM +0000 quanah@symas.com wrote:

> --On Monday, March 13, 2017 12:55 PM -0400 Matthew Berg
> <mberg@synacor.com> wrote:
>
>> "After investigating the server setup, there are a few problems here.
>> The new server was being configured with sid=001 which was already
>> assigned to the original master. That's clearly going to screw things
>> up."
>
> I'm aware of what Howard's original response was.  That has nothing to
> do with what the problem was, or the later 100% reproducible test I
> provided him that did not have that problem.

Hi Matt,

Just wanted to let you know I followed up with Howard on this, and he's 
going to look at tweaking it somewhat to ensure that it doesn't have the 
potential to cause follow on problems.  I've reopened and moved it to 
software bugs.  I'll let you know once the code change is available.

--Quanah



Comment 14 Howard Chu 2018-02-08 00:28:59 UTC
changed notes
changed state Open to Test
Comment 15 Quanah Gibson-Mount 2018-02-09 17:54:45 UTC
changed notes
changed state Test to Release
Comment 16 OpenLDAP project 2018-03-22 19:25:32 UTC
fixed in master
fixed in 2.4.46
Comment 17 Quanah Gibson-Mount 2018-03-22 19:25:32 UTC
changed notes
changed state Release to Closed