Full_Name: Mark Cairney
Version: 2.4.44 (git master)
OS: CentOS 6 x86_64
URL: http://homepages.ed.ac.uk/mcairney/delta-mmr-memberof.tar.gz
Submission from: (NULL) (129.215.201.136)

Hi,

I think I've identified an issue that occurs when you have more than 2 OpenLDAP servers set up in MMR using delta-syncrepl with push-based replication (refreshAndPersist) and the memberOf overlay. If you add/delete objects maintained by the memberOf overlay (i.e. users and groups), delta-syncrepl goes out of sync and triggers a SYNC_ID_SET refresh on the consumers. It looks like it receives the same change from all other nodes. This behaviour doesn't appear to happen when using pull-based replication with delta-syncrepl, or standard syncrepl (push- or pull-based).

To confirm this behaviour I removed the memberOf overlay from the config, dumped the database, removed any "memberOf" attributes, and put the database back in place (slapcat/slapadd on both cn=config and my main DB) on all 3 nodes. Without the memberOf overlay loaded, sync behaved normally on all 3 nodes.

In the tarball I've included a sanitised copy of my config (schema, ACLs, password hashes, indices for non-standard attributes, and operational attributes all removed), a log entry showing deletion of an object not managed by memberOf (a netgroup entry), and log entries showing what happens when a group is deleted both with and without the memberOf overlay.

The 3 servers are identical in both config and hardware spec apart from the olcSASLHost. They are Sun X4150s with 24GB RAM, 10KRPM HDDs in 2 pairs of RAID-1, and 2x4 CPU cores (Xeon E5540 @ 2.53GHz).
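For context, the topology described above can be sketched with a minimal per-node configuration along these lines. This is an illustrative sketch only, not the reporter's actual config: hostnames, server IDs, directories, and the retry policy are hypothetical, and credentials, ACLs, and indexing are omitted.

```
# Sketch only: hypothetical hostnames, IDs and paths; credentials elided.
serverID 1

# accesslog database holding the deltas for delta-syncrepl
database mdb
suffix cn=accesslog
directory /var/lib/ldap/accesslog
overlay syncprov
syncprov-nopresent TRUE
syncprov-reloadhint TRUE

# main database
database mdb
suffix "dc=example,dc=com"
directory /var/lib/ldap/data
overlay memberof
overlay syncprov
syncprov-checkpoint 100 10
overlay accesslog
logdb cn=accesslog
logops writes
logsuccess TRUE

# one syncrepl stanza per peer; refreshAndPersist = push-based
syncrepl rid=001
  provider=ldap://ldap2.example.com
  type=refreshAndPersist
  searchbase="dc=example,dc=com"
  logbase="cn=accesslog"
  logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"
  syncdata=accesslog
  retry="60 +"
  # bindmethod/binddn/credentials elided
mirrormode on
```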
I sent this to the mailing list and was advised to add it to ITS 8444, so here it is:

I recently updated to OpenLDAP 2.4.44 with the patch for ITS 8432. I'd been holding off in hopes of a new release with that patch included, and for some update on ITS 8444, but decided to go ahead and push it through during the holiday break. I installed it in my dev environment and was unable to replicate the issues reported in ITS 8444, so went ahead and rolled it into production. However, now that I'm in production, I'm seeing a different issue with the memberOf overlay: for some reason, group membership deletions are getting replicated multiple times. In the logs, I will see something like the following:

Dec 21 04:16:59 themis slapd[2607]: conn=364875 op=227806 MOD dn="uid=members,ou=group,dc=cpp,dc=edu"
Dec 21 04:16:59 themis slapd[2607]: conn=364875 op=227806 MOD attr=member

My identity management system connected and removed some members from this group. Next, there will be a number of lines like this:

Dec 21 04:17:15 themis slapd[2607]: conn=-1 op=0: memberof_value_modify DN="uid=prsloan,ou=user,dc=cpp,dc=edu" delete memberOf="uid=members,ou=group,dc=cpp,dc=edu" failed err=16
Dec 21 04:17:16 themis slapd[2607]: conn=-1 op=0: memberof_value_modify DN="uid=prsloan,ou=user,dc=cpp,dc=edu" delete memberOf="uid=members,ou=group,dc=cpp,dc=edu" failed err=16
Dec 21 04:17:17 themis slapd[2607]: conn=-1 op=0: memberof_value_modify DN="uid=prsloan,ou=user,dc=cpp,dc=edu" delete memberOf="uid=members,ou=group,dc=cpp,dc=edu" failed err=16
Dec 21 04:17:18 themis slapd[2607]: conn=-1 op=0: memberof_value_modify DN="uid=prsloan,ou=user,dc=cpp,dc=edu" delete memberOf="uid=members,ou=group,dc=cpp,dc=edu" failed err=16
Dec 21 04:17:18 themis slapd[2607]: conn=-1 op=0: memberof_value_modify DN="uid=prsloan,ou=user,dc=cpp,dc=edu" delete memberOf="uid=members,ou=group,dc=cpp,dc=edu" failed err=16
Dec 21 04:17:18 themis slapd[2607]: conn=-1 op=0: memberof_value_modify DN="uid=prsloan,ou=user,dc=c

This is the memberOf overlay complaining that it can't remove the corresponding memberOf attribute from the user. Reviewing the accesslog, we see:

dn: reqStart=20161221121659.000002Z,cn=accesslog
objectClass: auditModify
structuralObjectClass: auditModify
reqStart: 20161221121659.000002Z
reqEnd: 20161221121706.000000Z
reqType: modify
reqSession: 364875
reqAuthzID: cn=idmgmt,ou=user,ou=service,dc=cpp,dc=edu
reqDN: uid=members,ou=group,dc=cpp,dc=edu
reqResult: 0
reqMod: member:- uid=jjtringali,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=kaijulee,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=srknight,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=ppimentel,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=epdetering,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=ktran,ou=user,dc=cpp,dc=edu
[...]
reqMod: member:- uid=prsloan,ou=user,dc=cpp,dc=edu

This is the initial transaction from my client, authenticated as cn=idmgmt. This one is successful, and the overlay deletes the memberOf attribute. However, there are then *six* more instances of the same transaction:

dn: reqStart=20161221121714.000001Z,cn=accesslog
objectClass: auditModify
structuralObjectClass: auditModify
reqStart: 20161221121714.000001Z
reqEnd: 20161221121715.000000Z
reqType: modify
reqSession: 3
reqAuthzID: cn=ldaproot,dc=cpp,dc=edu
reqDN: uid=members,ou=group,dc=cpp,dc=edu
reqResult: 0
reqMod: member:- uid=jjtringali,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=kaijulee,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=srknight,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=ppimentel,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=epdetering,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=ktran,ou=user,dc=cpp,dc=edu
[...]
reqMod: member:- uid=prsloan,ou=user,dc=cpp,dc=edu

dn: reqStart=20161221121716.000001Z,cn=accesslog
objectClass: auditModify
structuralObjectClass: auditModify
reqStart: 20161221121716.000001Z
reqEnd: 20161221121716.000002Z
reqType: modify
reqSession: 1
reqAuthzID: cn=ldaproot,dc=cpp,dc=edu
reqDN: uid=members,ou=group,dc=cpp,dc=edu
reqResult: 0
reqMod: member:- uid=jjtringali,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=kaijulee,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=srknight,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=ppimentel,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=epdetering,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=ktran,ou=user,dc=cpp,dc=edu
[...]
reqMod: member:- uid=prsloan,ou=user,dc=cpp,dc=edu

dn: reqStart=20161221121716.000004Z,cn=accesslog
objectClass: auditModify
structuralObjectClass: auditModify
reqStart: 20161221121716.000004Z
reqEnd: 20161221121717.000002Z
reqType: modify
reqSession: 1
reqAuthzID: cn=ldaproot,dc=cpp,dc=edu
reqDN: uid=members,ou=group,dc=cpp,dc=edu
reqResult: 0
reqMod: member:- uid=jjtringali,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=kaijulee,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=srknight,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=ppimentel,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=epdetering,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=ktran,ou=user,dc=cpp,dc=edu
[...]
reqMod: member:- uid=prsloan,ou=user,dc=cpp,dc=edu

dn: reqStart=20161221121717.000001Z,cn=accesslog
objectClass: auditModify
structuralObjectClass: auditModify
reqStart: 20161221121717.000001Z
reqEnd: 20161221121718.000000Z
reqType: modify
reqSession: 2
reqAuthzID: cn=ldaproot,dc=cpp,dc=edu
reqDN: uid=members,ou=group,dc=cpp,dc=edu
reqResult: 0
reqMod: member:- uid=jjtringali,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=kaijulee,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=srknight,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=ppimentel,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=epdetering,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=ktran,ou=user,dc=cpp,dc=edu
[...]
reqMod: member:- uid=prsloan,ou=user,dc=cpp,dc=edu

dn: reqStart=20161221121718.000001Z,cn=accesslog
objectClass: auditModify
structuralObjectClass: auditModify
reqStart: 20161221121718.000001Z
reqEnd: 20161221121718.000002Z
reqType: modify
reqSession: 2
reqAuthzID: cn=ldaproot,dc=cpp,dc=edu
reqDN: uid=members,ou=group,dc=cpp,dc=edu
reqResult: 0
reqMod: member:- uid=jjtringali,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=kaijulee,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=srknight,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=ppimentel,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=epdetering,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=ktran,ou=user,dc=cpp,dc=edu
[...]
reqMod: member:- uid=prsloan,ou=user,dc=cpp,dc=edu

dn: reqStart=20161221121718.000003Z,cn=accesslog
objectClass: auditModify
structuralObjectClass: auditModify
reqStart: 20161221121718.000003Z
reqEnd: 20161221121718.000004Z
reqType: modify
reqSession: 2
reqAuthzID: cn=ldaproot,dc=cpp,dc=edu
reqDN: uid=members,ou=group,dc=cpp,dc=edu
reqResult: 0
reqMod: member:- uid=jjtringali,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=kaijulee,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=srknight,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=ppimentel,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=epdetering,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=ktran,ou=user,dc=cpp,dc=edu
[...]
reqMod: member:- uid=prsloan,ou=user,dc=cpp,dc=edu

These are performed by cn=ldaproot, and correspond to the six error messages from the overlay about failing to remove the attribute (which was already removed). Similarly, on a replica:

Dec 21 04:17:16 coeus slapd[2841]: conn=-1 op=0: memberof_value_modify DN="uid=prsloan,ou=user,dc=cpp,dc=edu" delete memberOf="uid=members,ou=group,dc=cpp,dc=edu" failed err=16

There is one error, because there are two copies of the transaction:

dn: reqStart=20161221121706.000001Z,cn=accesslog
objectClass: auditModify
structuralObjectClass: auditModify
reqStart: 20161221121706.000001Z
reqEnd: 20161221121716.000000Z
reqType: modify
reqSession: 1
reqAuthzID: cn=ldaproot,dc=cpp,dc=edu
reqDN: uid=members,ou=group,dc=cpp,dc=edu
reqResult: 0
reqMod: member:- uid=jjtringali,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=kaijulee,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=srknight,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=ppimentel,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=epdetering,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=ktran,ou=user,dc=cpp,dc=edu
[...]
reqMod: member:- uid=prsloan,ou=user,dc=cpp,dc=edu

dn: reqStart=20161221121714.000002Z,cn=accesslog
objectClass: auditModify
structuralObjectClass: auditModify
reqStart: 20161221121714.000002Z
reqEnd: 20161221121716.000001Z
reqType: modify
reqSession: 3
reqAuthzID: cn=ldaproot,dc=cpp,dc=edu
reqDN: uid=members,ou=group,dc=cpp,dc=edu
reqResult: 0
reqMod: member:- uid=jjtringali,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=kaijulee,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=srknight,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=ppimentel,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=epdetering,ou=user,dc=cpp,dc=edu
reqMod: member:- uid=ktran,ou=user,dc=cpp,dc=edu
[...]
reqMod: member:- uid=prsloan,ou=user,dc=cpp,dc=edu

So far, I've only seen this behavior for group membership removals. Adds don't seem to cause a problem, nor do creates/deletes of groups or users, as far as I can tell. Any thoughts on what's going on here? It's not causing any failures yet, as removing members multiple times results in the same end state, and the multiple replication seems to have a fairly low upper bound. But it would be nice to fix it :). Thanks...
moved from Incoming to Software Bugs
--On Friday, December 23, 2016 2:17 AM +0000 henson@acm.org wrote:

> I sent this to the mailing list and was advised to add it to ITS 8444,
> so here it is:

Looks like this has actually been reported before:
<http://www.openldap.org/its/index.cgi/?findid=7400>

--Quanah

--
Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
<http://www.symas.com>
changed notes
On Mon, Apr 03, 2017 at 11:15:17AM -0700, Quanah Gibson-Mount wrote:
> Looks like this has actually been reported before:
> <http://www.openldap.org/its/index.cgi/?findid=7400>

Yikes, from 2012, and never resolved 8-/.

I'm curious, though, why Howard says the work has to be distributed to all the replicas locally because it wouldn't scale otherwise. For adding or removing a user from a group, replicating the memberOf attribute on the user in addition to the member attribute on the group only doubles the replication load, which doesn't seem excessive. I suppose deleting a group with a large membership would result in updating a large number of users, but that's no different from deleting a single user from a large number of groups, which doesn't seem to cause any issues. Unless I'm misunderstanding something?

In any case, I suppose this ITS can be closed now too, with the same resolution: the memberOf overlay is no longer supported in a replicated configuration, so switch to the dynlist implementation...
Although Quanah hasn't been able to trigger this yet, the regression test he wrote in master has consistently triggered it for me on my machine, so I've started to investigate. For posterity, and in case anyone is interested, I have uploaded the testrun/ directory from a failing run (I suspect the fact that this laptop has a slow 2-core CPU helps), produced with a slightly patched slapd that also records the thread ID, since this seems, in part, to be a race of some sort. The tgz is available at:

ftp://ftp.openldap.org/incoming/its8444-regression-testrun-sync,stat.tgz

So far it looks like replica #3's threads 7f50fb7fe700 and 7f51017bb700 are both trying to apply the modification with CSN 20170605125334.856475Z#000000#001#000000, which sends it into a full refresh.

--
Ondřej Kuzník
Senior Software Engineer
Symas Corporation http://www.symas.com
Packaged, certified, and supported LDAP solutions powered by OpenLDAP
A more self-contained log of the same issue is available at:

ftp://ftp.openldap.org/incoming/its8444.log

(Line numbers below are against current master, commit 91f4d3a6b75e73bf4ea498e83e2e4cb4e7a320e0.)

There are some things that occur in all the failures I have seen so far:
- the server that received the operation (#1) sends the accesslog entry with no CSN in the cookie; then another provider (#2) picks up this message and relays it to its consumers, this time with a CSN in the cookie
- a consumer picks up these two in short succession; in the log above, processing of the one from #2 finishes first (they are being processed concurrently)

Usually, once one of them gets processed, the new CSN should be noted and the other threads should just skip it (syncrepl.c:943 and onwards). Here, having no CSN in the cookie seems to allow both to proceed as far as running syncrepl_message_to_op(), and one of them will then inevitably fail to add the entry.

I don't understand yet why server #1 sends the operations without a CSN in the cookie, nor why, especially if I reorder the overlays to set up memberof last, the race goes the other way around and the operation that fails is the one from server #2.

My take on it was that in a delta-sync environment all entries would be passed with a new CSN, and that should end up in the cookie, allowing syncrepl.c:986 to do its job.

--
Ondřej Kuzník
Senior Software Engineer
Symas Corporation http://www.symas.com
Packaged, certified, and supported LDAP solutions powered by OpenLDAP
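The dedup logic described above can be sketched as follows. This is an illustrative model, not slapd code: the function and field names are invented, and real CSN/cookie handling in syncrepl.c is considerably more involved. It shows how a cookie-borne CSN lets a consumer skip an already-applied change, and how a message whose cookie carries no CSN sails past that check.

```python
# Sketch (not OpenLDAP code): a consumer tracks the latest CSN applied per
# serverID, analogous to its syncCookie state. A relayed change carrying a
# CSN in its cookie can be recognised as a duplicate; one with no CSN in
# the cookie bypasses the check and gets applied a second time.

applied = {}    # sid -> latest CSN applied (CSNs compare correctly as strings)
apply_log = []  # operations that actually reached the "apply" stage

def consume(msg):
    csn = msg.get("cookie_csn")  # may be None, as seen from server #1
    sid = msg["sid"]
    if csn is not None and applied.get(sid, "") >= csn:
        return "skipped"         # duplicate caught via the cookie CSN
    apply_log.append(msg["op"])  # stand-in for syncrepl_message_to_op()
    if csn is not None:
        applied[sid] = csn
    return "applied"

# The same change arrives twice: via server #2 with a CSN in the cookie,
# and directly from server #1 without one (order as in the log above).
change = {"sid": 1, "op": "delta for 20170605125334.856475Z#000000#001#000000"}
r1 = consume(dict(change, cookie_csn="20170605125334.856475Z#000000#001#000000"))
r2 = consume(dict(change, cookie_csn=None))  # no CSN: the check can't catch it
r3 = consume(dict(change, cookie_csn="20170605125334.856475Z#000000#001#000000"))
```

Here `r1` and `r2` both come back "applied" (the change runs twice), while `r3` shows that a copy carrying its CSN would have been skipped.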
On Thu, Jun 08, 2017 at 06:36:02PM +0100, Ondřej Kuzník wrote:
> A more self-contained log of the same issue, available at
> ftp://ftp.openldap.org/incoming/its8444.log
>
> (line numbers below are against current master, commit
> 91f4d3a6b75e73bf4ea498e83e2e4cb4e7a320e0)
>
> There are some things that occur in all the failures I have seen so far:
> - the server that received the operation (#1) sends the accesslog entry
>   with no CSN in the cookie, then another provider (#2) picks up this
>   message and relays it to its consumers, this one with a CSN in the
>   cookie
> - a consumer picks up these two in short succession, in the log above,
>   processing of the one from #2 is finished first (they are being
>   processed concurrently)
>
> Usually, once one of them gets processed, the new CSN should be noted
> and the other threads should just skip it (syncrepl.c:943 and onwards).
> In this one, having no CSN in the cookie seems to allow both to process
> so far as to run syncrepl_message_to_op(), and one of them will then
> inevitably fail to add the entry.
>
> I don't understand yet why server #1 sends the operations without a CSN
> in the cookie and (especially if I reorder the overlays to set up
> memberof last), the race goes the other way around and the operation to
> fail is the one from server #2.
>
> My take on it was that in a delta-sync environment all entries would be
> passed with a new CSN and that should end up in the cookie, allowing
> syncrepl.c:986 to do its job.

Using the behaviour above, I have been able to trigger a desync; the script and the testrun directory from a failing run are available here:

ftp://ftp.openldap.org/incoming/its8444-desync.tgz

In srv3, looking at cn=accesslog, we can see that the increment by 2 has been applied (and logged) twice with the same entryCSN, as:

reqStart=20170718155142.000007Z,cn=accesslog
reqStart=20170718155142.000009Z,cn=accesslog

This is also seen in the log around those two DNs.

--
Ondřej Kuzník
Senior Software Engineer
Symas Corporation http://www.symas.com
Packaged, certified, and supported LDAP solutions powered by OpenLDAP
On Tue, Jul 18, 2017 at 05:27:40PM +0100, Ondřej Kuzník wrote:
> Using the behaviour above, I have been able to trigger a desync, the
> script and testrun directory from it happening are available here:
> ftp://ftp.openldap.org/incoming/its8444-desync.tgz

It's super easy to reproduce the desync when upping MMR to 6 in that script, even without memberof (no process re-nicing actually necessary).

--
Ondřej Kuzník
Senior Software Engineer
Symas Corporation http://www.symas.com
Packaged, certified, and supported LDAP solutions powered by OpenLDAP
It is caused by the cookie not containing a CSN, and a race between the syncCookie check in do_syncrep2 and syncrepl_message_to_op.

This race is probably fine with plain syncrepl, which is idempotent, but delta-sync changes get their own DN in each accesslog instance, and some can be applied twice unless we have a way of telling that we've already seen them: they need to mention the CSN.

The CSN itself gets lost on at least one occasion: when a checkpoint is triggered. I'm not 100% sure why the cookie gets eaten because of it; the op pointer is different between the syncprov_op_response that calls syncprov_checkpoint and the one that decides the CSN hasn't changed.

--
Ondřej Kuzník
Senior Software Engineer
Symas Corporation http://www.symas.com
Packaged, certified, and supported LDAP solutions powered by OpenLDAP
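The idempotency distinction drawn above can be made concrete with a small model. This is a sketch under invented names, not the slapd implementation: plain syncrepl ships the resulting entry state (replaying it converges to the same state), while delta-syncrepl ships the operation itself (replaying a value delete fails, much like the `memberof_value_modify ... failed err=16` messages, where 16 is LDAP noSuchAttribute).

```python
# Sketch: why a replayed change is harmless under plain syncrepl but not
# under delta-syncrepl. Entries are modeled as dicts of attribute -> set.

def plain_syncrepl(target, final_state):
    # plain syncrepl transmits the whole resulting entry: state replacement
    # is idempotent, so applying it twice is harmless
    target.clear()
    target.update(final_state)

def delta_syncrepl(target, mod):
    # delta-syncrepl transmits the modification itself; replaying a delete
    # of an already-removed value is an error
    attr, op, value = mod
    if op == "delete":
        if value not in target.get(attr, set()):
            raise ValueError("err=16: no such attribute value")
        target[attr].remove(value)

# Plain syncrepl: replaying the final state converges
replica = {"member": {"uid=a", "uid=b"}}
snapshot = {"member": {"uid=b"}}
plain_syncrepl(replica, snapshot)
plain_syncrepl(replica, snapshot)   # replay: same end state

# Delta-syncrepl: replaying the operation fails the second time
entry = {"member": {"uid=a", "uid=b"}}
mod = ("member", "delete", "uid=a")
delta_syncrepl(entry, mod)          # first application succeeds
try:
    delta_syncrepl(entry, mod)      # replay fails, like err=16 in the logs
    replay_ok = True
except ValueError:
    replay_ok = False
```

Both paths end with the same entry contents, but only the delta replay raises an error, which is why duplicate delivery must be filtered out via the CSN rather than tolerated.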
On Wed, Aug 23, 2017 at 02:42:29PM +0100, Ondřej Kuzník wrote:
> It is caused by the cookie not containing CSN and a race between the
> syncCookie check in do_syncrep2 and syncrepl_message_to_op.
>
> This race is probably fine with plain syncrepl which is idempotent, but
> deltasync changes get their own dn in each accesslog instance and some
> can be applied twice unless we know how to find out we've already seen
> them - they need to mention the CSN.
>
> The CSN itself gets lost on at least one occasion - when there's a
> checkpoint triggered. Not 100 % sure why the cookie gets eaten because
> of it, the op pointer is different between the syncprov_op_response that
> calls syncprov_checkpoint and the one that decides CSN hasn't changed.

Yes, whenever a checkpoint happens, the syncCookie in cn=accesslog only contains rid=XXX,sid=YYY. I thought that was because the checkpoint results in a new accesslog entry that would be transmitted first, but that's not the case: there is no accesslog entry, nor anything sent to the client (as observed with ldapsearch -E sync=rp).

I think it looks like this: syncprov_checkpoint modifies the suffix entry, which calls slap_graduate_commit_csn, and the CSN is removed from be_pending_csn_list. accesslog_response then can't find the CSN there and has nothing to insert into its own pending CSN list. It's strange that changing the overlay order (accesslog vs. syncprov) doesn't change this behaviour, something I'd expect if the above were the reason this happens.

--
Ondřej Kuzník
Senior Software Engineer
Symas Corporation http://www.symas.com
Packaged, certified, and supported LDAP solutions powered by OpenLDAP
On Thu, Aug 24, 2017 at 03:55:32PM +0100, Ondřej Kuzník wrote:
> I think it looks like this: syncproc_checkpoint modifies the suffix
> entry, that calls slap_graduate_commit_csn and the csn is removed from
> be_pending_csn_list. accesslog_response then can't find the CSN there
> and has nothing to insert into its own pending csn list. Strange that
> changing the overlay order (accesslog vs. syncprov) doesn't change this
> behaviour, something I'd expect if the above is the reason this happens.

In any case, having the checkpoint modification get its own CSN entry lets the main operation's CSN entry persist, which fixes this issue, and all operations that need a CSN seem to get one. The patch is available at:

ftp://ftp.openldap.org/incoming/Ondrej-Kuznik-20170825-ITS8444.patch

--
Ondřej Kuzník
Senior Software Engineer
Symas Corporation http://www.symas.com
Packaged, certified, and supported LDAP solutions powered by OpenLDAP
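The pending-CSN bookkeeping and the shape of the fix can be sketched like this. This is an illustrative model only: the list and function names mirror the slapd concepts (be_pending_csn_list, slap_graduate_commit_csn, accesslog_response) but are not the real data structures or APIs.

```python
# Sketch: each write queues its CSN on a pending list; a later stage
# (accesslog_response in slapd) looks the CSN up to put it into the sync
# cookie. If the checkpoint write shares the operation's CSN entry,
# graduating the checkpoint removes it before that lookup happens, and
# the cookie goes out as bare rid/sid with no CSN.

pending = []  # stand-in for be_pending_csn_list

def queue_csn(csn):
    pending.append(csn)

def graduate_csn(csn):       # stand-in for slap_graduate_commit_csn
    if csn in pending:
        pending.remove(csn)

def cookie_csn():            # what accesslog_response can still find
    return pending[-1] if pending else None

op_csn = "20170718155142.000007Z#000000#001#000000"

# Broken behaviour: the checkpoint modification reuses the operation's
# CSN entry, so graduating the checkpoint empties the list too early.
queue_csn(op_csn)
graduate_csn(op_csn)         # checkpoint graduates the shared entry
broken = cookie_csn()        # nothing left for the cookie

# With the patch: the checkpoint gets its own entry, so graduating it
# leaves the main operation's CSN in place for accesslog_response.
queue_csn(op_csn)            # the operation's own entry
queue_csn("checkpoint-own-entry")
graduate_csn("checkpoint-own-entry")
fixed = cookie_csn()         # the op's CSN survives
```

In the broken sequence the cookie lookup finds nothing, matching the rid=XXX,sid=YYY-only cookies observed above; in the patched sequence the operation's CSN persists.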
changed notes changed state Open to Test
changed notes changed state Test to Release
Fixed in master (9e156bf91467da256ade7842b03b0aa6f9b7405c)
Fixed in RE24 (2.4.46)
changed notes changed state Release to Closed
*** Issue 8396 has been marked as a duplicate of this issue. ***