Issue 9282 - Syncrepl re-creates deleted entry
Summary: Syncrepl re-creates deleted entry
Status: VERIFIED FIXED
Alias: None
Product: OpenLDAP
Classification: Unclassified
Component: slapd (show other issues)
Version: 2.4.50
Hardware: All All
: --- normal
Target Milestone: 2.4.53
Assignee: OpenLDAP project
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-06-29 19:08 UTC by Quanah Gibson-Mount
Modified: 2021-12-13 16:41 UTC (History)
1 user (show)

See Also:


Attachments
config for node A (1.85 KB, text/plain)
2020-06-29 19:18 UTC, Quanah Gibson-Mount
Details
config for node B (1.85 KB, text/plain)
2020-06-29 19:19 UTC, Quanah Gibson-Mount
Details
Database used (149.91 KB, application/x-gzip)
2020-06-29 19:19 UTC, Quanah Gibson-Mount
Details

Note You need to log in before you can comment on or make changes to this issue.
Description Quanah Gibson-Mount 2020-06-29 19:08:40 UTC
Scenario:

2 node Multi-provider replication

Add database to provider A

ensure database replicates to provider B

Stop provider A

delete entry on provider B

Start provider A

Wait for provider B to reconnect to provider A

Deleted entry re-appears
Comment 1 Quanah Gibson-Mount 2020-06-29 19:18:58 UTC
Created attachment 741 [details]
config for node A
Comment 2 Quanah Gibson-Mount 2020-06-29 19:19:14 UTC
Created attachment 742 [details]
config for node B
Comment 3 Quanah Gibson-Mount 2020-06-29 19:19:33 UTC
Created attachment 743 [details]
Database used
Comment 4 Quanah Gibson-Mount 2020-06-29 19:24:02 UTC
General steps to reproduce:

slapadd server1 config on node A
slapadd server2 config on node B

start server A, start server B

ldapadd example database on node A

confirm DB has replicated to both nodes, and that entry to be deleted exists:

ldapsearch -H ldap://server1.example.com -x -D cn=admin,dc=example,dc=com -w secret -b dc=example,dc=com "(objectClass=*)" 1.1 | grep ^dn: | wc -l

Should be: 1011

ldapsearch -H ldap://server2.example.com -x -D cn=admin,dc=example,dc=com -w secret -b dc=example,dc=com "(objectClass=*)" 1.1 | grep ^dn: | wc -l

Should be: 1011

ldapsearch -x -H ldap://server1.example.com/ -D cn=admin,dc=example,dc=com -w secret "cn=Daemon Leeson" 1.1

Should return entry DN:
dn: cn=Damon Leeson,ou=Product Development,dc=example,dc=com

ldapsearch -x -H ldap://server2.example.com/ -D cn=admin,dc=example,dc=com -w secret "cn=Daemon Leeson" 1.1
dn: cn=Damon Leeson,ou=Product Development,dc=example,dc=com


Stop node A

Delete entry from node B:

ldapdelete -x -H ldap://server2.example.com/ -D cn=admin,dc=example,dc=com -w secret "cn=Damon Leeson, ou=Product Development, dc=example,dc=com"

Confirm entry is gone from node B:
ldapsearch -x -H ldap://server2.example.com/ -D cn=admin,dc=example,dc=com -w secret "cn=Daemon Leeson" 1.1

<nothing>

Start node A

Wait for replication to re-initiate from node B to node A

Search for entry in both nodes.

ldapsearch -x -H ldap://server1.example.com/ -D cn=admin,dc=example,dc=com -w secret "cn=Daemon Leeson" 1.1

ldapsearch -x -H ldap://server2.example.com/ -D cn=admin,dc=example,dc=com -w secret "cn=Daemon Leeson" 1.1

Expected: Entry is not in either node

Actual: Entry is in both nodes
Comment 5 Quanah Gibson-Mount 2020-06-30 20:43:34 UTC
Confirmed same issue exists in master after writing regression test
Comment 6 Ondřej Kuzník 2020-07-01 10:29:33 UTC
Thanks for the reproducer script.

This is due to https://git.openldap.org/openldap/openldap/-/blob/master/servers/slapd/syncrepl.c#L1638 causing A to skip the present cull.

Based on the git history, this was introduced to deal with ITS#5470 but that seems wrong, if the number of SIDs in the cookie differs from what we requested then either:
- a SID disappeared from the set we received, which sounds like what ITS#5470 is about? But slapd doesn't really allow this at the moment as it will say consumer is newer than provider) so that shouldn't happen
- a SID is added to the set by the provider, like here. This could be due to a delete (like here) and that delete has to be replicated - that is the point of running syncrepl_del_nonpresent

The above would not explain why server B then receives the deleted entry back rather than this being a silent desync. It turns out that check_syncprov() doesn't add SID 2 into the cookie[0] so it forgets its own modification when establishing the syncrepl session to A.

Howard, can you review if any of the claims above seem wrong?

[0]. https://git.openldap.org/openldap/openldap/-/blob/master/servers/slapd/syncrepl.c#L1638 - the loops should probably be inverted, with the outer loop operating on si_cookieState instead?
Comment 7 Howard Chu 2020-07-02 13:19:40 UTC
(In reply to Ondřej Kuzník from comment #6)
> Thanks for the reproducer script.
> 
> This is due to
> https://git.openldap.org/openldap/openldap/-/blob/master/servers/slapd/
> syncrepl.c#L1638 causing A to skip the present cull.
> 
> Based on the git history, this was introduced to deal with ITS#5470 but that
> seems wrong, if the number of SIDs in the cookie differs from what we
> requested then either:
> - a SID disappeared from the set we received, which sounds like what
> ITS#5470 is about? But slapd doesn't really allow this at the moment as it
> will say consumer is newer than provider) so that shouldn't happen

A SID can't disappear. They tend to stay in the contextCSN forever. (This is actually another problem, nodes that are converted from single-provider to multi-provider generally still have a SID 0 CSN, which is always ancient relative to the active SIDs. Routines that check for oldest CSN to still exist in the DB lead to wasteful checks because of that. Right now all you can do is use mage privs and delete the obsolete CSN.)

> - a SID is added to the set by the provider, like here. This could be due to
> a delete (like here) and that delete has to be replicated - that is the
> point of running syncrepl_del_nonpresent

Yes, the problem that was being addressed is that if the local node knows about more SIDs than the remote node, then the incoming present list from the remote node can't be trusted. Doing a del_nonpresent could delete a lot of entries that the remote node doesn't know about, but exist legitimately on the local node.

I think a proper fix would require a change in the syncrepl protocol sequencing. E.g., two nodes should refresh from each other with all of their new Adds/Modifies first, and once those changes have been settled, then they can perform a present cross-check. This would also require saving some intermediate cookie state in case the the full sequence gets interrupted.

Or, put in another way, there needs to be a separately tracked contextDeleteCSN.

> The above would not explain why server B then receives the deleted entry
> back rather than this being a silent desync. It turns out that
> check_syncprov() doesn't add SID 2 into the cookie[0] so it forgets its own
> modification when establishing the syncrepl session to A.
> 
> Howard, can you review if any of the claims above seem wrong?
> 
> [0].
> https://git.openldap.org/openldap/openldap/-/blob/master/servers/slapd/
> syncrepl.c#L1638 - the loops should probably be inverted, with the outer
> loop operating on si_cookieState instead?
Comment 8 Ondřej Kuzník 2020-07-02 14:22:43 UTC
On Thu, Jul 02, 2020 at 01:19:40PM +0000, openldap-its@openldap.org wrote:
> --- Comment #7 from Howard Chu <hyc@openldap.org> ---
> (In reply to Ondřej Kuzník from comment #6)
>> Thanks for the reproducer script.
>> 
>> This is due to
>> https://git.openldap.org/openldap/openldap/-/blob/master/servers/slapd/
>> syncrepl.c#L1638 causing A to skip the present cull.
>> 
>> Based on the git history, this was introduced to deal with ITS#5470 but that
>> seems wrong, if the number of SIDs in the cookie differs from what we
>> requested then either:
>> - a SID disappeared from the set we received, which sounds like what
>> ITS#5470 is about? But slapd doesn't really allow this at the moment as it
>> will say consumer is newer than provider) so that shouldn't happen
> 
> A SID can't disappear. They tend to stay in the contextCSN forever. (This is
> actually another problem, nodes that are converted from single-provider to
> multi-provider generally still have a SID 0 CSN, which is always ancient
> relative to the active SIDs. Routines that check for oldest CSN to still exist
> in the DB lead to wasteful checks because of that. Right now all you can do is
> use mage privs and delete the obsolete CSN.)

Yeah, and it would not be so wasteful if we could query the database for
the oldest/newest entry with a given SID in entryCSN. Removing a SID
from the set is always going to be a manual operation unless we can
coordinate with all provider and consumer nodes somehow.

>> - a SID is added to the set by the provider, like here. This could be due to
>> a delete (like here) and that delete has to be replicated - that is the
>> point of running syncrepl_del_nonpresent
> 
> Yes, the problem that was being addressed is that if the local node knows about
> more SIDs than the remote node, then the incoming present list from the remote
> node can't be trusted. Doing a del_nonpresent could delete a lot of entries
> that the remote node doesn't know about, but exist legitimately on the local
> node.

The scenario I describe here is if we start a search with a cookie
containing only SIDs {1, 2} but finish present phase by receiving a
cookie with SIDs {1, 2, 3}. Accepting that cookie implies we have to
process the (implied) deletes too or we have desynced.

If, in the meantime, we added entries with a SID of 4, those are not
part of the original cookie and should not be deleted, that's for sure.
I think we do the right thing already or are close to doing so.

> I think a proper fix would require a change in the syncrepl protocol
> sequencing. E.g., two nodes should refresh from each other with all of their
> new Adds/Modifies first, and once those changes have been settled, then they
> can perform a present cross-check. This would also require saving some
> intermediate cookie state in case the the full sequence gets interrupted.
> 
> Or, put in another way, there needs to be a separately tracked
> contextDeleteCSN.

That's ITS#8125 work, I should get back to that eventually.
Comment 9 Howard Chu 2020-07-02 15:31:14 UTC
(In reply to Howard Chu from comment #7)
> (In reply to Ondřej Kuzník from comment #6)
> > - a SID is added to the set by the provider, like here. This could be due to
> > a delete (like here) and that delete has to be replicated - that is the
> > point of running syncrepl_del_nonpresent
> 
> Yes, the problem that was being addressed is that if the local node knows
> about more SIDs than the remote node, then the incoming present list from
> the remote node can't be trusted. Doing a del_nonpresent could delete a lot
> of entries that the remote node doesn't know about, but exist legitimately
> on the local node.

More on this, from https://bugs.openldap.org/show_bug.cgi?id=5470#c15

We could try changing del_nonpresent to ignore entries with entryCSN newer than the current remote cookie, instead of ignoring the entire presentlist.
Comment 10 Ondřej Kuzník 2020-07-03 08:55:20 UTC
On Thu, Jul 02, 2020 at 03:31:14PM +0000, openldap-its@openldap.org wrote:
> --- Comment #9 from Howard Chu <hyc@openldap.org> ---
> (In reply to Howard Chu from comment #7)
>> Yes, the problem that was being addressed is that if the local node knows
>> about more SIDs than the remote node, then the incoming present list from
>> the remote node can't be trusted. Doing a del_nonpresent could delete a lot
>> of entries that the remote node doesn't know about, but exist legitimately
>> on the local node.
> 
> More on this, from https://bugs.openldap.org/show_bug.cgi?id=5470#c15
> 
> We could try changing del_nonpresent to ignore entries with entryCSN newer than
> the current remote cookie, instead of ignoring the entire presentlist.

And that's what you've touched on in ITS#5470
(4673c99e96fdc70ef96b84a9aa4de6141f26e6df) already but had to partially
revert in ac037d3a13ce56f72ef24c8e17fc944c39c71e72 since there is still
no way to LE compare only within the same SID, but I repeat myself.
Comment 11 Ondřej Kuzník 2020-07-07 13:10:43 UTC
(In reply to Howard Chu from comment #9)
> (In reply to Howard Chu from comment #7)
>> Yes, the problem that was being addressed is that if the local node knows
>> about more SIDs than the remote node, then the incoming present list from
>> the remote node can't be trusted. Doing a del_nonpresent could delete a lot
>> of entries that the remote node doesn't know about, but exist legitimately
>> on the local node.
> 
> More on this, from https://bugs.openldap.org/show_bug.cgi?id=5470#c15
> 
> We could try changing del_nonpresent to ignore entries with entryCSN newer
> than the current remote cookie, instead of ignoring the entire presentlist.

The top commit in https://git.openldap.org/openldap/openldap/-/merge_requests/91 attempts to address this very issue by checking whether its entryCSN is covered by the cookie we received.
Comment 12 Quanah Gibson-Mount 2020-07-22 18:56:54 UTC
Commits: 
  • 9e736dbf 
by Quanah Gibson-Mount at 2020-07-22T18:24:51+00:00 
ITS#9282 regression test


  • 5bbcf38c 
by Ondřej Kuzník at 2020-07-22T18:24:51+00:00 
ITS#9282 Build a complete cookie for the search


  • 521b8bbe 
by Ondřej Kuzník at 2020-07-22T18:24:52+00:00 
ITS#9282 Check entries are covered by new contextCSN before deletion
Comment 13 Quanah Gibson-Mount 2020-07-23 17:05:22 UTC
RE24:

Commits: 
  • 7fb8912b 
by Quanah Gibson-Mount at 2020-07-23T15:57:22+00:00 
ITS#9282 regression test


  • 939404ac 
by Ondřej Kuzník at 2020-07-23T16:53:46+00:00 
ITS#9282 Build a complete cookie for the search


  • a34ecd86 
by Ondřej Kuzník at 2020-07-23T17:02:29+00:00 
ITS#9282 Check entries are covered by new contextCSN before deletion
Comment 14 Quanah Gibson-Mount 2020-08-31 19:53:37 UTC
Commits: 
  • 8699e5f3 
by Howard Chu at 2020-08-31T19:36:10+01:00 
ITS#9282 fix crash in nonpresent_callback

In a standard Refresh present phase, the provider sends no cookie
since it is only listing the entries that existed as of the time
in the cookie the consumer sent. In this case the consumer only
needs to check entryCSNs against its last sent cookie.


  • 41396248 
by Howard Chu at 2020-08-31T20:09:52+01:00 
ITS#9282 more for merge_state

Don't assume si_cookieState is always newer
Comment 15 Quanah Gibson-Mount 2021-02-18 18:42:55 UTC
Commits: 
  • ee564399 
by Ondřej Kuzník at 2021-02-18T17:31:32+00:00 
ITS#9282 Check all csns
Comment 16 Quanah Gibson-Mount 2021-02-18 18:46:41 UTC
RE24:

Commits: 
  • f2a6425c 
by Ondřej Kuzník at 2021-02-18T18:43:44+00:00 
ITS#9282 Check all csns
Comment 17 Quanah Gibson-Mount 2021-12-13 16:39:17 UTC
head:

Commits: 
  • da610c05 
by Ondřej Kuzník at 2021-12-08T17:15:57+00:00 
ITS#9282 Short-circuit cookie comparison in non-present check


  • 8d428f31 
by Ondřej Kuzník at 2021-12-08T17:15:57+00:00 
ITS#9282 Do not resuscitate entries we already deleted


  • a73ddda5 
by Ondřej Kuzník at 2021-12-08T17:15:57+00:00 
ITS#9282 Skip old accesslog entries even in delta-refresh
Comment 18 Quanah Gibson-Mount 2021-12-13 16:41:28 UTC
RE26:

  • 8507e711 
by Ondřej Kuzník at 2021-12-13T16:38:23+00:00 
ITS#9282 Short-circuit cookie comparison in non-present check


  • 68981147 
by Ondřej Kuzník at 2021-12-13T16:38:28+00:00 
ITS#9282 Do not resuscitate entries we already deleted


  • 6f7dccd5 
by Ondřej Kuzník at 2021-12-13T16:38:33+00:00 
ITS#9282 Skip old accesslog entries even in delta-refresh
Comment 19 Quanah Gibson-Mount 2021-12-13 16:41:37 UTC
RE25:

  • 11a86733 
by Ondřej Kuzník at 2021-12-13T16:40:47+00:00 
ITS#9282 Short-circuit cookie comparison in non-present check


  • 9bd3fa40 
by Ondřej Kuzník at 2021-12-13T16:40:50+00:00 
ITS#9282 Do not resuscitate entries we already deleted


  • f1ce0d20 
by Ondřej Kuzník at 2021-12-13T16:40:52+00:00 
ITS#9282 Skip old accesslog entries even in delta-refresh