Issue 9823 - syncprov doesn't fallback when deltasync consumer's offline beyond accesslog depth
Status: VERIFIED FIXED
Alias: None
Product: OpenLDAP
Classification: Unclassified
Component: slapd
Version: 2.6.1
Hardware: All All
Importance: --- normal
Target Milestone: 2.5.13
Assignee: Ondřej Kuzník
Reported: 2022-04-13 14:38 UTC by Shawn McKinney
Modified: 2024-02-15 18:19 UTC

Attachments
provider slapd.conf (2.71 KB, text/plain)
2022-04-18 18:31 UTC, Shawn McKinney
consumer slapd.conf (3.10 KB, text/plain)
2022-04-18 18:32 UTC, Shawn McKinney

Description Shawn McKinney 2022-04-13 14:38:11 UTC
Configured with delta-syncrepl. When a consumer goes offline for longer than the logpurge interval, it does not fall back to plain syncrepl, resulting in a desync.
Comment 1 Shawn McKinney 2022-04-18 18:31:37 UTC
Created attachment 893
provider slapd.conf
Comment 2 Shawn McKinney 2022-04-18 18:32:04 UTC
Created attachment 894
consumer slapd.conf
Comment 3 Shawn McKinney 2022-04-18 18:35:45 UTC
# Instructions to reproduce

1. Use openldap_version: 'OPENLDAP_REL_ENG_2_6'
2. Set up one provider and one consumer with delta-syncrepl (configs attached)
3. Set on provider: logpurge 00+00:02 00+00:01
4. Add some records (batch #1)
5. Stop the consumer
6. Add some more records (batch #2)
7. Wait 3 minutes
8. Start the consumer
9. Measure entry count. Consumer won't receive the 2nd batch of records
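
The timing in steps 3 and 7 can be checked with a small sketch. It assumes the `[ddd+]hh:mm[:ss]` time format documented for the accesslog overlay's `logpurge` directive, where the first value is the maximum record age and the second is how often the purge scan runs. The helper names (`parse_logpurge_time`, `purge_guaranteed`) are hypothetical and not part of OpenLDAP.

```python
import re
from datetime import timedelta

# [ddd+]hh:mm[:ss], as used by "logpurge <age> <interval>"
_TIME_SPEC = re.compile(r"^(?:(\d+)\+)?(\d+):(\d+)(?::(\d+))?$")


def parse_logpurge_time(spec: str) -> timedelta:
    """Parse a [ddd+]hh:mm[:ss] spec into a timedelta (hypothetical helper)."""
    m = _TIME_SPEC.match(spec)
    if m is None:
        raise ValueError(f"bad logpurge time spec: {spec!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in m.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def purge_guaranteed(offline: timedelta, age: timedelta, interval: timedelta) -> bool:
    """True if every record written before the outage must have been purged.

    Assumption: a record is removed by the first scan that runs after it
    exceeds `age`, so it survives at most `age + interval`.
    """
    return offline >= age + interval


age = parse_logpurge_time("00+00:02")       # step 3: purge records older than 2 minutes
interval = parse_logpurge_time("00+00:01")  # step 3: scan once a minute
offline = timedelta(minutes=3)              # step 7: wait 3 minutes

print(age, interval, purge_guaranteed(offline, age, interval))
```

Under that assumption, the 3-minute wait in step 7 is exactly long enough for batch #2 to have been purged from the accesslog before the consumer reconnects.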
Comment 4 Shawn McKinney 2022-04-18 18:49:36 UTC
Note:
The same behavior applies when the consumer is also a provider (i.e. multi-provider). If a delta-sync consumer is offline for longer than the purge interval of its providers, it won't receive the updates corresponding to those purged records.

The question is: why doesn't it fall back to plain syncrepl? Or, failing that, why isn't the consumer given some indication (an error) that it can't be brought back in sync, i.e. that a desync has occurred?
Comment 5 Ondřej Kuzník 2022-04-21 10:33:40 UTC
On Mon, Apr 18, 2022 at 06:49:36PM +0000, openldap-its@openldap.org wrote:
> --- Comment #4 from Shawn McKinney <smckinney@symas.com> ---
> The question, why doesn't it fallback into plain sync repl? Or, given some
> indication to the consumer (error) that it can't be brought back in sync, i.e.
> dsync has occurred.

Syncprov just lets the consumer replicate the current contents of the
database (minus any deletions because syncprov-nopresent is set). It has
no idea that deletes happened (there is no record of them) and how it
all fits into the semantics of delta-syncrepl.

We could teach syncprov about minCSN (as maintained by slapo-accesslog)
when nopresent is set but then we should really rename the parameter to
something else, more in line with the intended usage.

Another thing to keep in mind if we go that route is that minCSN would
now have two slightly different uses:
- as an indication of how useful the log is as a source of a refresh
  delete phase
- as an indication whether the accesslog is useful as a replication log
  for deltasync
  
A much tighter set of assumptions is associated with the latter. In
general, whenever a main DB runs a plain refresh, this changes what part
of the accesslog is usable as a deltasync source[0] while its usefulness
to serve as a sessionlog source is unaffected.

[0]. A plain refresh destroys ordering information so anything before it
has finished is suspect for deltasync. Currently we ignore that, see
ITS#9580 for more background
Comment 6 Howard Chu 2022-04-21 12:50:34 UTC
(In reply to Ondřej Kuzník from comment #5)
> On Mon, Apr 18, 2022 at 06:49:36PM +0000, openldap-its@openldap.org wrote:
> > --- Comment #4 from Shawn McKinney <smckinney@symas.com> ---
> > The question, why doesn't it fallback into plain sync repl? Or, given some
> > indication to the consumer (error) that it can't be brought back in sync, i.e.
> > dsync has occurred.
> 
> Syncprov just lets the consumer replicate the current contents of the
> database (minus any deletions because syncprov-nopresent is set). It has
> no idea that deletes happened (there is no record of them) and how it
> all fits into the semantics of delta-syncrepl.

Deletes are irrelevant when polling the log. The log is a queue, appends at the tail and deletes from the head. The only check that's required is to see if the consumer's cookieCSNs are still present in the log. If not, then records that cover the cookie are gone, and a refresh from the mainDB is needed. That's why we use the nopresent config on the logDB, because a normal present phase is a bunch of work for no extra benefit.
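
A minimal sketch of the queue model described in this comment, for a single SID only. The `AccessLog` class and `choose_strategy` function are hypothetical illustrations, not syncprov code; CSNs are modeled as strings in the usual `timestamp#count#sid#mod` layout, which order correctly under plain string comparison when they have the same length.

```python
from collections import deque


def make_csn(second: int, sid: int = 1) -> str:
    """Build an illustrative CSN string; only the seconds field varies."""
    return f"202204181835{second:02d}.000000Z#000000#{sid:03d}#000000"


class AccessLog:
    """Toy log: appends at the tail, purges from the head."""

    def __init__(self) -> None:
        self._entries: deque[str] = deque()

    def append(self, csn: str) -> None:
        self._entries.append(csn)

    def purge_before(self, csn: str) -> None:
        # Drop the oldest records until the head is at or after `csn`.
        while self._entries and self._entries[0] < csn:
            self._entries.popleft()

    def contains(self, csn: str) -> bool:
        return csn in self._entries


def choose_strategy(log: AccessLog, cookie_csn: str) -> str:
    """Replay from the log only if the consumer's cookie CSN is still in it."""
    return "delta-replay" if log.contains(cookie_csn) else "refresh-from-main-db"


log = AccessLog()
for s in range(1, 6):
    log.append(make_csn(s))

cookie = make_csn(2)                 # consumer was in sync up to this record
before = choose_strategy(log, cookie)
log.purge_before(make_csn(4))        # purge removes the record covering the cookie
after = choose_strategy(log, cookie)
print(before, after)
```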
Comment 7 Ondřej Kuzník 2022-04-21 12:58:41 UTC
On Thu, Apr 21, 2022 at 12:50:34PM +0000, openldap-its@openldap.org wrote:
> Deletes are irrelevant when polling the log. The log is a queue, appends at the
> tail and deletes from the head. The only check that's required is to see if the
> consumer's cookieCSNs are still present in the log. If not, then records that
> cover the cookie are gone, and a refresh from the mainDB is needed. That's why
> we use the nopresent config on the logDB, because a normal present phase is a
> bunch of work for no extra benefit.

This is false. Imagine a multi-sid environment where the provider in
question is A and the other providers include B and C:
- replica X disconnects
- sid B and C receive new write operations
- sid B operations reach A in a timely manner
- sid C CSNs are significantly delayed in reaching A
- logpurge kicks in, purging some sid B operations (sid C operations
  older than these are retained)
- replica X reconnects to A, the sid C CSN is chosen as the older CSN
  for some reason, it is found in the accesslog, and replication
  continues (with the same effect as described in this issue)
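
The scenario above can be sketched as follows. This is a toy model under stated assumptions, not the actual syncprov or accesslog logic: CSNs are plain integers, purging is driven by when A logged each operation, and `min_csn` is assumed to record, per SID, the newest CSN that has been purged. All names (`LogEntry`, `purge`, `naive_check`, `min_csn_check`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    arrival: int  # when provider A logged the operation (drives purging)
    sid: str      # originating provider
    csn: int      # stand-in for the CSN assigned by the originating provider


def purge(log, cutoff):
    """Drop entries logged before `cutoff`; return (kept, newest purged CSN per sid)."""
    kept, min_csn = [], {}
    for e in log:
        if e.arrival < cutoff:
            min_csn[e.sid] = max(min_csn.get(e.sid, e.csn), e.csn)
        else:
            kept.append(e)
    return kept, min_csn


def naive_check(log, cookie):
    """Accept the log if the consumer's oldest cookie CSN is still present."""
    sid = min(cookie, key=cookie.get)
    return any(e.sid == sid and e.csn == cookie[sid] for e in log)


def min_csn_check(min_csn, cookie):
    """Accept the log only if nothing newer than the cookie was purged for any sid."""
    return all(cookie.get(sid, 0) >= purged for sid, purged in min_csn.items())


# Replica X disconnected holding these CSNs (assumed values).
cookie = {"B": 10, "C": 5}

log_at_a = [
    LogEntry(arrival=20, sid="B", csn=11),  # reached A promptly
    LogEntry(arrival=21, sid="B", csn=12),  # reached A promptly
    LogEntry(arrival=45, sid="C", csn=5),   # old CSN, delayed in reaching A
    LogEntry(arrival=46, sid="C", csn=6),
    LogEntry(arrival=47, sid="B", csn=13),
]

kept, min_csn = purge(log_at_a, cutoff=40)
print(naive_check(kept, cookie), min_csn_check(min_csn, cookie))
```

In this model the presence check succeeds because the delayed sid C record is still in the log, even though sid B operations 11 and 12 are gone; the per-SID comparison against `min_csn` notices that the cookie's sid B position predates the purge and would force a fallback.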

I agree nopresent is important for efficient deltasync operation. I was
just suggesting that this configuration option has no use other than on a
logDB, so we can fold this into the proposed behavioural change.
Comment 9 Dimitar Stoychev 2022-06-13 20:25:56 UTC
The proposed changes are derived from OpenLDAP Software. All of the modifications to OpenLDAP Software represented in the following changes were developed by Symas Corporation. Symas Corporation has not assigned rights and/or interest in this work to any party. I, Dimitar Stoychev, am authorized by Symas Corporation, my employer, to release this work under the following terms.

Copyright 2022 Symas Corporation
Redistribution and use in source and binary forms, with or without modification, are permitted only as authorized by the OpenLDAP Public License.
Comment 10 Quanah Gibson-Mount 2022-06-23 18:57:33 UTC
head:

  • 69de6c94 
by Dimitar Stoychev at 2022-06-21T16:21:56+00:00 
ITS#9823 Update test043 to check deltasync recovery after accesslog has been purged

  • c64e6635 
by Ondřej Kuzník at 2022-06-21T16:21:56+00:00 
ITS#9823 Check minCSN when setting up delta-log replay


RE26:

  • e56e70b4 
by Dimitar Stoychev at 2022-06-23T18:42:54+00:00 
ITS#9823 Update test043 to check deltasync recovery after accesslog has been purged

  • eea9b838 
by Ondřej Kuzník at 2022-06-23T18:42:59+00:00 
ITS#9823 Check minCSN when setting up delta-log replay


RE25:

  • ff15ef02 
by Dimitar Stoychev at 2022-06-23T18:49:19+00:00 
ITS#9823 Update test043 to check deltasync recovery after accesslog has been purged

  • f674fbee 
by Ondřej Kuzník at 2022-06-23T18:49:23+00:00 
ITS#9823 Check minCSN when setting up delta-log replay
Comment 11 Quanah Gibson-Mount 2022-07-07 21:44:29 UTC
head:

  • 207604c0 
by Ondřej Kuzník at 2022-07-07T21:31:03+01:00 
ITS#9823 Only request minCSN if accesslog is around

RE26:

  • 23ef018c 
by Ondřej Kuzník at 2022-07-07T21:24:38+00:00 
ITS#9823 Only request minCSN if accesslog is around

RE25:

  • fc812cdb 
by Ondřej Kuzník at 2022-07-07T21:25:02+00:00 
ITS#9823 Only request minCSN if accesslog is around
Comment 12 Quanah Gibson-Mount 2024-02-15 18:19:07 UTC
head:

  • 7ade966c 
by Ondřej Kuzník at 2024-02-05T22:57:17+00:00 
ITS#9823 Move to a place that is better associated with accesslog

RE26:

  • fe7ee150 
by Ondřej Kuzník at 2024-02-15T17:55:09+00:00 
ITS#9823 Move to a place that is better associated with accesslog

RE25:

  • c4a8fce7 
by Ondřej Kuzník at 2024-02-15T17:55:05+00:00 
ITS#9823 Move to a place that is better associated with accesslog