Issue 9059 - session log parsing triggers cascading REFRESH
Summary: session log parsing triggers cascading REFRESH
Status: VERIFIED FIXED
Alias: None
Product: OpenLDAP
Classification: Unclassified
Component: slapd (show other issues)
Version: 2.4.47
Hardware: All All
: --- normal
Target Milestone: 2.5.0
Assignee: OpenLDAP project
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-08-01 15:16 UTC by Quanah Gibson-Mount
Modified: 2020-10-14 21:18 UTC (History)
0 users

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this issue.
Description Quanah Gibson-Mount 2019-08-01 15:16:06 UTC
Full_Name: Quanah Gibson-Mount
Version: 2.4.47
OS: N/A
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (47.208.128.44)


In investigating why a particular consumer went into REFRESH mode, I find that
this was triggered by one of its providers currently parsing a sessionlog. 
Sessionlog parsing should *not* cause a consumer of that server to fall back to
REFRESH.

Server that got dropped into REFRESH incorrectly:

Jul 30 21:26:48 iam-ed-ldp-p01 slapd[20979]: do_syncrep2: rid=001
LDAP_RES_SEARCH_RESULT
Jul 30 21:26:48 iam-ed-ldp-p01 slapd[20979]: do_syncrep2: rid=001 delta-sync
lost sync, switching to REFRESH
Jul 30 21:26:48 iam-ed-ldp-p01 slapd[20979]: do_syncrep2: rid=001 (4096) Content
Sync Refresh Required

RID001 server:

....
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127055648.177968Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181109151409.323374Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127122502.661263Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127061139.881198Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127062500.514773Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127061139.941766Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127062500.622874Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127061139.977196Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127062500.656726Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181109151409.575116Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127062500.672371Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127062500.701608Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181109151409.833435Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127062500.766277Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181109151410.102670Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127055648.192922Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127061140.091604Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127062500.859669Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127061140.160929Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127122502.664055Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127061140.213051Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127122502.666873Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127062500.892859Z#000000#001#000000
....

Comment 1 OpenLDAP project 2019-08-01 15:26:31 UTC
See also ITS#8125
Comment 2 Quanah Gibson-Mount 2019-08-01 15:26:31 UTC
changed notes
Comment 3 Quanah Gibson-Mount 2019-08-01 17:38:50 UTC
--On Thursday, August 01, 2019 4:16 PM +0000 quanah@openldap.org wrote:

> Full_Name: Quanah Gibson-Mount
> Version: 2.4.47
> OS: N/A
> URL: ftp://ftp.openldap.org/incoming/
> Submission from: (NULL) (47.208.128.44)
>
>
> In investigating why a particular consumer went into REFRESH mode, I find
> that this was triggered by one of its providers currently parsing a
> sessionlog.  Sessionlog parsing should *not* cause a consumer of that
> server to fall back to REFRESH.

Some notes on this specific environment and situation.

This is a 4-way MMR setup.  Via the use of a load balancer, only 
iam-ed-ldp-p01 recieves write operations.  Thus there should never be any 
REFRESHes occurring anywhere, so the fact that there is is already 
problematic.

Servers are p01, p02, p03 (all in the same data center) and b01 (offsite DR 
system).

At the time this issue hit:

p01 was upgraded to a new release of OpenLDAP @ 21:26:48, thus we see:
Jul 30 21:25:17 iam-ed-ldp-p01 slapd[30495]: slapd shutdown: waiting for 0 
operations/tasks to finish
Jul 30 21:26:48 iam-ed-ldp-p01 slapd[20978]: @(#) $OpenLDAP: slapd 2.4.47 
(Mar 22 2019 09:12:20) 
$#012#011build@rhel7build.rb.symas.net:/home/build/sold-2.4.47.3/redhat7-x86_64/build/openldap.x86_64/servers/slapd
Jul 30 21:26:48 iam-ed-ldp-p01 slapd[20979]: slapd starting
Jul 30 21:26:48 iam-ed-ldp-p01 slapd[20979]: do_syncrep2: rid=002 
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
Jul 30 21:26:48 iam-ed-ldp-p01 slapd[20979]: do_syncrep2: rid=001 
LDAP_RES_SEARCH_RESULT
Jul 30 21:26:48 iam-ed-ldp-p01 slapd[20979]: do_syncrep2: rid=001 
delta-sync lost sync, switching to REFRESH
Jul 30 21:26:48 iam-ed-ldp-p01 slapd[20979]: do_syncrep2: rid=001 (4096) 
Content Sync Refresh Required
Jul 30 21:26:48 iam-ed-ldp-p01 slapd[20979]: do_syncrep2: rid=003 
LDAP_RES_INTERMEDIATE - REFRESH_DELETE

p03 was operating w/o issue.  Interstingly enough, when it was restarted 
for the software update @ 21:37:16, it got an error (53) talking to p02 
instead of an error (4096) like p01 did.  This is likely because p02 was 
restarted for its software update @ 21:32:19, which wiped out its 
sessionlog.

p02 had started doing session log parsing @ 21:26:17 with
Jul 30 21:26:17 iam-ed-ldp-p02 slapd[3750]: conn=18205759 op=1 SEARCH 
RESULT tag=101 err=4096 duration=18.462ms nentries=0 text=sync cookie is 
stale

We see the connection to p02 from p01 with a similar issue when it starts 
up:
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: conn=18205790 op=1 SRCH 
base="cn=accesslog" scope=2 deref=0 
filter="(&(objectClass=auditWriteObject)(reqResult=0))"
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: conn=18205790 op=1 SRCH 
attr=reqDN reqType reqMod reqNewRDN reqDeleteOldRDN reqNewSuperior entryCSN
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: conn=18205790 op=1 SEARCH 
RESULT tag=101 err=4096 duration=0.431ms nentries=0 text=sync cookie is 
stale

b01 had its software update @ 21:21:25

Generally it seems like restarting/wiping the sessionlog made things 
incredibly bad across the cluster.









--

Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
<http://www.symas.com>


Comment 4 Ondřej Kuzník 2020-04-30 12:13:28 UTC
The response is triggered by https://git.openldap.org/openldap/openldap/-/blob/fd23680a447b9efe1a481dd64d9c57f3873f3108/servers/slapd/overlays/syncprov.c#L2886 but it looks like the sessionlog has already been replayed correctly.

In that case, we are either finished or have a persistent search set up and all remaining responses are queued up to be sent, so we shouldn't even care if we can still find the CSN in the DB... Moving that whole `if` under `do_present == 1` should then be enough and it might not be related to bug 8125 at all.

But then I might be missing something.
Comment 5 Howard Chu 2020-04-30 17:06:42 UTC
(In reply to Ondřej Kuzník from comment #4)
> The response is triggered by
> https://git.openldap.org/openldap/openldap/-/blob/
> fd23680a447b9efe1a481dd64d9c57f3873f3108/servers/slapd/overlays/syncprov.
> c#L2886 but it looks like the sessionlog has already been replayed correctly.
> 
> In that case, we are either finished or have a persistent search set up and
> all remaining responses are queued up to be sent, so we shouldn't even care
> if we can still find the CSN in the DB... Moving that whole `if` under
> `do_present == 1` should then be enough and it might not be related to bug
> 8125 at all.
> 
> But then I might be missing something.

Sounds OK. the MinCSN check is to make sure the DB hasn't already moved on
past the consumer's cookie, but if the sessionlog validly spans the consumer
cookie then the check isn't needed.
Comment 6 Quanah Gibson-Mount 2020-05-22 17:42:19 UTC
Commits: 
  • 709d805f 
by Ondřej Kuzník at 2020-05-22T16:57:53+00:00 
ITS#9059 Skip mincsn check if sessionlog replay was successful


  • f3952d94 
by Ondřej Kuzník at 2020-05-22T16:57:53+00:00 
ITS#9059 Document why we do FIND_CSN