8100 – Empty accesslog causes issues with delta-syncrepl MMR configurations

Status:	VERIFIED FIXED

Alias:	None

Product:	OpenLDAP
Classification:	Unclassified
Component:	slapd
Version:	2.4.39
Hardware:	All All

Importance:	--- normal
Target Milestone:	---
Assignee:	OpenLDAP project

Duplicates (1):	8921

Reported:	2015-04-09 04:42 UTC by Quanah Gibson-Mount
Modified:	2020-03-23 20:45 UTC

Description Quanah Gibson-Mount 2015-04-09 04:42:45 UTC

Full_Name: Quanah Gibson-Mount
Version: 2.4.39
OS: Linux 2.6
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (50.25.188.166)


In an MMR setup using delta-syncrepl, when one master gets out of sync, or
when a new MMR node is added to an existing cluster, replication stays broken
until the new/reloaded node has a write op that goes to the accesslog DB.  In
an existing cluster where a node is being reloaded, all nodes go into an
endless looping fallback sync until that write occurs.

Comment 1 Quanah Gibson-Mount 2017-04-12 16:38:53 UTC

moved from Incoming to Software Bugs

Comment 2 Quanah Gibson-Mount 2017-06-22 16:02:54 UTC

One possible fix for this would be to refuse to delete the final entry in
the accesslog during the purge phase.  That way, the accesslog would never
be empty.  I'm not sure how difficult this would be to implement, code-wise.

--Quanah

--

Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
<http://www.symas.com>

Comment 3 Howard Chu 2018-01-26 20:23:47 UTC

quanah@symas.com wrote:
> One possible fix for this, would be to refuse to delete the final entry in
> the accesslog during the purge phase.  That way, the accesslog would never
> be empty.  I'm not sure how difficult this would be to implement, code wise.

A patch which skips deleting the final entry, and creates an initial dummy log 
entry if needed, is available in 
https://github.com/quanah/openldap-scratch/tree/its8100 for testing.

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/
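
The behaviour described for the its8100 patch (keep the newest log entry
during purge, and seed a dummy entry when the log is empty) can be illustrated
with a minimal Python sketch.  This is not the slapd code from that branch:
`LogEntry`, `ensure_seed`, and `purge` are hypothetical names, and the
accesslog is modelled as a plain in-memory list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class LogEntry:
    """Hypothetical stand-in for one accesslog entry."""
    stamp: datetime  # when the logged operation happened
    note: str        # free-form label, for illustration only


def ensure_seed(log, now):
    """Add an initial dummy entry if the log is empty, so it is never empty."""
    if not log:
        log.append(LogEntry(now, "dummy"))
    return log


def purge(log, now, max_age):
    """Drop entries older than max_age, but always keep the newest entry."""
    if not log:
        return []
    newest = max(log, key=lambda e: e.stamp)
    cutoff = now - max_age
    return [e for e in log if e is newest or e.stamp >= cutoff]


now = datetime(2018, 1, 29, 10, 0, tzinfo=timezone.utc)
# Three entries, all older than the 10-minute purge age.
log = [LogEntry(now - timedelta(minutes=m), f"op{m}") for m in (60, 45, 30)]
kept = purge(log, now, timedelta(minutes=10))
print([e.note for e in kept])
```

Even though every entry is past the purge age, the most recent one survives,
which is the property the fix relies on.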

Comment 4 Quanah Gibson-Mount 2018-01-29 18:23:24 UTC

--On Friday, January 26, 2018 8:23 PM +0000 hyc@symas.com wrote:

> A patch which skips deleting the final entry, and creates an initial
> dummy log  entry if needed, is available in
> https://github.com/quanah/openldap-scratch/tree/its8100 for testing.

Hi Howard,

When reinstalling a 4-way MMR system from scratch, we still end up in 
REFRESH mode.  In the database I'm loading, there are 4 contextCSN values, 
one per active master:

contextCSN: 20171203010043.825769Z#000000#001#000000
contextCSN: 20171130222521.056018Z#000000#002#000000
contextCSN: 20171130222318.939265Z#000000#003#000000
contextCSN: 20171203041258.811473Z#000000#004#000000
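
For reading the values above: as far as I understand the OpenLDAP CSN layout,
each value is timestamp#count#serverID#mod, with the last three fields in hex.
A small Python sketch that splits the four values from this report follows;
`CSN` and `parse_csn` are illustrative names, not OpenLDAP APIs.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class CSN(NamedTuple):
    time: datetime  # timestamp of the change (UTC)
    count: int      # change counter
    sid: int        # serverID of the originating master
    mod: int        # modification counter


def parse_csn(text):
    """Split a CSN string into its four '#'-separated fields."""
    stamp, count, sid, mod = text.split("#")
    when = datetime.strptime(stamp, "%Y%m%d%H%M%S.%fZ")
    when = when.replace(tzinfo=timezone.utc)
    return CSN(when, int(count, 16), int(sid, 16), int(mod, 16))


values = [
    "20171203010043.825769Z#000000#001#000000",
    "20171130222521.056018Z#000000#002#000000",
    "20171130222318.939265Z#000000#003#000000",
    "20171203041258.811473Z#000000#004#000000",
]
csns = [parse_csn(v) for v in values]
sids = sorted(c.sid for c in csns)
newest = max(csns, key=lambda c: c.time)
print(sids, newest.sid)
```

Here the newest of the four contextCSN values belongs to serverID 4.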

When I start up the first master (serverID 4 in this case), a contextCSN 
value is properly written for it to the underlying db:

Jan 29 10:06:06 anvil4 slapd[1949]: slapd starting
Jan 29 10:06:06 anvil4 slapd[1949]: slap_queue_csn: queueing 0x7f54d4104220 20171203041258.811473Z#000000#004#000000
Jan 29 10:06:06 anvil4 slapd[1949]: slap_queue_csn: queueing 0x7f54d4104cc0 20171203041258.811473Z#000000#004#000000
Jan 29 10:06:06 anvil4 slapd[1949]: slap_graduate_commit_csn: removing 0x7f54d4104cc0 20171203041258.811473Z#000000#004#000000
Jan 29 10:06:06 anvil4 slapd[1949]: slap_graduate_commit_csn: removing 0x7f54d4104220 20171203041258.811473Z#000000#004#000000


But when I start the other 3 masters, they do not write an entry for their 
CSN, and since there's no CSN value for them on the other masters either, 
they all fall back to REFRESH_DELETE:

Jan 29 10:06:26 anvil4 slapd[1949]: do_syncrep2: rid=003 LDAP_RES_INTERMEDIATE - REFRESH_DELETE

Even worse, they do this for every master that comes online.

I think the code needs to add an entry to the accesslog for every 
contextCSN value, not just the final contextCSN?

I'll continue testing the other half of the fix (deleting all but the
most recent entry from the accesslog during purge).

Thanks!

--Quanah

Comment 5 Quanah Gibson-Mount 2018-01-29 19:03:53 UTC

--On Monday, January 29, 2018 10:23 AM -0800 Quanah Gibson-Mount 
<quanah@symas.com> wrote:

> I'll continue testing for the other half of the fix (Deleting all but the
> most recent entry from the accesslog during purge)

This part appears to work as desired.  I set the purge age to 10 minutes,
with a purge check every 5 minutes, and then made changes.

All entries but the most recent one were removed after 15 minutes went by.

Made more changes, did the same wait period, and again, all entries but the 
most recent were removed during the next cleanup interval.
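
The 15-minute figure is consistent with a simple timing model: an entry
becomes eligible once it is older than the purge age, and is actually removed
at the next periodic scan, so the upper bound is age plus scan interval.  The
Python sketch below assumes scans run at fixed multiples of the interval,
which is a simplification and not taken from the slapd source; `removal_time`
is a hypothetical helper.  With slapo-accesslog, this setting would presumably
be written as `logpurge 00:10 00:05`.

```python
import math


def removal_time(written_at, max_age, scan_interval):
    """Minute at which an entry written at `written_at` gets purged.

    Assumes scans happen at scan_interval, 2*scan_interval, ... and that a
    scan removes every entry whose age is at least max_age.
    """
    eligible_at = written_at + max_age
    scans_needed = max(1, math.ceil(eligible_at / scan_interval))
    return scans_needed * scan_interval


MAX_AGE = 10       # minutes, the purge age from the test described above
SCAN_INTERVAL = 5  # minutes between purge scans

# Delay between writing an entry and its removal, for writes at each minute.
delays = [removal_time(t, MAX_AGE, SCAN_INTERVAL) - t for t in range(60)]
print(min(delays), max(delays))
```

Under this model no entry outlives MAX_AGE + SCAN_INTERVAL minutes, so waiting
15 minutes is enough to see everything except the retained newest entry gone.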

--Quanah

Comment 6 Howard Chu 2018-01-30 21:41:11 UTC

changed notes
changed state Open to Test

Comment 7 Quanah Gibson-Mount 2018-02-09 17:54:35 UTC

changed notes
changed state Test to Release

Comment 8 OpenLDAP project 2018-03-22 19:25:02 UTC

fixed in master
fixed in RE24 (2.4.46)

Comment 9 Quanah Gibson-Mount 2018-03-22 19:25:02 UTC

changed notes
changed state Release to Closed

Comment 10 Quanah Gibson-Mount 2020-03-23 20:45:47 UTC

*** Issue 8921 has been marked as a duplicate of this issue. ***