Full_Name: Quanah Gibson-Mount
Version: 2.4.39
OS: Linux 2.6
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (50.25.188.166)

In an MMR setup using delta-syncrepl, when one master gets out of sync, or when a new MMR node is added to an existing cluster, replication stays broken until the new/reloaded node has a write op that goes to the accesslog DB. In an existing cluster where a node is being reloaded, this causes all nodes to go into an endlessly looping fallback sync until that write occurs.
moved from Incoming to Software Bugs
--On Thursday, April 09, 2015 5:42 AM +0000 quanah@openldap.org wrote:

> Full_Name: Quanah Gibson-Mount
> Version: 2.4.39
> OS: Linux 2.6
> URL: ftp://ftp.openldap.org/incoming/
> Submission from: (NULL) (50.25.188.166)
>
>
> When one has an MMR setup using delta-syncrepl, and the masters get into a
> situation where one is out of sync, or adding a new MMR node to an
> existing cluster, things will be broken until the new/reloaded node has a
> write op that goes to the accesslog DB. In an existing cluster, where a
> node is being reloaded, it causes all nodes to go into an endless looping
> fallback sync until that write occurs.

One possible fix for this would be to refuse to delete the final entry in the accesslog during the purge phase. That way, the accesslog would never be empty. I'm not sure how difficult this would be to implement, code-wise.

--Quanah

--
Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
<http://www.symas.com>
quanah@symas.com wrote:

> One possible fix for this, would be to refuse to delete the final entry in
> the accesslog during the purge phase. That way, the accesslog would never
> be empty. I'm not sure how difficult this would be to implement, code wise.

A patch which skips deleting the final entry, and creates an initial dummy log entry if needed, is available in https://github.com/quanah/openldap-scratch/tree/its8100 for testing.

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
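To make the two behaviours described above concrete (never purge the newest entry, and seed an empty log with a dummy entry), here is a minimal Python sketch. It is not the patch itself: `LogEntry`, `ensure_not_empty`, and `purge` are hypothetical names used only for illustration, and the real accesslog overlay is written in C and works on LDAP entries rather than Python objects.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class LogEntry:
    """Hypothetical stand-in for one accesslog entry (not slapd's real structure)."""
    logged_at: datetime
    note: str


def ensure_not_empty(log: List[LogEntry], now: datetime) -> List[LogEntry]:
    """Return a copy of the log, seeded with one dummy entry if it was empty."""
    if log:
        return list(log)
    return [LogEntry(logged_at=now, note="dummy")]


def purge(log: List[LogEntry], now: datetime, max_age: timedelta) -> List[LogEntry]:
    """Drop entries older than max_age, but always keep the newest entry."""
    if not log:
        return []
    newest = max(log, key=lambda e: e.logged_at)
    cutoff = now - max_age
    # The newest entry survives even when it is older than the cutoff,
    # so a non-empty log can never be purged down to nothing.
    return [e for e in log if e is newest or e.logged_at >= cutoff]
```

With both rules in place, a log that has been seeded once stays non-empty however long the server goes without a write.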
--On Friday, January 26, 2018 8:23 PM +0000 hyc@symas.com wrote:

> A patch which skips deleting the final entry, and creates an initial
> dummy log entry if needed, is available in
> https://github.com/quanah/openldap-scratch/tree/its8100 for testing.

Hi Howard,

When reinstalling a 4-way MMR system from scratch, we still end up in REFRESH mode. In the database I'm loading, there are 4 contextCSN values, one per active master:

contextCSN: 20171203010043.825769Z#000000#001#000000
contextCSN: 20171130222521.056018Z#000000#002#000000
contextCSN: 20171130222318.939265Z#000000#003#000000
contextCSN: 20171203041258.811473Z#000000#004#000000

When I start up the first master (serverID 4 in this case), a contextCSN value is properly written for it to the underlying db:

Jan 29 10:06:06 anvil4 slapd[1949]: slapd starting
Jan 29 10:06:06 anvil4 slapd[1949]: slap_queue_csn: queueing 0x7f54d4104220 20171203041258.811473Z#000000#004#000000
Jan 29 10:06:06 anvil4 slapd[1949]: slap_queue_csn: queueing 0x7f54d4104cc0 20171203041258.811473Z#000000#004#000000
Jan 29 10:06:06 anvil4 slapd[1949]: slap_graduate_commit_csn: removing 0x7f54d4104cc0 20171203041258.811473Z#000000#004#000000
Jan 29 10:06:06 anvil4 slapd[1949]: slap_graduate_commit_csn: removing 0x7f54d4104220 20171203041258.811473Z#000000#004#000000

But when I start the other 3 masters, they do not write an entry for their CSN, and since there's no CSN value for them on the other masters either, they all fall back to REFRESH_DELETE:

Jan 29 10:06:26 anvil4 slapd[1949]: do_syncrep2: rid=003 LDAP_RES_INTERMEDIATE - REFRESH_DELETE

Even worse, they do this for every master that comes online.

I think the code needs to add an entry to the accesslog for every contextCSN value, not just the final contextCSN?

I'll continue testing for the other half of the fix (deleting all but the most recent entry from the accesslog during purge).

Thanks!
--Quanah
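The suggestion of one accesslog entry per contextCSN can be illustrated by splitting the four values from the message above into their fields and listing the serverIDs that have no log entry. This is only a sketch: it assumes the CSN layout `timestamp#count#sid#mod` with the last three fields in hexadecimal, and `parse_csn` and `sids_missing_from_log` are hypothetical helpers, not slapd functions.

```python
from typing import Iterable, List, NamedTuple


class CSN(NamedTuple):
    timestamp: str
    count: int
    sid: int
    mod: int


def parse_csn(value: str) -> CSN:
    """Split a CSN string into its four '#'-separated fields."""
    timestamp, count, sid, mod = value.split("#")
    # Assumption: count, sid, and mod are hexadecimal strings.
    return CSN(timestamp, int(count, 16), int(sid, 16), int(mod, 16))


# The four contextCSN values quoted in the report.
CONTEXT_CSNS = [
    "20171203010043.825769Z#000000#001#000000",
    "20171130222521.056018Z#000000#002#000000",
    "20171130222318.939265Z#000000#003#000000",
    "20171203041258.811473Z#000000#004#000000",
]


def sids_missing_from_log(context_csns: Iterable[str], logged_sids: Iterable[int]) -> List[int]:
    """Return serverIDs that own a contextCSN but have no accesslog entry."""
    known = {parse_csn(value).sid for value in context_csns}
    return sorted(known - set(logged_sids))
```

If only serverID 4 has a log entry, as in the startup described above, `sids_missing_from_log(CONTEXT_CSNS, {4})` reports serverIDs 1, 2, and 3 as the ones with no entry.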
--On Monday, January 29, 2018 10:23 AM -0800 Quanah Gibson-Mount <quanah@symas.com> wrote:

> I'll continue testing for the other half of the fix (Deleting all but the
> most recent entry from the accesslog during purge)

This part appears to work as desired. I set the purge age to 10 minutes, checking every 5 minutes, and made changes. All entries but the most recent one were removed after 15 minutes went by. I made more changes, waited the same period, and again all entries but the most recent were removed during the next cleanup interval.

--Quanah
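The 15-minute wait in this test follows from the two settings: an entry must first reach the purge age, and is then removed at the next periodic check. The sketch below works through that arithmetic. It assumes checks run at fixed multiples of the interval, and `first_purge_after` is a hypothetical helper. In slapo-accesslog terms the settings presumably correspond to something like `logpurge 00+00:10 00+00:05`, though the message does not quote the actual configuration.

```python
from datetime import timedelta


def first_purge_after(written: timedelta, age: timedelta, interval: timedelta) -> timedelta:
    """Time of the first periodic check at which an entry is at least `age` old.

    All times are offsets from server start. Checks are assumed to run at
    0, interval, 2*interval, ... (an assumption of this sketch).
    """
    eligible = written + age
    # Ceiling division on timedeltas: how many intervals until `eligible` is reached.
    checks = -((-eligible) // interval)
    return checks * interval


AGE = timedelta(minutes=10)
INTERVAL = timedelta(minutes=5)
```

An entry written one minute after a check is purged at the 15-minute mark, 14 minutes after it was written. In general the delay is at least the age and at most age plus interval, which is 15 minutes with these settings.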
changed notes changed state Open to Test
changed notes changed state Test to Release
fixed in master
fixed in RE24 (2.4.46)
changed notes changed state Release to Closed
*** Issue 8921 has been marked as a duplicate of this issue. ***