Full_Name: Quanah Gibson-Mount
Version: RE24
OS: Linux 2.6
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (75.111.52.177)

On Friday, May 1st, mmr00 was reloaded from mmr01, so the two nodes had exactly the same dataset. Since Friday, significant drift has occurred between the two nodes. One example is the following MODRDN request that was made to mmr00. Note: clocks are tightly in sync between both nodes.

On MMR00, we see the MODRDN occur:

May 1 16:10:18 zm-mmr00 slapd[52931]: conn=13581 op=2 MODRDN dn="uid=xxxxxx,ou=people,dc=yyyy,dc=net"
May 1 16:10:18 zm-mmr00 slapd[52931]: conn=1145 op=181451 SEARCH RESULT tag=101 err=0 nentries=1 text=
May 1 16:10:18 zm-mmr00 slapd[52931]: slap_queue_csn: queueing 0x7f2b8c8931c0 20150501201018.574740Z#000000#002#000000
May 1 16:10:18 zm-mmr00 slapd[52931]: slap_queue_csn: queueing 0x14ebeb40 20150501201018.574740Z#000000#002#000000
May 1 16:10:18 zm-mmr00 slapd[52931]: slap_graduate_commit_csn: removing 0x1d53e480 20150501201018.574740Z#000000#002#000000
May 1 16:10:18 zm-mmr00 slapd[52931]: conn=13581 op=2 RESULT tag=109 err=0 text=
May 1 16:10:18 zm-mmr00 slapd[52931]: slap_graduate_commit_csn: removing 0x10be8520 20150501201018.574740Z#000000#002#000000

Everything looks good on this host. Now on to MMR01...
Nearly 1.5 HOURS LATER we see:

May 1 17:27:36 zm-mmr01 slapd[43243]: syncrepl_message_to_op: rid=100 be_modify uid=someotherentry,ou=people,dc=differentdomain,dc=com (0)
May 1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 cookie=rid=100,sid=002,csn=20150501201017.390034Z#000000#00b#000000
May 1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 CSN too old, ignoring 20150501201017.390034Z#000000#00b#000000 (reqStart=20150501201017.000302Z,cn=accesslog)
May 1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 cookie=rid=100,sid=002,csn=20150501201017.528111Z#000000#00b#000000
May 1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 CSN too old, ignoring 20150501201017.528111Z#000000#00b#000000 (reqStart=20150501201017.000362Z,cn=accesslog)
May 1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 cookie=rid=100,sid=002,csn=20150501201017.619314Z#000000#00b#000000
May 1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 CSN too old, ignoring 20150501201017.619314Z#000000#00b#000000 (reqStart=20150501201017.000406Z,cn=accesslog)
May 1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 cookie=rid=100,sid=002,csn=20150501201018.502559Z#000000#002#000000
May 1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 CSN too old, ignoring 20150501201018.502559Z#000000#002#000000 (reqStart=20150501201018.000269Z,cn=accesslog)
May 1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 cookie=rid=100,sid=002,csn=20150501201018.519620Z#000000#002#000000
May 1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 CSN too old, ignoring 20150501201018.519620Z#000000#002#000000 (reqStart=20150501201018.000289Z,cn=accesslog)
May 1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 cookie=rid=100,sid=002,csn=20150501201018.574740Z#000000#002#000000
May 1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 CSN too old, ignoring 20150501201018.574740Z#000000#002#000000 (reqStart=20150501201018.000336Z,cn=accesslog)
May 1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 cookie=rid=100,sid=002,csn=20150501201018.611377Z#000000#002#000000
May 1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 CSN too old, ignoring 20150501201018.611377Z#000000#002#000000 (reqStart=20150501201018.000358Z,cn=accesslog)
May 1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 cookie=rid=100,sid=002,csn=20150501201018.577381Z#000000#00b#000000
May 1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 CSN too old, ignoring 20150501201018.577381Z#000000#00b#000000 (reqStart=20150501201018.000366Z,cn=accesslog)
May 1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 cookie=rid=100,sid=002,csn=20150501201018.628228Z#000000#002#000000
May 1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 CSN too old, ignoring 20150501201018.628228Z#000000#002#000000 (reqStart=20150501201018.000378Z,cn=accesslog)
May 1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 cookie=rid=100,sid=002,csn=20150501201018.648496Z#000000#002#000000
May 1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 CSN too old, ignoring 20150501201018.648496Z#000000#002#000000 (reqStart=20150501201018.000398Z,cn=accesslog)
May 1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 cookie=rid=100,sid=002,csn=20150501201018.747799Z#000000#00b#000000
May 1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 CSN too old, ignoring 20150501201018.747799Z#000000#00b#000000 (reqStart=20150501201018.000517Z,cn=accesslog)
May 1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 cookie=rid=100,sid=002,csn=20150501201018.926821Z#000000#00b#000000
May 1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 CSN too old, ignoring 20150501201018.926821Z#000000#00b#000000 (reqStart=20150501201018.000639Z,cn=accesslog)
May 1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 cookie=rid=100,sid=002,csn=20150501201019.051637Z#000000#00b#000000
May 1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 CSN too old, ignoring 20150501201019.051637Z#000000#00b#000000 (reqStart=20150501201019.000052Z,cn=accesslog)
May 1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 cookie=rid=100,sid=002,csn=20150501201019.107217Z#000000#00b#000000
May 1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 CSN too old, ignoring 20150501201019.107217Z#000000#00b#000000 (reqStart=20150501201019.000092Z,cn=accesslog)
May 1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 cookie=rid=100,sid=002,csn=20150501201019.165448Z#000000#00b#000000
May 1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 CSN too old, ignoring 20150501201019.165448Z#000000#00b#000000 (reqStart=20150501201019.000146Z,cn=accesslog)
May 1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 cookie=rid=100,sid=002,csn=20150501201019.221163Z#000000#00b#000000
May 1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 CSN too old, ignoring 20150501201019.221163Z#000000#00b#000000 (reqStart=20150501201019.000190Z,cn=accesslog)
May 1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 cookie=rid=100,sid=002
May 1 17:27:37 zm-mmr01 slapd[43243]: do_syncrep2: rid=102 cookie=rid=102,sid=00b,csn=20150501212737.068523Z#000000#00b#000000
May 1 17:27:37 zm-mmr01 slapd[43243]: syncrepl_message_to_op: rid=102 be_modify uid=anotherdifferententry,ou=people,dc=samedomain,dc=net (0)
--On May 4, 2015 at 8:00:45 PM +0000 openldap-its@OpenLDAP.org wrote: In addition to the missing entries, multiple differences exist for existing entries as well. At this point, it seems that replication with 2.4.41 is entirely unsafe. @@ -187797,6 +187797,7 @@ zimbraAccountStatus: active zimbraAuthTokens: 1001665890|1430683289367|8.6.0_GA_1153 zimbraAuthTokens: 1429612485|1430681470808|8.6.0_GA_1153 +zimbraAuthTokens: 1452180966|1430677629732|8.6.0_GA_1153 zimbraAuthTokens: 1844325568|1430677341824|8.6.0_GA_1153 zimbraAuthTokens: 1880697989|1430683571359|8.6.0_GA_1153 zimbraAuthTokens: 1954510867|1430680813216|8.6.0_GA_1153 @@ -187808,7 +187809,7 @@ zimbraCOSId: 6a0277ec-2ace-4b16-9764-6fe02f8c7f62 zimbraCreateTimestamp: 20150427191452Z zimbraId: c4cc5985-1727-4446-a196-1cbef2fd9e8a -zimbraLastLogonTimestamp: 20150501182221Z +zimbraLastLogonTimestamp: 20150501182703Z zimbraMailDeliveryAddress: xyz@somewhere.com zimbraMailHost: zm-mbs18.neonova.net zimbraMailQuota: 16106127360 @@ -557800,6 +557801,7 @@ zimbraAuthTokens: 1153410464|1430794333078|8.6.0_GA_1153 zimbraAuthTokens: 1153875370|1430796671000|8.6.0_GA_1153 zimbraAuthTokens: 115391065|1430806341396|8.6.0_GA_1153 +zimbraAuthTokens: 1157520768|1430548147751|8.6.0_GA_1153 zimbraAuthTokens: 1187850134|1430806463699|8.6.0_GA_1153 zimbraAuthTokens: 1192041973|1430794095671|8.6.0_GA_1153 zimbraAuthTokens: 1200977988|1430807243978|8.6.0_GA_1153 @@ -557812,6 +557814,7 @@ zimbraAuthTokens: 1229145474|1430807429549|8.6.0_GA_1153 zimbraAuthTokens: 1230922062|1430797919322|8.6.0_GA_1153 zimbraAuthTokens: 1240838648|1430793498690|8.6.0_GA_1153 +zimbraAuthTokens: 1245512453|1430548279974|8.6.0_GA_1153 zimbraAuthTokens: 1256797348|1430790405205|8.6.0_GA_1153 zimbraAuthTokens: 125774918|1430789944147|8.6.0_GA_1153 zimbraAuthTokens: 1257992365|1430800900512|8.6.0_GA_1153 @@ -557821,6 +557824,7 @@ zimbraAuthTokens: 1291211579|1430793942471|8.6.0_GA_1153 zimbraAuthTokens: 1294660932|1430796182543|8.6.0_GA_1153 zimbraAuthTokens: 
1299240495|1430807682002|8.6.0_GA_1153 +zimbraAuthTokens: 1303662472|1430548232316|8.6.0_GA_1153 zimbraAuthTokens: 1309389194|1430801132708|8.6.0_GA_1153 zimbraAuthTokens: 1316038520|1430802345713|8.6.0_GA_1153 zimbraAuthTokens: 1323416810|1430775076568|8.6.0_GA_1153 @@ -557888,6 +557892,7 @@ zimbraAuthTokens: 1546458862|1430782672619|8.6.0_GA_1153 zimbraAuthTokens: 155006363|1430792026242|8.6.0_GA_1153 zimbraAuthTokens: 1554772102|1430787555656|8.6.0_GA_1153 +zimbraAuthTokens: 1555500886|1430548271932|8.6.0_GA_1153 zimbraAuthTokens: 1560251088|1430793442122|8.6.0_GA_1153 zimbraAuthTokens: 1560280871|1430777440334|8.6.0_GA_1153 zimbraAuthTokens: 1561800299|1430806414358|8.6.0_GA_1153 @@ -557929,6 +557934,7 @@ zimbraAuthTokens: 1713221378|1430795062424|8.6.0_GA_1153 zimbraAuthTokens: 1714081224|1430794690175|8.6.0_GA_1153 zimbraAuthTokens: 1719102352|1430787580315|8.6.0_GA_1153 +zimbraAuthTokens: 1721546851|1430548275954|8.6.0_GA_1153 zimbraAuthTokens: 1722045342|1430794777719|8.6.0_GA_1153 zimbraAuthTokens: 1728013476|1430807125376|8.6.0_GA_1153 zimbraAuthTokens: 1730079937|1430800179238|8.6.0_GA_1153 @@ -557939,6 +557945,7 @@ zimbraAuthTokens: 1745425399|1430796457601|8.6.0_GA_1153 zimbraAuthTokens: 1752670040|1430789168438|8.6.0_GA_1153 zimbraAuthTokens: 1760000546|1430796630874|8.6.0_GA_1153 +zimbraAuthTokens: 1760771363|1430548094582|8.6.0_GA_1153 zimbraAuthTokens: 1764080232|1430788720545|8.6.0_GA_1153 zimbraAuthTokens: 1764393930|1430769111435|8.6.0_GA_1153 zimbraAuthTokens: 1767121722|1430800106951|8.6.0_GA_1153 @@ -557961,9 +557968,11 @@ zimbraAuthTokens: 1818662300|1430785463584|8.6.0_GA_1153 zimbraAuthTokens: 1819397934|1430784129509|8.6.0_GA_1153 zimbraAuthTokens: 1825833485|1430792892690|8.6.0_GA_1153 +zimbraAuthTokens: 1826263844|1430548283971|8.6.0_GA_1153 zimbraAuthTokens: 1826635337|1430783161924|8.6.0_GA_1153 zimbraAuthTokens: 1828370704|1430798363104|8.6.0_GA_1153 zimbraAuthTokens: 184971068|1430794694175|8.6.0_GA_1153 +zimbraAuthTokens: 
1852733329|1430548280003|8.6.0_GA_1153 zimbraAuthTokens: 1854808177|1430794699578|8.6.0_GA_1153 zimbraAuthTokens: 1856036969|1430801152796|8.6.0_GA_1153 zimbraAuthTokens: 1856540960|1430796150392|8.6.0_GA_1153 @@ -558010,6 +558019,7 @@ zimbraAuthTokens: 2030397720|1430788935660|8.6.0_GA_1153 zimbraAuthTokens: 2033511890|1430788572002|8.6.0_GA_1153 zimbraAuthTokens: 2035641769|1430800861823|8.6.0_GA_1153 +zimbraAuthTokens: 20415307|1430547981772|8.6.0_GA_1153 zimbraAuthTokens: 2042478638|1430807440122|8.6.0_GA_1153 zimbraAuthTokens: 2053877671|1430796197464|8.6.0_GA_1153 zimbraAuthTokens: 2059018018|1430796466336|8.6.0_GA_1153 @@ -558060,6 +558070,7 @@ zimbraAuthTokens: 335636684|1430792575890|8.6.0_GA_1153 zimbraAuthTokens: 335985202|1430801830053|8.6.0_GA_1153 zimbraAuthTokens: 345836372|1430800522916|8.6.0_GA_1153 +zimbraAuthTokens: 353782778|1430548279978|8.6.0_GA_1153 zimbraAuthTokens: 357224412|1430799078650|8.6.0_GA_1153 zimbraAuthTokens: 359847881|1430806579447|8.6.0_GA_1153 zimbraAuthTokens: 364380894|1430784519780|8.6.0_GA_1153 @@ -558086,6 +558097,7 @@ zimbraAuthTokens: 464892423|1430792026225|8.6.0_GA_1153 zimbraAuthTokens: 465832041|1430796118188|8.6.0_GA_1153 zimbraAuthTokens: 470587431|1430790706176|8.6.0_GA_1153 +zimbraAuthTokens: 488520030|1430548271921|8.6.0_GA_1153 zimbraAuthTokens: 493433000|1430789116600|8.6.0_GA_1153 zimbraAuthTokens: 49927151|1430800597298|8.6.0_GA_1153 zimbraAuthTokens: 505233009|1430803732532|8.6.0_GA_1153 @@ -558119,6 +558131,7 @@ zimbraAuthTokens: 624216082|1430790986205|8.6.0_GA_1153 zimbraAuthTokens: 627453014|1430802588253|8.6.0_GA_1153 zimbraAuthTokens: 63479916|1430800763552|8.6.0_GA_1153 +zimbraAuthTokens: 636439348|1430548271924|8.6.0_GA_1153 zimbraAuthTokens: 638569970|1430766646718|8.6.0_GA_1153 zimbraAuthTokens: 642528535|1430793436206|8.6.0_GA_1153 zimbraAuthTokens: 653581266|1430803568548|8.6.0_GA_1153 @@ -558131,12 +558144,14 @@ zimbraAuthTokens: 674880688|1430790542019|8.6.0_GA_1153 zimbraAuthTokens: 
678557445|1430792426265|8.6.0_GA_1153 zimbraAuthTokens: 688306806|1430789944170|8.6.0_GA_1153 +zimbraAuthTokens: 692909864|1430548275944|8.6.0_GA_1153 zimbraAuthTokens: 695873279|1430791878331|8.6.0_GA_1153 zimbraAuthTokens: 702220076|1430791770891|8.6.0_GA_1153 zimbraAuthTokens: 702905805|1430793051338|8.6.0_GA_1153 zimbraAuthTokens: 703978454|1430801188894|8.6.0_GA_1153 zimbraAuthTokens: 704499344|1430797009784|8.6.0_GA_1153 zimbraAuthTokens: 709844890|1430790284534|8.6.0_GA_1153 +zimbraAuthTokens: 712623687|1430548275943|8.6.0_GA_1153 zimbraAuthTokens: 714322774|1430801693731|8.6.0_GA_1153 zimbraAuthTokens: 714539921|1430794312192|8.6.0_GA_1153 zimbraAuthTokens: 719501426|1430806353190|8.6.0_GA_1153 @@ -906794,14 +906809,20 @@ uid: something userPassword:: a hash zimbraAccountStatus: active +zimbraAuthTokens: 1076665182|1430495517704|8.6.0_GA_1153 +zimbraAuthTokens: 1194289445|1430495472230|8.6.0_GA_1153 +zimbraAuthTokens: 1400191457|1430488755054|8.6.0_GA_1153 zimbraAuthTokens: 1446436024|1430910444064|8.6.0_GA_1153 zimbraAuthTokens: 1494232116|1430754183294|8.6.0_GA_1153 zimbraAuthTokens: 1548373083|1430859654841|8.6.0_GA_1153 +zimbraAuthTokens: 1549007812|1430488727192|8.6.0_GA_1153 zimbraAuthTokens: 1586033245|1430759173317|8.6.0_GA_1153 zimbraAuthTokens: 1751858048|1430924809541|8.6.0_GA_1153 +zimbraAuthTokens: 1800356229|1430488677708|8.6.0_GA_1153 zimbraAuthTokens: 2062630693|1430852313888|8.6.0_GA_1153 zimbraAuthTokens: 480907313|1430754137307|8.6.0_GA_1153 zimbraAuthTokens: 718652654|1430852353198|8.6.0_GA_1153 +zimbraAuthTokens: 800231695|1430488640026|8.6.0_GA_1153 zimbraAuthTokens: 911489049|1430919384737|8.6.0_GA_1153 zimbraCOSId: 6a0277ec-2ace-4b16-9764-6fe02f8c7f62 zimbraCreateTimestamp: 20150407152511Z @@ -2342424,13 +2342445,11 @@ uid: something userPassword:: a hash zimbraAccountStatus: active -zimbraAuthTokens: 1032074923|1430678006084|8.6.0_GA_1153 -zimbraAuthTokens: 1359481243|1430678139287|8.6.0_GA_1153 zimbraAuthTokens: 
1367205599|1430685608493|8.6.0_GA_1153 zimbraAuthTokens: 146136864|1430743294805|8.6.0_GA_1153 -zimbraAuthTokens: 1583721154|1430678002509|8.6.0_GA_1153 zimbraAuthTokens: 1938799710|1430831569979|8.6.0_GA_1153 -zimbraAuthTokens: 326869791|1430678012645|8.6.0_GA_1153 +zimbraAuthTokens: 1986204084|1430485163391|8.6.0_GA_1153 +zimbraAuthTokens: 701958474|1430488599819|8.6.0_GA_1153 zimbraCOSId: 6a0277ec-2ace-4b16-9764-6fe02f8c7f62 zimbraCreateTimestamp: 20150311184123Z zimbraCsrfTokenData: data @@ -2977160,8 +2977179,8 @@ uid: something userPassword: a hash zimbraAccountStatus: active -zimbraAuthTokens: 1592805688|1430678263487|8.6.0_GA_1153 zimbraAuthTokens: 1840536155|1430747750113|8.6.0_GA_1153 +zimbraAuthTokens: 297688289|1430677809486|8.6.0_GA_1153 zimbraAuthTokens: 447463167|1430585978383|8.6.0_GA_1153 zimbraCOSId: 6a0277ec-2ace-4b16-9764-6fe02f8c7f62 zimbraCreateTimestamp: 20150407152611Z @@ -4051413,7 +4051432,6 @@ uid: something userPassword:: a hash zimbraAccountStatus: active -zimbraAuthTokens: 1906714231|1430457251759|8.6.0_GA_1153 zimbraAuthTokens: 637421329|1430881121421|8.6.0_GA_1153 zimbraCOSId: 6a0277ec-2ace-4b16-9764-6fe02f8c7f62 zimbraCreateTimestamp: 20150324191333Z @@ -5306187,7 +5306205,6 @@ userPassword:: a hash zimbraAccountStatus: active zimbraAuthTokens: 1215263276|1430662360810|8.6.0_GA_1153 -zimbraAuthTokens: 2103162606|1430678212226|8.6.0_GA_1153 zimbraAuthTokens: 31692696|1430712827710|8.6.0_GA_1153 zimbraAuthTokens: 496395190|1430702599395|8.6.0_GA_1153 zimbraCOSId: 6a0277ec-2ace-4b16-9764-6fe02f8c7f62 @@ -7689287,6 +7689304,7 @@ userPassword:: a hash zimbraAccountStatus: active zimbraAuthTokens: 1361047436|1430788305556|8.6.0_GA_1153 +zimbraAuthTokens: 1379747754|1430499334346|8.6.0_GA_1153 zimbraAuthTokens: 1584426902|1430875712010|8.6.0_GA_1153 zimbraAuthTokens: 1927293924|1430875661923|8.6.0_GA_1153 zimbraAuthTokens: 979685923|1430788353488|8.6.0_GA_1153 @@ -8198997,8 +8199015,10 @@ uid: something userPassword:: a hash 
zimbraAccountStatus: active +zimbraAuthTokens: 111522850|1430323733525|8.6.0_GA_1153 zimbraAuthTokens: 1643425923|1430886776671|8.6.0_GA_1153 zimbraAuthTokens: 1728545135|1430720880858|8.6.0_GA_1153 +zimbraAuthTokens: 757970470|1430493817783|8.6.0_GA_1153 zimbraCOSId: 6a0277ec-2ace-4b16-9764-6fe02f8c7f62 zimbraCreateTimestamp: 20150415194756Z zimbraId: 5147a109-a13b-4264-985e-c1e4999261da @@ -9048815,7 +9048835,6 @@ uid: something userPassword:: a hash zimbraAccountStatus: active -zimbraAuthTokens: 820727281|1430672042988|8.6.0_GA_1153 zimbraCOSId: 6a0277ec-2ace-4b16-9764-6fe02f8c7f62 zimbraCreateTimestamp: 20150311185757Z zimbraCsrfTokenData: 69643d33363a65643532303936632d636336392d346136362d626137342d3937366538643337616365363b6578703d31333a313432383738303434383930303b7369643d31303a2d3730363739343338353b:f0a19918939e46e2a5a1b69ba91 710de:1428780448900 @@ -12477050,12 +12477043,12 @@ createTimestamp: 20141208203040Z creatorsName: uid=zimbra,cn=admins,cn=zimbra displayName: something -entryCSN: 20150426144745.317031Z#000000#001#000000 +entryCSN: 20150501183105.859472Z#000000#002#000000 entryUUID: d1d1f11e-1364-1034-8f85-d75f97c57676 givenName: something mail: someone@somewhere.net modifiersName: uid=zimbra,cn=admins,cn=zimbra -modifyTimestamp: 20150426144745Z +modifyTimestamp: 20150501183105Z objectClass: amavisAccount objectClass: inetOrgPerson objectClass: zimbraAccount (etc) --Quanah -- Quanah Gibson-Mount Platform Architect Zimbra, Inc. -------------------- Zimbra :: the leader in open source messaging and collaboration
--On Friday, May 08, 2015 7:56 PM +0000 quanah@zimbra.com wrote:

> --On May 4, 2015 at 8:00:45 PM +0000 openldap-its@OpenLDAP.org wrote:
>
> In addition to the missing entries, multiple differences exist for
> existing entries as well. At this point, it seems that replication with
> 2.4.41 is entirely unsafe.

The existing entry drift seems to be related to the locked-up replication that occurred because of a bug in OpenSSL. So the real issue is just the entries that weren't created, which is a current known limitation of syncrepl refresh that needs fixing.

--Quanah

--
Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra :: the leader in open source messaging and collaboration
--On Monday, May 11, 2015 7:32 PM +0000 quanah@zimbra.com wrote:

> The existing entry drift seems to be related to the locked-up replication
> that occurred because of a bug in OpenSSL. So the real issue is just the
> entries that weren't created, which is a current known limitation of
> syncrepl refresh that needs fixing.

Howard suggests that, at least in the delta-syncrepl case, we can modify syncprov to use the accesslog as the memory for its syncprov session log, rather than the session log used with standard syncrepl.

--Quanah

--
Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra :: the leader in open source messaging and collaboration
--On Monday, May 11, 2015 7:32 PM +0000 quanah@zimbra.com wrote:

> --On Friday, May 08, 2015 7:56 PM +0000 quanah@zimbra.com wrote:
>
>> --On May 4, 2015 at 8:00:45 PM +0000 openldap-its@OpenLDAP.org wrote:
>>
>> In addition to the missing entries, multiple differences exist for
>> existing entries as well. At this point, it seems that replication with
>> 2.4.41 is entirely unsafe.
>
> The existing entry drift seems to be related to the locked-up replication
> that occurred because of a bug in OpenSSL. So the real issue is just the
> entries that weren't created, which is a current known limitation of
> syncrepl refresh that needs fixing.

Unfortunately, I'm now seeing this at a second customer site with 2-node MMR running current RE24 (2.4.41). In this case:

ldap01 and ldap02 were stopped
openldap was upgraded from 2.4.39 to 2.4.41
ldap01's db was slapcat'd
ldap02's db was erased
ldap02's db was recreated from ldap01's slapcat

Now they are seeing drift and constant refresh troubles:

[zimbra@ldap01-dc01 ~]$ grep REFRESH /var/log/zimbra.log
May 15 08:56:02 ldap01-dc01 slapd[31479]: do_syncrep2: rid=201 delta-sync lost sync on (reqStart=20150515065528.000299Z,cn=accesslog), switching to REFRESH
May 15 08:58:07 ldap01-dc01 slapd[31479]: do_syncrep2: rid=201 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 15 09:57:18 ldap01-dc01 slapd[31479]: do_syncrep2: rid=201 delta-sync lost sync on (reqStart=20150515075718.000081Z,cn=accesslog), switching to REFRESH
May 15 10:46:57 ldap01-dc01 slapd[31479]: do_syncrep2: rid=201 delta-sync lost sync on (reqStart=20150515084653.000032Z,cn=accesslog), switching to REFRESH
May 15 12:43:18 ldap01-dc01 slapd[31479]: do_syncrep2: rid=201 delta-sync lost sync on (reqStart=20150515104318.000049Z,cn=accesslog), switching to REFRESH
May 15 16:32:19 ldap01-dc01 slapd[31479]: do_syncrep2: rid=201 delta-sync lost sync on (reqStart=20150515143211.000250Z,cn=accesslog), switching to REFRESH
May 15 16:36:41 ldap01-dc01 slapd[31479]: do_syncrep2: rid=201 delta-sync lost sync on (reqStart=20150515143556.000046Z,cn=accesslog), switching to REFRESH
May 15 17:23:10 ldap01-dc01 slapd[31479]: do_syncrep2: rid=201 delta-sync lost sync on (reqStart=20150515152310.000002Z,cn=accesslog), switching to REFRESH

[zimbra@ldap02-dc01 xx]$ grep FRESH /var/log/zimbra.log
May 15 08:35:46 ldap02-dc01 slapd[18167]: do_syncrep2: rid=201 delta-sync lost sync on (reqStart=20150515063539.000009Z,cn=accesslog), switching to REFRESH
May 15 08:47:44 ldap02-dc01 slapd[18167]: do_syncrep2: rid=201 delta-sync lost sync on (reqStart=20150515064733.000043Z,cn=accesslog), switching to REFRESH
May 15 08:56:03 ldap02-dc01 slapd[18167]: do_syncrep2: rid=201 delta-sync lost sync on (reqStart=20150515065528.000234Z,cn=accesslog), switching to REFRESH
May 15 08:57:24 ldap02-dc01 slapd[18167]: do_syncrep2: rid=201 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 15 09:31:12 ldap02-dc01 slapd[18167]: do_syncrep2: rid=201 delta-sync lost sync on (reqStart=20150515073112.000086Z,cn=accesslog), switching to REFRESH
May 15 10:46:58 ldap02-dc01 slapd[18167]: do_syncrep2: rid=201 delta-sync lost sync on (reqStart=20150515084653.000003Z,cn=accesslog), switching to REFRESH
May 15 13:44:11 ldap02-dc01 slapd[18167]: do_syncrep2: rid=201 delta-sync lost sync on (reqStart=20150515114405.000313Z,cn=accesslog), switching to REFRESH
May 15 14:54:49 ldap02-dc01 slapd[18167]: do_syncrep2: rid=201 delta-sync lost sync on (reqStart=20150515125448.000002Z,cn=accesslog), switching to REFRESH
May 15 17:23:11 ldap02-dc01 slapd[18167]: do_syncrep2: rid=201 delta-sync lost sync on (reqStart=20150515152311.000035Z,cn=accesslog), switching to REFRESH

This is caused by modify ops:

[zimbra@ldap02-dc01 xx]$ grep \(20\) /var/log/zimbra.log
May 15 08:35:46 ldap02-dc01 slapd[18167]: syncrepl_message_to_op: rid=201 be_modify uid=xxx,ou=people,dc=xxx,dc=com (20)
May 15 08:47:44 ldap02-dc01 slapd[18167]: syncrepl_message_to_op: rid=201 be_modify uid=xxx,ou=people,dc=yyy,dc=co,dc=za (20)
May 15 08:56:03 ldap02-dc01 slapd[18167]: syncrepl_message_to_op: rid=201 be_modify uid=xxx,ou=people,dc=zzz,dc=co,dc=za (20)
May 15 09:31:12 ldap02-dc01 slapd[18167]: syncrepl_message_to_op: rid=201 be_modify uid=xxx,ou=people,dc=www,dc=co,dc=za (20)
May 15 10:46:58 ldap02-dc01 slapd[18167]: syncrepl_message_to_op: rid=201 be_modify uid=xxx,ou=people,dc=ppp,dc=co,dc=za (20)
May 15 13:44:11 ldap02-dc01 slapd[18167]: syncrepl_message_to_op: rid=201 be_modify uid=xxx,ou=people,dc=rrr,dc=co,dc=za (20)
May 15 14:54:49 ldap02-dc01 slapd[18167]: syncrepl_message_to_op: rid=201 be_modify uid=xxx,ou=people,dc=zzz,dc=com (20)
May 15 17:23:11 ldap02-dc01 slapd[18167]: syncrepl_message_to_op: rid=201 be_modify uid=xxx,ou=people,dc=zzz,dc=com (20)

So overall, there seem to be significant problems with RE24 replication at this time. Clocks are tightly in sync.

--Quanah

--
Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra :: the leader in open source messaging and collaboration
--On Friday, May 15, 2015 7:25 PM +0000 quanah@zimbra.com wrote:

> Clocks are tightly in sync.

Seeing tons of "greater than snapshot" errors: 108 so far today on one host, 144 on the other. One example:

May 15 17:33:19 ldap01-dc01 slapd[31479]: Entry uid=xxxx,ou=people,dc=yyyy,dc=co,dc=za CSN 20150515153317.592677Z#000000#002#000000 greater than snapshot 20150515153317.320608Z#000000#002#000000

--Quanah

--
Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra :: the leader in open source messaging and collaboration
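[For readers following along: a CSN is a plain string of the form timestamp#count#sid#mod, and within a single sid the "greater than snapshot" check above reduces to an ordered string comparison. A minimal illustrative sketch — not slapd's actual code, and the field decoding is an assumption for readability:]

```python
# Illustrative only: decode an OpenLDAP CSN of the form
# "YYYYmmddHHMMSS.ffffffZ#count#sid#mod" into its four fields.
def csn_fields(csn: str):
    stamp, count, sid, mod = csn.split("#")
    # count/sid/mod are hexadecimal in slapd's format
    return stamp, int(count, 16), int(sid, 16), int(mod, 16)

# The two CSNs from the log message above, both from sid 0x002.
entry_csn = "20150515153317.592677Z#000000#002#000000"
snapshot  = "20150515153317.320608Z#000000#002#000000"

# Within one sid, CSNs sort correctly as plain strings, so the
# entry's CSN (.592677) post-dating the snapshot (.320608) is
# exactly what triggers "greater than snapshot".
assert csn_fields(entry_csn)[2] == 2
assert entry_csn > snapshot
```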
quanah@zimbra.com wrote:
> This is caused by modify ops:
>
> [zimbra@ldap02-dc01 xx]$ grep \(20\) /var/log/zimbra.log
> May 15 08:35:46 ldap02-dc01 slapd[18167]: syncrepl_message_to_op: rid=201
> be_modify uid=xxx,ou=people,dc=xxx,dc=com (20)

Multiple TYPE_OR_VALUE_EXISTS errors indicate you're sending a lot of mods that were already performed. Sounds more like a serverID or URL misconfig here if there are actually only 2 servers talking.

> May 15 08:47:44 ldap02-dc01 slapd[18167]: syncrepl_message_to_op: rid=201
> be_modify uid=xxx,ou=people,dc=yyy,dc=co,dc=za (20)
> May 15 08:56:03 ldap02-dc01 slapd[18167]: syncrepl_message_to_op: rid=201
> be_modify uid=xxx,ou=people,dc=zzz,dc=co,dc=za (20)
> May 15 09:31:12 ldap02-dc01 slapd[18167]: syncrepl_message_to_op: rid=201
> be_modify uid=xxx,ou=people,dc=www,dc=co,dc=za (20)
> May 15 10:46:58 ldap02-dc01 slapd[18167]: syncrepl_message_to_op: rid=201
> be_modify uid=xxx,ou=people,dc=ppp,dc=co,dc=za (20)
> May 15 13:44:11 ldap02-dc01 slapd[18167]: syncrepl_message_to_op: rid=201
> be_modify uid=xxx,ou=people,dc=rrr,dc=co,dc=za (20)
> May 15 14:54:49 ldap02-dc01 slapd[18167]: syncrepl_message_to_op: rid=201
> be_modify uid=xxx,ou=people,dc=zzz,dc=com (20)
> May 15 17:23:11 ldap02-dc01 slapd[18167]: syncrepl_message_to_op: rid=201
> be_modify uid=xxx,ou=people,dc=zzz,dc=com (20)
>
> So overall, there seem to be significant problems with RE24 in relation to
> replication at this time.
>
> Clocks are tightly in sync.
>
> --Quanah
>
> --
> Quanah Gibson-Mount
> Platform Architect
> Zimbra, Inc.
> --------------------
> Zimbra :: the leader in open source messaging and collaboration

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
--On Friday, May 15, 2015 8:39 PM +0100 Howard Chu <hyc@symas.com> wrote:

> quanah@zimbra.com wrote:
>> This is caused by modify ops:
>>
>> [zimbra@ldap02-dc01 xx]$ grep \(20\) /var/log/zimbra.log
>> May 15 08:35:46 ldap02-dc01 slapd[18167]: syncrepl_message_to_op: rid=201
>> be_modify uid=xxx,ou=people,dc=xxx,dc=com (20)
>
> Multiple TYPE_OR_VALUE_EXISTS errors indicate you're sending a lot of
> mods that were already performed. Sounds more like a serverID or URL
> misconfig here if there are actually only 2 servers talking.

ServerIDs are unique (2 & 3). There is no URL portion to the config, as we're not replicating cn=config. In 6 days, the servers have gotten totally out of sync with one another, and REFRESH mode isn't fixing it either.

--Quanah

--
Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra :: the leader in open source messaging and collaboration
--On Friday, May 15, 2015 7:45 PM +0000 quanah@zimbra.com wrote:

> --On Friday, May 15, 2015 8:39 PM +0100 Howard Chu <hyc@symas.com> wrote:
>
>> quanah@zimbra.com wrote:
>>> This is caused by modify ops:
>>>
>>> [zimbra@ldap02-dc01 xx]$ grep \(20\) /var/log/zimbra.log
>>> May 15 08:35:46 ldap02-dc01 slapd[18167]: syncrepl_message_to_op:
>>> rid=201 be_modify uid=xxx,ou=people,dc=xxx,dc=com (20)
>>
>> Multiple TYPE_OR_VALUE_EXISTS errors indicate you're sending a lot of
>> mods that were already performed. Sounds more like a serverID or URL
>> misconfig here if there are actually only 2 servers talking.
>
> ServerIDs are unique (2 & 3). There is no URL portion to the config, as
> we're not replicating cn=config. In 6 days, the servers have gotten
> totally out of sync with one another, and REFRESH mode isn't fixing it
> either.

Note that the drift is occurring on an attribute that is multi-valued: zimbraPasswordLockoutFailureTime. The values stored in the attribute are only granular down to the second, so it's not uncommon for multiple writes of the same value (add/mod/delete) to occur for this attribute.

--Quanah

--
Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra :: the leader in open source messaging and collaboration
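[The failure mode described above can be reduced to a toy model — the helper below is hypothetical, not slapd or Zimbra code: because the lockout timestamps are only second-granular, two failed logins in the same second both try to add the identical value, and replaying the second add yields err=20 (LDAP_TYPE_OR_VALUE_EXISTS):]

```python
# Toy model of a multi-valued LDAP attribute: adding a value that is
# already present fails with LDAP_TYPE_OR_VALUE_EXISTS (20).
LDAP_SUCCESS = 0
LDAP_TYPE_OR_VALUE_EXISTS = 20

def add_value(attr_values: set, value: str) -> int:
    """Simulate an LDAP 'add' modification of one attribute value."""
    if value in attr_values:
        return LDAP_TYPE_OR_VALUE_EXISTS
    attr_values.add(value)
    return LDAP_SUCCESS

# Two password failures in the same second produce the same
# second-granular zimbraPasswordLockoutFailureTime value.
failure_times = set()
assert add_value(failure_times, "20150515063539Z") == LDAP_SUCCESS
# Replaying the identical add (e.g. a mod applied twice) hits err=20:
assert add_value(failure_times, "20150515063539Z") == LDAP_TYPE_OR_VALUE_EXISTS
```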
--On Friday, May 15, 2015 7:53 PM +0000 quanah@zimbra.com wrote: Capturing what was discussed of this issue in IRC at the time: <hyc> too bad. yes, there is a case where a refresh can delete entries that shouldn't have been deleted <hyc> we discussed it here a month or so ago <hyc> but if all servers were in persist phase with delta, that shouldn't happen <JoBbZ> hm.. should have been out of refresh <JoBbZ> it did a refresh @ zimbra.log.30:May 1 14:46:02 zm-mmr01 slapd[43243]: do_syncrep2: rid=101 LDAP_RES_INTERMEDIATE - REFRESH_DELETE <JoBbZ> but that was hours before <hyc> and it was also behind by 1.5 hours <hyc> ... <JoBbZ> is there a message logged when refresh mode ends? <hyc> that's the message. <hyc> at least, that is the message from the provider that it's done sending refresh messages <hyc> then the consumer runs syncrepl_del_nonpresent based on the provider messages <JoBbZ> zimbra.log.30:May 1 18:37:33 zm-mmr01 slapd[43243]: syncrepl_del_nonpresent: rid=102 be_delete uid=skoelzer,ou=people,dc=unitedfiber,dc=email (0) <JoBbZ> to <JoBbZ> zimbra.log.30:May 1 18:37:35 zm-mmr01 slapd[43243]: syncrepl_del_nonpresent: rid=101 be_delete uid=fsodell2,ou=people,dc=ntin,dc=net (32) <JoBbZ> why would refresh mode end @ 14:146, and then 4 hours later, it does del_nonpresent? 
<hyc> it might have had a lot of refresh messages to parse
<hyc> but 4 hours seems excessive, yeah
<JoBbZ> but it was replicating during that time too
<JoBbZ> May 1 15:00:00 zm-mmr01 slapd[43243]: syncrepl_message_to_op: rid=102 be_modify uid=gibbonsj,ou=people,dc=bevcomm,dc=net (0)
<JoBbZ> for example
<JoBbZ> but thousands and thousands
<hyc> that's rid=102
<hyc> you were looking at rid=101
<JoBbZ> May 1 18:42:18 zm-mmr01 slapd[43243]: do_syncrep2: rid=102 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> May 1 18:42:20 zm-mmr01 slapd[43243]: do_syncrep2: rid=101 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> they both finished around the same time
<hyc> this is 4way so where's the 3rd rid
<JoBbZ> May 1 14:46:02 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> hm
<JoBbZ> this is interesting..
<JoBbZ> May 1 14:18:45 zm-mmr01 slapd[36133]: do_syncrep2: rid=100 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> May 1 14:19:59 zm-mmr01 slapd[36133]: do_syncrep2: rid=101 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> May 1 14:19:59 zm-mmr01 slapd[36133]: do_syncrep2: rid=102 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> so it did one sync then
<JoBbZ> May 1 14:46:02 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> May 1 14:46:02 zm-mmr01 slapd[43243]: do_syncrep2: rid=102 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> May 1 14:46:02 zm-mmr01 slapd[43243]: do_syncrep2: rid=101 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> another sync then
<JoBbZ> a sync here
<JoBbZ> May 1 18:37:34 zm-mmr01 slapd[43243]: do_syncrep2: rid=102 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> May 1 18:37:35 zm-mmr01 slapd[43243]: do_syncrep2: rid=101 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> and one at
<JoBbZ> May 1 18:42:18 zm-mmr01 slapd[43243]: do_syncrep2: rid=102 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> May 1 18:42:20 zm-mmr01 slapd[43243]: do_syncrep2: rid=101 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> ugh
<hyc> no rid=100 from the last 2?
<JoBbZ> nope
<JoBbZ> let me go look at the rid assignments rq
<hyc> that's suspicious
<JoBbZ> hm.. i bet it is losing connections
<JoBbZ> 101/102 are in nashville
<JoBbZ> 100 is in greenville, along with this server
<JoBbZ> yeah, full fallback triggered at 18:37:04
<JoBbZ> sigh
<JoBbZ> but this was a modrdn and an account add
<JoBbZ> no deletes
<hyc> was the Add the missing entry?
<JoBbZ> yeah
<hyc> yeah, that's the known issue then
<JoBbZ> ok, what about the modrdn?
<hyc> which isn't in an ITS so we should probably use the one you just filed
<hyc> haven't thought about the modrdn but it's probably the same thing.
<JoBbZ> ok
<hyc> in refresh without syncprov sessionlog, the provider has to send the consumer a list of entryUUIDs of every entry that exists on the provider
<hyc> the consumer then searches every entry in the consumer, compares to the list of entryUUIDs from the provider, and deletes any entry that's not in the provider's list
<hyc> the problem is that an entry can be missing from this particular provider's list because its update was coming from a different provider
<hyc> so it gets deleted even though it was properly received from the other provider
<JoBbZ> ok, makes sense
<hyc> if you have a syncprov sessionlog that's large enough to cover the span of disconnected time
<hyc> then the provider sends a list of deleted entryUUIDs instead of present UUIDs
<hyc> and this bug can't happen then

--Quanah

--
Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
<http://www.symas.com>
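The present-phase cleanup hyc describes can be modeled in a few lines. The following is a hypothetical Python sketch (not the actual slapd C code in syncrepl_del_nonpresent): the consumer deletes every local entry whose entryUUID is absent from this provider's present list, which is exactly how an entry legitimately received from a *different* provider gets wrongly deleted.

```python
def del_nonpresent(consumer_entries, provider_present_uuids):
    """Model of syncrepl present-phase cleanup: keep only the entries
    whose entryUUID the refreshing provider reported as present."""
    deleted = [uuid for uuid in consumer_entries
               if uuid not in provider_present_uuids]
    kept = {uuid: dn for uuid, dn in consumer_entries.items()
            if uuid in provider_present_uuids}
    return kept, deleted

# Entry "c" was replicated from another provider; this provider has not
# seen it yet, so its present list omits it -- and "c" is deleted anyway.
consumer = {"a": "uid=one", "b": "uid=two", "c": "uid=from-other-provider"}
present = {"a", "b"}
kept, deleted = del_nonpresent(consumer, present)
```

With a sufficiently large syncprov sessionlog the provider sends deleted entryUUIDs instead of a present list, so this comparison never runs and the bug cannot occur.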
moved from Incoming to Software Bugs
changed notes
This is my understanding of the above discussion:
- deltasync consumer has just switched to full refresh (but is ahead from this provider in some ways)
- provider sends the present list
- consumer deletes extra entries, builds a new cookie
- problem is that the new cookie is built to reflect the union of both the local and received cookies even though we may have undone some of the changes which we then ignore

If that's accurate, there are some approaches that could fix it:

1. Simple one is to remember the actual cookie we got from the server and refuse to delete entries with entryCSN ahead of the provided CSN set. Problem is that we get even further from being able to replicate from a generic RFC4533 provider.

2. Instead, when present phase is initiated, we might terminate all other sessions, adopt the complete CSN set and restart them only once the new CSN set has been fully established.

Also, whenever we fall back from deltasync into plain syncrepl, we should make sure that the accesslog entries we generate from this are never used for further replication, which might be thought to be a separate issue. Maybe the ITS#8486 work might be useful for this if we have a way of signalling to accesslog to reset minCSN according to the new CSN set.

The former is simpler, but the latter feels like the only one that actually addresses these problems in full.

--
Ondřej Kuzník
Senior Software Engineer
Symas Corporation http://www.symas.com
Packaged, certified, and supported LDAP solutions powered by OpenLDAP
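Approach 1 above can be sketched as a guard on the non-present delete. This is a hypothetical Python model, not slapd code; it assumes the standard CSN layout `YYYYmmddHHMMSS.ffffffZ#count#sid#mod`, whose fixed-width fields make plain string comparison correct within one sid:

```python
def parse_sid(csn):
    # CSN layout: timestamp '#' change count '#' server ID '#' modification number
    return csn.split("#")[2]

def safe_to_delete(entry_csn, provider_csn_set):
    """Approach 1 guard (hypothetical): only allow the present-phase
    cleanup to delete an entry if its entryCSN is not ahead of the
    cookie CSN the provider actually sent for that entry's sid."""
    sid = parse_sid(entry_csn)
    provider_csn = provider_csn_set.get(sid)
    if provider_csn is None:
        # Provider's cookie says nothing about this sid; keep the entry.
        return False
    return entry_csn <= provider_csn
```

For example, an entry last written by sid 00b after the point this provider's cookie covers would be kept rather than deleted, at the cost (as noted above) of diverging further from what a generic RFC 4533 provider can support.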
ondra@mistotebe.net wrote:
> This is my understanding of the above discussion:
> - deltasync consumer has just switched to full refresh (but is ahead
>   from this provider in some ways)
> - provider sends the present list
> - consumer deletes extra entries, builds a new cookie
> - problem is that the new cookie is built to reflect the union of both
>   the local and received cookies even though we may have undone some of
>   the changes which we then ignore
>
> If that's accurate, there are some approaches that could fix it:
>
> 1. Simple one is to remember the actual cookie we got from the server
>    and refuse to delete entries with entryCSN ahead of the provided CSN
>    set. Problem is that we get even further from being able to replicate
>    from a generic RFC4533 provider.
>
> 2. Instead, when present phase is initiated, we might terminate all
>    other sessions, adopt the complete CSN set and restart them only once
>    the new CSN set has been fully established.

(2) makes sense.

> Also, whenever we fall back from deltasync into plain syncrepl, we
> should make sure that the accesslog entries we generate from this are
> never used for further replication, which might be thought to be a
> separate issue.

That should already be the case, since none of these ops will have a valid CSN.

> Maybe the ITS#8486 work might be useful for this if we have a way of
> signalling to accesslog to reset minCSN according to the new CSN set.
>
> The former is simpler, but the latter feels like the only one that
> actually addresses these problems in full.

--
Howard Chu
CTO, Symas Corp.           http://www.symas.com
Director, Highland Sun     http://highlandsun.com/hyc/
Chief Architect, OpenLDAP  http://www.openldap.org/project/
On Mon, Oct 15, 2018 at 01:53:30PM +0000, hyc@symas.com wrote:
> ondra@mistotebe.net wrote:
>> This is my understanding of the above discussion:
>> - deltasync consumer has just switched to full refresh (but is ahead
>>   from this provider in some ways)
>> - provider sends the present list
>> - consumer deletes extra entries, builds a new cookie
>> - problem is that the new cookie is built to reflect the union of both
>>   the local and received cookies even though we may have undone some of
>>   the changes which we then ignore
>>
>> If that's accurate, there are some approaches that could fix it:
>>
>> 1. Simple one is to remember the actual cookie we got from the server
>>    and refuse to delete entries with entryCSN ahead of the provided CSN
>>    set. Problem is that we get even further from being able to replicate
>>    from a generic RFC4533 provider.
>>
>> 2. Instead, when present phase is initiated, we might terminate all
>>    other sessions, adopt the complete CSN set and restart them only once
>>    the new CSN set has been fully established.
>
> (2) makes sense.
>
>> Also, whenever we fall back from deltasync into plain syncrepl, we
>> should make sure that the accesslog entries we generate from this are
>> never used for further replication, which might be thought to be a
>> separate issue.
>
> That should already be the case, since none of these ops will have a valid CSN.

I faintly remember Quanah seeing these accesslog entries used by consumers at some point, but I might be mistaken.

The more general point is making sure a potential syncrepl consumer doesn't even try to use the accesslog entries we added before these - the refresh has created a strange gap in the middle (or worse, duplicated ops if a contextCSN element jumped backwards). But if we enforced that, the question is how to get modifications originating from this replica replicated elsewhere - unless we decide they can't be salvaged?

And should the contextCSN reset terminate not just all inbound syncrepl sessions, but the outbound ones as well?

--
Ondřej Kuzník
Senior Software Engineer
Symas Corporation http://www.symas.com
Packaged, certified, and supported LDAP solutions powered by OpenLDAP
Ondřej Kuzník wrote:
> On Mon, Oct 15, 2018 at 01:53:30PM +0000, hyc@symas.com wrote:
>> ondra@mistotebe.net wrote:
>>> Also, whenever we fall back from deltasync into plain syncrepl, we
>>> should make sure that the accesslog entries we generate from this are
>>> never used for further replication, which might be thought to be a
>>> separate issue.
>>
>> That should already be the case, since none of these ops will have a valid CSN.
>
> I faintly remember Quanah seeing these accesslog entries used by
> consumers at some point, but I might be mistaken.
>
> The more general point is making sure a potential syncrepl consumer
> doesn't even try to use the accesslog entries we added before these -
> the refresh has created a strange gap in the middle (or worse,
> duplicated ops if a contextCSN element jumped backwards). But if we
> enforced that, the question is how to get modifications originating
> from this replica replicated elsewhere - unless we decide they can't
> be salvaged?

We could set the replica to reject user mods while in refresh phase. Not sure how friendly that is, whether apps would be smart enough to retry somewhere else.

> And should the contextCSN reset terminate not just all inbound syncrepl
> sessions, but the outbound ones as well?

Need to be careful about race conditions here, or you could end up with all nodes just terminating each other and everything halting.

--
Howard Chu
CTO, Symas Corp.           http://www.symas.com
Director, Highland Sun     http://highlandsun.com/hyc/
Chief Architect, OpenLDAP  http://www.openldap.org/project/
On Mon, Oct 15, 2018 at 02:54:56PM +0000, hyc@symas.com wrote:
> Ondřej Kuzník wrote:
>> On Mon, Oct 15, 2018 at 01:53:30PM +0000, hyc@symas.com wrote:
>>> ondra@mistotebe.net wrote:
>>>> Also, whenever we fall back from deltasync into plain syncrepl,
>>>> we should make sure that the accesslog entries we generate from
>>>> this are never used for further replication, which might be
>>>> thought to be a separate issue.
>>>
>>> That should already be the case, since none of these ops will have a
>>> valid CSN.
>>
>> I faintly remember Quanah seeing these accesslog entries used by
>> consumers at some point, but I might be mistaken.
>>
>> The more general point is making sure a potential syncrepl consumer
>> doesn't even try to use the accesslog entries we added before these -
>> the refresh has created a strange gap in the middle (or worse,
>> duplicated ops if a contextCSN element jumped backwards). But if we
>> enforced that, the question is how to get modifications originating
>> from this replica replicated elsewhere - unless we decide they can't
>> be salvaged?
>
> We could set the replica to reject user mods while in refresh phase.
> Not sure how friendly that is, whether apps would be smart enough to
> retry somewhere else.

The concern here is about changes that happened before we found out we can't replicate from another server. And it is likely some of these changes are the reason we couldn't reconcile with our provider, and they would cause the same problem if we decided to push them.

>> And should the contextCSN reset terminate not just all inbound syncrepl
>> sessions, but the outbound ones as well?
>
> Need to be careful about race conditions here, or you could end up
> with all nodes just terminating each other and everything halting.

Yes, that would actually happen... The existing state seems quite destructive though: if you have that same situation now (two masters in present phase from each other at the same time), you lose data.

The question is what the priority is here. Currently it seems we want replication to continue at the expense of losing modifications on conflict. We might at least log that this happened and allow someone to revert the decision later.

--
Ondřej Kuzník
Senior Software Engineer
Symas Corporation http://www.symas.com
Packaged, certified, and supported LDAP solutions powered by OpenLDAP
See also ITS#6467
(In reply to Ondřej Kuzník from comment #12)
> This is my understanding of the above discussion:
> - deltasync consumer has just switched to full refresh (but is ahead
>   from this provider in some ways)
> - provider sends the present list
> - consumer deletes extra entries, builds a new cookie
> - problem is that the new cookie is built to reflect the union of both
>   the local and received cookies even though we may have undone some of
>   the changes which we then ignore
>
> If that's accurate, there are some approaches that could fix it:
>
> 1. Simple one is to remember the actual cookie we got from the server
>    and refuse to delete entries with entryCSN ahead of the provided CSN
>    set. Problem is that we get even further from being able to replicate
>    from a generic RFC4533 provider.

This has actually been done in ITS#9282.

> 2. Instead, when present phase is initiated, we might terminate all
>    other sessions, adopt the complete CSN set and restart them only once
>    the new CSN set has been fully established.
>
> Also, whenever we fall back from deltasync into plain syncrepl, we
> should make sure that the accesslog entries we generate from this are
> never used for further replication, which might be thought to be a
> separate issue. Maybe the ITS#8486 work might be useful for this if
> we have a way of signalling to accesslog to reset minCSN according
> to the new CSN set.
>
> The former is simpler, but the latter feels like the only one that
> actually addresses these problems in full.

I have some code to do this: terminate only persist sessions when we detect we are getting into a present refresh. I need a way to reproduce this in current master, since a lot of the issues would have been fixed in ITS#9282; it might only still diverge in relayed deltasync, possibly when we're refreshing from two other providers at the same time.
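The session handling described in approach 2 can be modeled as a small state machine. This is a hypothetical Python sketch (the class and state names are illustrative, not slapd internals): when one consumer session enters a present-phase refresh, the other (persist) sessions are paused, and everything restarts on the provider's complete CSN set only after the refresh finishes.

```python
class RefreshCoordinator:
    """Hypothetical model of approach 2: while one syncrepl session is in
    a present-phase refresh, all other sessions stay paused; they restart
    only once the refreshed CSN set has been fully adopted."""

    def __init__(self, rids):
        self.state = {rid: "persist" for rid in rids}
        self.csn_set = {}

    def begin_present_refresh(self, rid):
        # Terminate/pause every other session before the present phase
        # can delete entries those sessions delivered.
        for other in self.state:
            if other != rid:
                self.state[other] = "paused"
        self.state[rid] = "refreshing"

    def finish_refresh(self, rid, new_csn_set):
        # Adopt the provider's complete CSN set, then restart all
        # sessions against the newly established cookie.
        self.csn_set = dict(new_csn_set)
        for r in self.state:
            self.state[r] = "persist"
```

As noted earlier in the thread, a real implementation would have to guard against nodes mutually terminating each other's sessions when two of them refresh from one another at the same time.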
Believed fixed by recent MMR work.