Issue 8125 - MMR throws away valid changes causing drift
Summary: MMR throws away valid changes causing drift
Status: VERIFIED FIXED
Alias: None
Product: OpenLDAP
Classification: Unclassified
Component: slapd
Version: unspecified
Hardware: All
OS: All
Importance: Highest blocker
Target Milestone: 2.5.1
Assignee: OpenLDAP project
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-05-04 20:00 UTC by Quanah Gibson-Mount
Modified: 2021-06-14 15:44 UTC
CC List: 2 users

See Also:


Attachments

Description Quanah Gibson-Mount 2015-05-04 20:00:43 UTC
Full_Name: Quanah Gibson-Mount
Version: RE24
OS: Linux 2.6
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (75.111.52.177)


On Friday, May 1st, we reloaded mmr00 from mmr01, so they had exactly the same
dataset.  Since Friday, significant drift has occurred between the two nodes.
One example is the following MODRDN request that was made to mmr00.  Note:
Clocks are tightly in sync between both nodes.

On MMR00, we see the modrdn occur:

May  1 16:10:18 zm-mmr00 slapd[52931]: conn=13581 op=2 MODRDN
dn="uid=xxxxxx,ou=people,dc=yyyy,dc=net"
May  1 16:10:18 zm-mmr00 slapd[52931]: conn=1145 op=181451 SEARCH RESULT tag=101
err=0 nentries=1 text=
May  1 16:10:18 zm-mmr00 slapd[52931]: slap_queue_csn: queueing 0x7f2b8c8931c0
20150501201018.574740Z#000000#002#000000
May  1 16:10:18 zm-mmr00 slapd[52931]: slap_queue_csn: queueing 0x14ebeb40
20150501201018.574740Z#000000#002#000000
May  1 16:10:18 zm-mmr00 slapd[52931]: slap_graduate_commit_csn: removing
0x1d53e480 20150501201018.574740Z#000000#002#000000
May  1 16:10:18 zm-mmr00 slapd[52931]: conn=13581 op=2 RESULT tag=109 err=0
text=
May  1 16:10:18 zm-mmr00 slapd[52931]: slap_graduate_commit_csn: removing
0x10be8520 20150501201018.574740Z#000000#002#000000



Everything looks good on this host.  Now on to MMR01... nearly 1.5 HOURS LATER
WE SEE:

May  1 17:27:36 zm-mmr01 slapd[43243]: syncrepl_message_to_op: rid=100 be_modify
uid=someotherentry,ou=people,dc=differentdomain,dc=com (0)
May  1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100
cookie=rid=100,sid=002,csn=20150501201017.390034Z#000000#00b#000000
May  1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 CSN too old,
ignoring 20150501201017.390034Z#000000#00b#000000
(reqStart=20150501201017.000302Z,cn=accesslog)
May  1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100
cookie=rid=100,sid=002,csn=20150501201017.528111Z#000000#00b#000000
May  1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 CSN too old,
ignoring 20150501201017.528111Z#000000#00b#000000
(reqStart=20150501201017.000362Z,cn=accesslog)
May  1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100
cookie=rid=100,sid=002,csn=20150501201017.619314Z#000000#00b#000000
May  1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 CSN too old,
ignoring 20150501201017.619314Z#000000#00b#000000
(reqStart=20150501201017.000406Z,cn=accesslog)
May  1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100
cookie=rid=100,sid=002,csn=20150501201018.502559Z#000000#002#000000
May  1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 CSN too old,
ignoring 20150501201018.502559Z#000000#002#000000
(reqStart=20150501201018.000269Z,cn=accesslog)
May  1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100
cookie=rid=100,sid=002,csn=20150501201018.519620Z#000000#002#000000
May  1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 CSN too old,
ignoring 20150501201018.519620Z#000000#002#000000
(reqStart=20150501201018.000289Z,cn=accesslog)
May  1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100
cookie=rid=100,sid=002,csn=20150501201018.574740Z#000000#002#000000
May  1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 CSN too old,
ignoring 20150501201018.574740Z#000000#002#000000
(reqStart=20150501201018.000336Z,cn=accesslog)
May  1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100
cookie=rid=100,sid=002,csn=20150501201018.611377Z#000000#002#000000
May  1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 CSN too old,
ignoring 20150501201018.611377Z#000000#002#000000
(reqStart=20150501201018.000358Z,cn=accesslog)
May  1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100
cookie=rid=100,sid=002,csn=20150501201018.577381Z#000000#00b#000000
May  1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 CSN too old,
ignoring 20150501201018.577381Z#000000#00b#000000
(reqStart=20150501201018.000366Z,cn=accesslog)
May  1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100
cookie=rid=100,sid=002,csn=20150501201018.628228Z#000000#002#000000
May  1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 CSN too old,
ignoring 20150501201018.628228Z#000000#002#000000
(reqStart=20150501201018.000378Z,cn=accesslog)
May  1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100
cookie=rid=100,sid=002,csn=20150501201018.648496Z#000000#002#000000
May  1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 CSN too old,
ignoring 20150501201018.648496Z#000000#002#000000
(reqStart=20150501201018.000398Z,cn=accesslog)
May  1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100
cookie=rid=100,sid=002,csn=20150501201018.747799Z#000000#00b#000000
May  1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 CSN too old,
ignoring 20150501201018.747799Z#000000#00b#000000
(reqStart=20150501201018.000517Z,cn=accesslog)
May  1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100
cookie=rid=100,sid=002,csn=20150501201018.926821Z#000000#00b#000000
May  1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 CSN too old,
ignoring 20150501201018.926821Z#000000#00b#000000
(reqStart=20150501201018.000639Z,cn=accesslog)
May  1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100
cookie=rid=100,sid=002,csn=20150501201019.051637Z#000000#00b#000000
May  1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 CSN too old,
ignoring 20150501201019.051637Z#000000#00b#000000
(reqStart=20150501201019.000052Z,cn=accesslog)
May  1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100
cookie=rid=100,sid=002,csn=20150501201019.107217Z#000000#00b#000000
May  1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 CSN too old,
ignoring 20150501201019.107217Z#000000#00b#000000
(reqStart=20150501201019.000092Z,cn=accesslog)
May  1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100
cookie=rid=100,sid=002,csn=20150501201019.165448Z#000000#00b#000000
May  1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 CSN too old,
ignoring 20150501201019.165448Z#000000#00b#000000
(reqStart=20150501201019.000146Z,cn=accesslog)
May  1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100
cookie=rid=100,sid=002,csn=20150501201019.221163Z#000000#00b#000000
May  1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 CSN too old,
ignoring 20150501201019.221163Z#000000#00b#000000
(reqStart=20150501201019.000190Z,cn=accesslog)
May  1 17:27:36 zm-mmr01 slapd[43243]: do_syncrep2: rid=100
cookie=rid=100,sid=002
May  1 17:27:37 zm-mmr01 slapd[43243]: do_syncrep2: rid=102
cookie=rid=102,sid=00b,csn=20150501212737.068523Z#000000#00b#000000
May  1 17:27:37 zm-mmr01 slapd[43243]: syncrepl_message_to_op: rid=102 be_modify
uid=anotherdifferententry,ou=people,dc=samedomain,dc=net (0)
Comment 1 Quanah Gibson-Mount 2015-05-08 18:57:32 UTC

--On May 4, 2015 at 8:00:45 PM +0000 openldap-its@OpenLDAP.org wrote:


In addition to the missing entries, multiple differences exist for existing 
entries as well.  At this point, it seems that replication with 2.4.41 is 
entirely unsafe.

@@ -187797,6 +187797,7 @@
 zimbraAccountStatus: active
 zimbraAuthTokens: 1001665890|1430683289367|8.6.0_GA_1153
 zimbraAuthTokens: 1429612485|1430681470808|8.6.0_GA_1153
+zimbraAuthTokens: 1452180966|1430677629732|8.6.0_GA_1153
 zimbraAuthTokens: 1844325568|1430677341824|8.6.0_GA_1153
 zimbraAuthTokens: 1880697989|1430683571359|8.6.0_GA_1153
 zimbraAuthTokens: 1954510867|1430680813216|8.6.0_GA_1153
@@ -187808,7 +187809,7 @@
 zimbraCOSId: 6a0277ec-2ace-4b16-9764-6fe02f8c7f62
 zimbraCreateTimestamp: 20150427191452Z
 zimbraId: c4cc5985-1727-4446-a196-1cbef2fd9e8a
-zimbraLastLogonTimestamp: 20150501182221Z
+zimbraLastLogonTimestamp: 20150501182703Z
 zimbraMailDeliveryAddress: xyz@somewhere.com
 zimbraMailHost: zm-mbs18.neonova.net
 zimbraMailQuota: 16106127360
@@ -557800,6 +557801,7 @@
 zimbraAuthTokens: 1153410464|1430794333078|8.6.0_GA_1153
 zimbraAuthTokens: 1153875370|1430796671000|8.6.0_GA_1153
 zimbraAuthTokens: 115391065|1430806341396|8.6.0_GA_1153
+zimbraAuthTokens: 1157520768|1430548147751|8.6.0_GA_1153
 zimbraAuthTokens: 1187850134|1430806463699|8.6.0_GA_1153
 zimbraAuthTokens: 1192041973|1430794095671|8.6.0_GA_1153
 zimbraAuthTokens: 1200977988|1430807243978|8.6.0_GA_1153
@@ -557812,6 +557814,7 @@
 zimbraAuthTokens: 1229145474|1430807429549|8.6.0_GA_1153
 zimbraAuthTokens: 1230922062|1430797919322|8.6.0_GA_1153
 zimbraAuthTokens: 1240838648|1430793498690|8.6.0_GA_1153
+zimbraAuthTokens: 1245512453|1430548279974|8.6.0_GA_1153
 zimbraAuthTokens: 1256797348|1430790405205|8.6.0_GA_1153
 zimbraAuthTokens: 125774918|1430789944147|8.6.0_GA_1153
 zimbraAuthTokens: 1257992365|1430800900512|8.6.0_GA_1153
@@ -557821,6 +557824,7 @@
 zimbraAuthTokens: 1291211579|1430793942471|8.6.0_GA_1153
 zimbraAuthTokens: 1294660932|1430796182543|8.6.0_GA_1153
 zimbraAuthTokens: 1299240495|1430807682002|8.6.0_GA_1153
+zimbraAuthTokens: 1303662472|1430548232316|8.6.0_GA_1153
 zimbraAuthTokens: 1309389194|1430801132708|8.6.0_GA_1153
 zimbraAuthTokens: 1316038520|1430802345713|8.6.0_GA_1153
 zimbraAuthTokens: 1323416810|1430775076568|8.6.0_GA_1153
@@ -557888,6 +557892,7 @@
 zimbraAuthTokens: 1546458862|1430782672619|8.6.0_GA_1153
 zimbraAuthTokens: 155006363|1430792026242|8.6.0_GA_1153
 zimbraAuthTokens: 1554772102|1430787555656|8.6.0_GA_1153
+zimbraAuthTokens: 1555500886|1430548271932|8.6.0_GA_1153
 zimbraAuthTokens: 1560251088|1430793442122|8.6.0_GA_1153
 zimbraAuthTokens: 1560280871|1430777440334|8.6.0_GA_1153
 zimbraAuthTokens: 1561800299|1430806414358|8.6.0_GA_1153
@@ -557929,6 +557934,7 @@
 zimbraAuthTokens: 1713221378|1430795062424|8.6.0_GA_1153
 zimbraAuthTokens: 1714081224|1430794690175|8.6.0_GA_1153
 zimbraAuthTokens: 1719102352|1430787580315|8.6.0_GA_1153
+zimbraAuthTokens: 1721546851|1430548275954|8.6.0_GA_1153
 zimbraAuthTokens: 1722045342|1430794777719|8.6.0_GA_1153
 zimbraAuthTokens: 1728013476|1430807125376|8.6.0_GA_1153
 zimbraAuthTokens: 1730079937|1430800179238|8.6.0_GA_1153
@@ -557939,6 +557945,7 @@
 zimbraAuthTokens: 1745425399|1430796457601|8.6.0_GA_1153
 zimbraAuthTokens: 1752670040|1430789168438|8.6.0_GA_1153
 zimbraAuthTokens: 1760000546|1430796630874|8.6.0_GA_1153
+zimbraAuthTokens: 1760771363|1430548094582|8.6.0_GA_1153
 zimbraAuthTokens: 1764080232|1430788720545|8.6.0_GA_1153
 zimbraAuthTokens: 1764393930|1430769111435|8.6.0_GA_1153
 zimbraAuthTokens: 1767121722|1430800106951|8.6.0_GA_1153
@@ -557961,9 +557968,11 @@
 zimbraAuthTokens: 1818662300|1430785463584|8.6.0_GA_1153
 zimbraAuthTokens: 1819397934|1430784129509|8.6.0_GA_1153
 zimbraAuthTokens: 1825833485|1430792892690|8.6.0_GA_1153
+zimbraAuthTokens: 1826263844|1430548283971|8.6.0_GA_1153
 zimbraAuthTokens: 1826635337|1430783161924|8.6.0_GA_1153
 zimbraAuthTokens: 1828370704|1430798363104|8.6.0_GA_1153
 zimbraAuthTokens: 184971068|1430794694175|8.6.0_GA_1153
+zimbraAuthTokens: 1852733329|1430548280003|8.6.0_GA_1153
 zimbraAuthTokens: 1854808177|1430794699578|8.6.0_GA_1153
 zimbraAuthTokens: 1856036969|1430801152796|8.6.0_GA_1153
 zimbraAuthTokens: 1856540960|1430796150392|8.6.0_GA_1153
@@ -558010,6 +558019,7 @@
 zimbraAuthTokens: 2030397720|1430788935660|8.6.0_GA_1153
 zimbraAuthTokens: 2033511890|1430788572002|8.6.0_GA_1153
 zimbraAuthTokens: 2035641769|1430800861823|8.6.0_GA_1153
+zimbraAuthTokens: 20415307|1430547981772|8.6.0_GA_1153
 zimbraAuthTokens: 2042478638|1430807440122|8.6.0_GA_1153
 zimbraAuthTokens: 2053877671|1430796197464|8.6.0_GA_1153
 zimbraAuthTokens: 2059018018|1430796466336|8.6.0_GA_1153
@@ -558060,6 +558070,7 @@
 zimbraAuthTokens: 335636684|1430792575890|8.6.0_GA_1153
 zimbraAuthTokens: 335985202|1430801830053|8.6.0_GA_1153
 zimbraAuthTokens: 345836372|1430800522916|8.6.0_GA_1153
+zimbraAuthTokens: 353782778|1430548279978|8.6.0_GA_1153
 zimbraAuthTokens: 357224412|1430799078650|8.6.0_GA_1153
 zimbraAuthTokens: 359847881|1430806579447|8.6.0_GA_1153
 zimbraAuthTokens: 364380894|1430784519780|8.6.0_GA_1153
@@ -558086,6 +558097,7 @@
 zimbraAuthTokens: 464892423|1430792026225|8.6.0_GA_1153
 zimbraAuthTokens: 465832041|1430796118188|8.6.0_GA_1153
 zimbraAuthTokens: 470587431|1430790706176|8.6.0_GA_1153
+zimbraAuthTokens: 488520030|1430548271921|8.6.0_GA_1153
 zimbraAuthTokens: 493433000|1430789116600|8.6.0_GA_1153
 zimbraAuthTokens: 49927151|1430800597298|8.6.0_GA_1153
 zimbraAuthTokens: 505233009|1430803732532|8.6.0_GA_1153
@@ -558119,6 +558131,7 @@
 zimbraAuthTokens: 624216082|1430790986205|8.6.0_GA_1153
 zimbraAuthTokens: 627453014|1430802588253|8.6.0_GA_1153
 zimbraAuthTokens: 63479916|1430800763552|8.6.0_GA_1153
+zimbraAuthTokens: 636439348|1430548271924|8.6.0_GA_1153
 zimbraAuthTokens: 638569970|1430766646718|8.6.0_GA_1153
 zimbraAuthTokens: 642528535|1430793436206|8.6.0_GA_1153
 zimbraAuthTokens: 653581266|1430803568548|8.6.0_GA_1153
@@ -558131,12 +558144,14 @@
 zimbraAuthTokens: 674880688|1430790542019|8.6.0_GA_1153
 zimbraAuthTokens: 678557445|1430792426265|8.6.0_GA_1153
 zimbraAuthTokens: 688306806|1430789944170|8.6.0_GA_1153
+zimbraAuthTokens: 692909864|1430548275944|8.6.0_GA_1153
 zimbraAuthTokens: 695873279|1430791878331|8.6.0_GA_1153
 zimbraAuthTokens: 702220076|1430791770891|8.6.0_GA_1153
 zimbraAuthTokens: 702905805|1430793051338|8.6.0_GA_1153
 zimbraAuthTokens: 703978454|1430801188894|8.6.0_GA_1153
 zimbraAuthTokens: 704499344|1430797009784|8.6.0_GA_1153
 zimbraAuthTokens: 709844890|1430790284534|8.6.0_GA_1153
+zimbraAuthTokens: 712623687|1430548275943|8.6.0_GA_1153
 zimbraAuthTokens: 714322774|1430801693731|8.6.0_GA_1153
 zimbraAuthTokens: 714539921|1430794312192|8.6.0_GA_1153
 zimbraAuthTokens: 719501426|1430806353190|8.6.0_GA_1153
@@ -906794,14 +906809,20 @@
 uid: something
 userPassword:: a hash
 zimbraAccountStatus: active
+zimbraAuthTokens: 1076665182|1430495517704|8.6.0_GA_1153
+zimbraAuthTokens: 1194289445|1430495472230|8.6.0_GA_1153
+zimbraAuthTokens: 1400191457|1430488755054|8.6.0_GA_1153
 zimbraAuthTokens: 1446436024|1430910444064|8.6.0_GA_1153
 zimbraAuthTokens: 1494232116|1430754183294|8.6.0_GA_1153
 zimbraAuthTokens: 1548373083|1430859654841|8.6.0_GA_1153
+zimbraAuthTokens: 1549007812|1430488727192|8.6.0_GA_1153
 zimbraAuthTokens: 1586033245|1430759173317|8.6.0_GA_1153
 zimbraAuthTokens: 1751858048|1430924809541|8.6.0_GA_1153
+zimbraAuthTokens: 1800356229|1430488677708|8.6.0_GA_1153
 zimbraAuthTokens: 2062630693|1430852313888|8.6.0_GA_1153
 zimbraAuthTokens: 480907313|1430754137307|8.6.0_GA_1153
 zimbraAuthTokens: 718652654|1430852353198|8.6.0_GA_1153
+zimbraAuthTokens: 800231695|1430488640026|8.6.0_GA_1153
 zimbraAuthTokens: 911489049|1430919384737|8.6.0_GA_1153
 zimbraCOSId: 6a0277ec-2ace-4b16-9764-6fe02f8c7f62
 zimbraCreateTimestamp: 20150407152511Z
@@ -2342424,13 +2342445,11 @@
 uid: something
 userPassword:: a hash
 zimbraAccountStatus: active
-zimbraAuthTokens: 1032074923|1430678006084|8.6.0_GA_1153
-zimbraAuthTokens: 1359481243|1430678139287|8.6.0_GA_1153
 zimbraAuthTokens: 1367205599|1430685608493|8.6.0_GA_1153
 zimbraAuthTokens: 146136864|1430743294805|8.6.0_GA_1153
-zimbraAuthTokens: 1583721154|1430678002509|8.6.0_GA_1153
 zimbraAuthTokens: 1938799710|1430831569979|8.6.0_GA_1153
-zimbraAuthTokens: 326869791|1430678012645|8.6.0_GA_1153
+zimbraAuthTokens: 1986204084|1430485163391|8.6.0_GA_1153
+zimbraAuthTokens: 701958474|1430488599819|8.6.0_GA_1153
 zimbraCOSId: 6a0277ec-2ace-4b16-9764-6fe02f8c7f62
 zimbraCreateTimestamp: 20150311184123Z
 zimbraCsrfTokenData: data
@@ -2977160,8 +2977179,8 @@
 uid: something
 userPassword: a hash
 zimbraAccountStatus: active
-zimbraAuthTokens: 1592805688|1430678263487|8.6.0_GA_1153
 zimbraAuthTokens: 1840536155|1430747750113|8.6.0_GA_1153
+zimbraAuthTokens: 297688289|1430677809486|8.6.0_GA_1153
 zimbraAuthTokens: 447463167|1430585978383|8.6.0_GA_1153
 zimbraCOSId: 6a0277ec-2ace-4b16-9764-6fe02f8c7f62
 zimbraCreateTimestamp: 20150407152611Z
@@ -4051413,7 +4051432,6 @@
 uid: something
 userPassword:: a hash
 zimbraAccountStatus: active
-zimbraAuthTokens: 1906714231|1430457251759|8.6.0_GA_1153
 zimbraAuthTokens: 637421329|1430881121421|8.6.0_GA_1153
 zimbraCOSId: 6a0277ec-2ace-4b16-9764-6fe02f8c7f62
 zimbraCreateTimestamp: 20150324191333Z
@@ -5306187,7 +5306205,6 @@
 userPassword:: a hash
 zimbraAccountStatus: active
 zimbraAuthTokens: 1215263276|1430662360810|8.6.0_GA_1153
-zimbraAuthTokens: 2103162606|1430678212226|8.6.0_GA_1153
 zimbraAuthTokens: 31692696|1430712827710|8.6.0_GA_1153
 zimbraAuthTokens: 496395190|1430702599395|8.6.0_GA_1153
 zimbraCOSId: 6a0277ec-2ace-4b16-9764-6fe02f8c7f62
@@ -7689287,6 +7689304,7 @@
 userPassword:: a hash
 zimbraAccountStatus: active
 zimbraAuthTokens: 1361047436|1430788305556|8.6.0_GA_1153
+zimbraAuthTokens: 1379747754|1430499334346|8.6.0_GA_1153
 zimbraAuthTokens: 1584426902|1430875712010|8.6.0_GA_1153
 zimbraAuthTokens: 1927293924|1430875661923|8.6.0_GA_1153
 zimbraAuthTokens: 979685923|1430788353488|8.6.0_GA_1153
@@ -8198997,8 +8199015,10 @@
 uid: something
 userPassword:: a hash
 zimbraAccountStatus: active
+zimbraAuthTokens: 111522850|1430323733525|8.6.0_GA_1153
 zimbraAuthTokens: 1643425923|1430886776671|8.6.0_GA_1153
 zimbraAuthTokens: 1728545135|1430720880858|8.6.0_GA_1153
+zimbraAuthTokens: 757970470|1430493817783|8.6.0_GA_1153
 zimbraCOSId: 6a0277ec-2ace-4b16-9764-6fe02f8c7f62
 zimbraCreateTimestamp: 20150415194756Z
 zimbraId: 5147a109-a13b-4264-985e-c1e4999261da
@@ -9048815,7 +9048835,6 @@
 uid: something
 userPassword:: a hash
 zimbraAccountStatus: active
-zimbraAuthTokens: 820727281|1430672042988|8.6.0_GA_1153
 zimbraCOSId: 6a0277ec-2ace-4b16-9764-6fe02f8c7f62
 zimbraCreateTimestamp: 20150311185757Z
 zimbraCsrfTokenData: 
69643d33363a65643532303936632d636336392d346136362d626137342d3937366538643337616365363b6578703d31333a313432383738303434383930303b7369643d31303a2d3730363739343338353b:f0a19918939e46e2a5a1b69ba91
710de:1428780448900
@@ -12477050,12 +12477043,12 @@
 createTimestamp: 20141208203040Z
 creatorsName: uid=zimbra,cn=admins,cn=zimbra
 displayName: something
-entryCSN: 20150426144745.317031Z#000000#001#000000
+entryCSN: 20150501183105.859472Z#000000#002#000000
 entryUUID: d1d1f11e-1364-1034-8f85-d75f97c57676
 givenName: something
 mail: someone@somewhere.net
 modifiersName: uid=zimbra,cn=admins,cn=zimbra
-modifyTimestamp: 20150426144745Z
+modifyTimestamp: 20150501183105Z
 objectClass: amavisAccount
 objectClass: inetOrgPerson
 objectClass: zimbraAccount

(etc)


--Quanah


-- 
Quanah Gibson-Mount
Platform Architect
Zimbra, Inc
--------------------
Zimbra ::  the leader in open source messaging and collaboration


Comment 2 Quanah Gibson-Mount 2015-05-11 18:31:44 UTC
--On Friday, May 08, 2015 7:56 PM +0000 quanah@zimbra.com wrote:

>
>
> --On May 4, 2015 at 8:00:45 PM +0000 openldap-its@OpenLDAP.org wrote:
>
>
> In addition to the missing entries, multiple differences exist for
> existing  entries as well.  At this point, it seems that replication with
> 2.4.41 is  entirely unsafe.

The existing entry drift seems to be related to the locked-up replication
that occurred because of a bug in OpenSSL.  So the real issue is just the
entries that weren't created, which is currently a known limitation of
syncrepl refresh that needs fixing.

--Quanah



--

Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra ::  the leader in open source messaging and collaboration

Comment 3 Quanah Gibson-Mount 2015-05-11 20:12:46 UTC
--On Monday, May 11, 2015 7:32 PM +0000 quanah@zimbra.com wrote:

> The existing entry drift seems to be related to the locked-up replication
> that occurred because of a bug in OpenSSL.  So the real issue is just the
> entries that weren't created, which is currently a known limitation of
> syncrepl refresh that needs fixing.

Howard suggests that, at least in the delta-syncrepl case, we can modify 
syncprov to use the accesslog as the backing store for its session log 
rather than the in-memory session log used with standard syncrepl.
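
For context, a standard delta-syncrepl provider keeps two separate change
records: the accesslog database used as the replication changelog, and
syncprov's own in-memory session log. The sketch below shows that usual split
in slapd.conf form; suffixes, paths and sizes are placeholders, not this
deployment's actual configuration.

# accesslog database acting as the delta-syncrepl changelog
database mdb
suffix "cn=accesslog"
directory /var/lib/ldap/accesslog
rootdn "cn=accesslog"
index entryCSN,objectClass,reqEnd,reqResult,reqStart eq
overlay syncprov
syncprov-nopresent TRUE
syncprov-reloadhint TRUE

# primary database
database mdb
suffix "dc=example,dc=com"
directory /var/lib/ldap/data
rootdn "cn=admin,dc=example,dc=com"
overlay syncprov
syncprov-checkpoint 100 10
# syncprov's in-memory session log, separate from the accesslog above;
# the suggestion is to back this with the accesslog instead
syncprov-sessionlog 10000
overlay accesslog
logdb cn=accesslog
logops writes
logsuccess TRUE
logpurge 07+00:00 01+00:00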

--Quanah

--

Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra ::  the leader in open source messaging and collaboration

Comment 4 Quanah Gibson-Mount 2015-05-15 18:25:24 UTC
--On Monday, May 11, 2015 7:32 PM +0000 quanah@zimbra.com wrote:

> --On Friday, May 08, 2015 7:56 PM +0000 quanah@zimbra.com wrote:
>
>>
>>
>> --On May 4, 2015 at 8:00:45 PM +0000 openldap-its@OpenLDAP.org wrote:
>>
>>
>> In addition to the missing entries, multiple differences exist for
>> existing  entries as well.  At this point, it seems that replication with
>> 2.4.41 is  entirely unsafe.
>
> The existing entry drift seems to be related to the locked-up replication
> that occurred because of a bug in OpenSSL.  So the real issue is just the
> entries that weren't created, which is currently a known limitation of
> syncrepl refresh that needs fixing.

Unfortunately, I'm now seeing this at a second customer site with 2-node 
MMR, which is running current RE24 (2.4.41).

In this case:

ldap01 and ldap02 were stopped
openldap was upgraded from 2.4.39 to 2.4.41
ldap01's db was slapcat'd
ldap02's db was erased
ldap02's db was recreated from ldap01's slapcat
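
For reference, the slapcat/slapadd reload described in the last three steps is
roughly the following; paths and the suffix are placeholders, not the site's
actual values:

# on ldap01, with slapd stopped
slapcat -F /etc/openldap/slapd.d -b "dc=example,dc=com" -l ldap01.ldif

# on ldap02, with slapd stopped
rm -rf /var/lib/ldap/data/*
slapadd -q -F /etc/openldap/slapd.d -b "dc=example,dc=com" -l ldap01.ldif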

Now they are having drift, and constant refresh troubles:

[zimbra@ldap01-dc01 ~]$ grep REFRESH /var/log/zimbra.log
May 15 08:56:02 ldap01-dc01 slapd[31479]: do_syncrep2: rid=201 delta-sync 
lost sync on (reqStart=20150515065528.000299Z,cn=accesslog), switching to 
REFRESH
May 15 08:58:07 ldap01-dc01 slapd[31479]: do_syncrep2: rid=201 
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 15 09:57:18 ldap01-dc01 slapd[31479]: do_syncrep2: rid=201 delta-sync 
lost sync on (reqStart=20150515075718.000081Z,cn=accesslog), switching to 
REFRESH
May 15 10:46:57 ldap01-dc01 slapd[31479]: do_syncrep2: rid=201 delta-sync 
lost sync on (reqStart=20150515084653.000032Z,cn=accesslog), switching to 
REFRESH
May 15 12:43:18 ldap01-dc01 slapd[31479]: do_syncrep2: rid=201 delta-sync 
lost sync on (reqStart=20150515104318.000049Z,cn=accesslog), switching to 
REFRESH
May 15 16:32:19 ldap01-dc01 slapd[31479]: do_syncrep2: rid=201 delta-sync 
lost sync on (reqStart=20150515143211.000250Z,cn=accesslog), switching to 
REFRESH
May 15 16:36:41 ldap01-dc01 slapd[31479]: do_syncrep2: rid=201 delta-sync 
lost sync on (reqStart=20150515143556.000046Z,cn=accesslog), switching to 
REFRESH
May 15 17:23:10 ldap01-dc01 slapd[31479]: do_syncrep2: rid=201 delta-sync 
lost sync on (reqStart=20150515152310.000002Z,cn=accesslog), switching to 
REFRESH

[zimbra@ldap02-dc01 xx]$ grep FRESH /var/log/zimbra.log
May 15 08:35:46 ldap02-dc01 slapd[18167]: do_syncrep2: rid=201 delta-sync 
lost sync on (reqStart=20150515063539.000009Z,cn=accesslog), switching to 
REFRESH
May 15 08:47:44 ldap02-dc01 slapd[18167]: do_syncrep2: rid=201 delta-sync 
lost sync on (reqStart=20150515064733.000043Z,cn=accesslog), switching to 
REFRESH
May 15 08:56:03 ldap02-dc01 slapd[18167]: do_syncrep2: rid=201 delta-sync 
lost sync on (reqStart=20150515065528.000234Z,cn=accesslog), switching to 
REFRESH
May 15 08:57:24 ldap02-dc01 slapd[18167]: do_syncrep2: rid=201 
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 15 09:31:12 ldap02-dc01 slapd[18167]: do_syncrep2: rid=201 delta-sync 
lost sync on (reqStart=20150515073112.000086Z,cn=accesslog), switching to 
REFRESH
May 15 10:46:58 ldap02-dc01 slapd[18167]: do_syncrep2: rid=201 delta-sync 
lost sync on (reqStart=20150515084653.000003Z,cn=accesslog), switching to 
REFRESH
May 15 13:44:11 ldap02-dc01 slapd[18167]: do_syncrep2: rid=201 delta-sync 
lost sync on (reqStart=20150515114405.000313Z,cn=accesslog), switching to 
REFRESH
May 15 14:54:49 ldap02-dc01 slapd[18167]: do_syncrep2: rid=201 delta-sync 
lost sync on (reqStart=20150515125448.000002Z,cn=accesslog), switching to 
REFRESH
May 15 17:23:11 ldap02-dc01 slapd[18167]: do_syncrep2: rid=201 delta-sync 
lost sync on (reqStart=20150515152311.000035Z,cn=accesslog), switching to 
REFRESH


This is caused by modify ops:

[zimbra@ldap02-dc01 xx]$ grep \(20\) /var/log/zimbra.log
May 15 08:35:46 ldap02-dc01 slapd[18167]: syncrepl_message_to_op: rid=201 
be_modify uid=xxx,ou=people,dc=xxx,dc=com (20)
May 15 08:47:44 ldap02-dc01 slapd[18167]: syncrepl_message_to_op: rid=201 
be_modify uid=xxx,ou=people,dc=yyy,dc=co,dc=za (20)
May 15 08:56:03 ldap02-dc01 slapd[18167]: syncrepl_message_to_op: rid=201 
be_modify uid=xxx,ou=people,dc=zzz,dc=co,dc=za (20)
May 15 09:31:12 ldap02-dc01 slapd[18167]: syncrepl_message_to_op: rid=201 
be_modify uid=xxx,ou=people,dc=www,dc=co,dc=za (20)
May 15 10:46:58 ldap02-dc01 slapd[18167]: syncrepl_message_to_op: rid=201 
be_modify uid=xxx,ou=people,dc=ppp,dc=co,dc=za (20)
May 15 13:44:11 ldap02-dc01 slapd[18167]: syncrepl_message_to_op: rid=201 
be_modify uid=xxx,ou=people,dc=rrr,dc=co,dc=za (20)
May 15 14:54:49 ldap02-dc01 slapd[18167]: syncrepl_message_to_op: rid=201 
be_modify uid=xxx,ou=people,dc=zzz,dc=com (20)
May 15 17:23:11 ldap02-dc01 slapd[18167]: syncrepl_message_to_op: rid=201 
be_modify uid=xxx,ou=people,dc=zzz,dc=com (20)

So overall, there seem to be significant problems with replication in RE24 
at this time.

Clocks are tightly in sync.

--Quanah

--

Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra ::  the leader in open source messaging and collaboration

Comment 5 Quanah Gibson-Mount 2015-05-15 18:33:10 UTC
--On Friday, May 15, 2015 7:25 PM +0000 quanah@zimbra.com wrote:

> Clocks are tightly in sync.

Seeing tons of "greater than snapshot" errors: 108 so far today on one 
host, 144 on the other.

Example of one:

May 15 17:33:19 ldap01-dc01 slapd[31479]: Entry 
uid=xxxx,ou=people,dc=yyyy,dc=co,dc=za CSN 
20150515153317.592677Z#000000#002#000000 greater than snapshot 
20150515153317.320608Z#000000#002#000000

--Quanah

--

Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra ::  the leader in open source messaging and collaboration

Comment 6 Howard Chu 2015-05-15 18:39:26 UTC
quanah@zimbra.com wrote:
> This is caused by modify ops:
>
> [zimbra@ldap02-dc01 xx]$ grep \(20\) /var/log/zimbra.log
> May 15 08:35:46 ldap02-dc01 slapd[18167]: syncrepl_message_to_op: rid=201
> be_modify uid=xxx,ou=people,dc=xxx,dc=com (20)

Multiple TYPE_OR_VALUE_EXISTS errors indicate you're sending a lot of mods 
that were already performed. Sounds more like a serverID or URL misconfig here 
if there are actually only 2 servers talking.

> May 15 08:47:44 ldap02-dc01 slapd[18167]: syncrepl_message_to_op: rid=201
> be_modify uid=xxx,ou=people,dc=yyy,dc=co,dc=za (20)
> May 15 08:56:03 ldap02-dc01 slapd[18167]: syncrepl_message_to_op: rid=201
> be_modify uid=xxx,ou=people,dc=zzz,dc=co,dc=za (20)
> May 15 09:31:12 ldap02-dc01 slapd[18167]: syncrepl_message_to_op: rid=201
> be_modify uid=xxx,ou=people,dc=www,dc=co,dc=za (20)
> May 15 10:46:58 ldap02-dc01 slapd[18167]: syncrepl_message_to_op: rid=201
> be_modify uid=xxx,ou=people,dc=ppp,dc=co,dc=za (20)
> May 15 13:44:11 ldap02-dc01 slapd[18167]: syncrepl_message_to_op: rid=201
> be_modify uid=xxx,ou=people,dc=rrr,dc=co,dc=za (20)
> May 15 14:54:49 ldap02-dc01 slapd[18167]: syncrepl_message_to_op: rid=201
> be_modify uid=xxx,ou=people,dc=zzz,dc=com (20)
> May 15 17:23:11 ldap02-dc01 slapd[18167]: syncrepl_message_to_op: rid=201
> be_modify uid=xxx,ou=people,dc=zzz,dc=com (20)
>
> So overall, there seems to be significant problems with RE24 in relation to
> replication at this time.
>
> Clocks are tightly in sync.
>
> --Quanah
>
> --
>
> Quanah Gibson-Mount
> Platform Architect
> Zimbra, Inc.
> --------------------
> Zimbra ::  the leader in open source messaging and collaboration
>
>
>
>


-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Comment 7 Quanah Gibson-Mount 2015-05-15 18:45:41 UTC
--On Friday, May 15, 2015 8:39 PM +0100 Howard Chu <hyc@symas.com> wrote:

> quanah@zimbra.com wrote:
>> This is caused by modify ops:
>>
>> [zimbra@ldap02-dc01 xx]$ grep \(20\) /var/log/zimbra.log
>> May 15 08:35:46 ldap02-dc01 slapd[18167]: syncrepl_message_to_op: rid=201
>> be_modify uid=xxx,ou=people,dc=xxx,dc=com (20)
>
> Multiple TYPE_OR_VALUE_EXISTS errors indicate you're sending a lot of
> mods that were already performed. Sounds more like a serverID or URL
> misconfig here if there are actually only 2 servers talking.

ServerIDs are unique (2 & 3).  There is no URL portion to the config, as 
we're not replicating cn=config.  In 6 days, the servers have gotten 
totally out of sync with one another, and REFRESH mode isn't fixing it 
either.
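
For reference, the serverID/URL pairing mentioned above is set in slapd.conf
along these lines; host names, DNs and credentials below are placeholders, not
this site's actual config. With multiple providers, each server lists every
serverID together with its URL, and the URL matching its own listener is how
each slapd picks its own SID:

serverID 2 ldap://ldap01-dc01.example.com
serverID 3 ldap://ldap02-dc01.example.com

database mdb
suffix "dc=example,dc=com"
syncrepl rid=201
  provider=ldap://ldap01-dc01.example.com
  type=refreshAndPersist
  retry="60 +"
  searchbase="dc=example,dc=com"
  bindmethod=simple
  binddn="cn=replicator,dc=example,dc=com"
  credentials=secret
  logbase="cn=accesslog"
  logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"
  syncdata=accesslog
mirrormode TRUE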

--Quanah


--

Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra ::  the leader in open source messaging and collaboration

Comment 8 Quanah Gibson-Mount 2015-05-15 18:52:48 UTC
--On Friday, May 15, 2015 7:45 PM +0000 quanah@zimbra.com wrote:

> --On Friday, May 15, 2015 8:39 PM +0100 Howard Chu <hyc@symas.com> wrote:
>
>> quanah@zimbra.com wrote:
>>> This is caused by modify ops:
>>>
>>> [zimbra@ldap02-dc01 xx]$ grep \(20\) /var/log/zimbra.log
>>> May 15 08:35:46 ldap02-dc01 slapd[18167]: syncrepl_message_to_op:
>>> rid=201 be_modify uid=xxx,ou=people,dc=xxx,dc=com (20)
>>
>> Multiple TYPE_OR_VALUE_EXISTS errors indicate you're sending a lot of
>> mods that were already performed. Sounds more like a serverID or URL
>> misconfig here if there are actually only 2 servers talking.
>
> ServerIDs are unique (2 & 3).  There is no URL portion to the config, as
> we're not replicating cn=config.  In 6 days, the servers have gotten
> totally out of sync with one another, and REFRESH mode isn't fixing it
> either.

Note that the drift is occurring on an attribute that is multi-valued:

zimbraPasswordLockoutFailureTime

The values stored in the attribute are only granular down to the second, so 
it's not uncommon for multiple writes of the same value (add/mod/delete) to 
occur for this attribute.
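
To illustrate with a made-up entry and timestamp, a replayed modify that adds a
value the entry already holds is rejected with attributeOrValueExists, which is
the err=20 in the be_modify lines above:

dn: uid=xxx,ou=people,dc=example,dc=com
changetype: modify
add: zimbraPasswordLockoutFailureTime
zimbraPasswordLockoutFailureTime: 20150515063539Z

If that exact value is already present (easy when the values are only granular
to the second), the whole modify fails with err=20, and as the logs above show,
that failed replay is what pushes the consumer out of delta-sync into a full
REFRESH.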

--Quanah

--

Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra ::  the leader in open source messaging and collaboration

Comment 9 Quanah Gibson-Mount 2016-11-01 19:08:09 UTC
--On Friday, May 15, 2015 7:53 PM +0000 quanah@zimbra.com wrote:

Capturing what was discussed about this issue in IRC at the time:

<hyc> too bad. yes, there is a case where a refresh can delete entries that 
shouldn't have been deleted
<hyc> we discussed it here a month or so ago
<hyc> but if all servers were in persist phase with delta, that shouldn't 
happen
<JoBbZ> hm.. should have been out of refresh
<JoBbZ> it did a refresh @ zimbra.log.30:May  1 14:46:02 zm-mmr01 
slapd[43243]: do_syncrep2: rid=101 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> but that was hours before
<hyc> and it was also behind by 1.5 hours
<hyc> ...
<JoBbZ> is there a message logged when refresh mode ends?
<hyc> that's the message.
<hyc> at least, that is the message from the provider that it's done 
sending refresh messages
<hyc> then the consumer runs syncrepl_del_nonpresent based on the provider 
messages
<JoBbZ> zimbra.log.30:May  1 18:37:33 zm-mmr01 slapd[43243]: 
syncrepl_del_nonpresent: rid=102 be_delete 
uid=skoelzer,ou=people,dc=unitedfiber,dc=email (0)
<JoBbZ> to
<JoBbZ> zimbra.log.30:May  1 18:37:35 zm-mmr01 slapd[43243]: 
syncrepl_del_nonpresent: rid=101 be_delete 
uid=fsodell2,ou=people,dc=ntin,dc=net (32)
<JoBbZ> why would refresh mode end @ 14:46, and then 4 hours later, it 
does del_nonpresent?
<hyc> it might have had a lot of refresh messages to parse
<hyc> but 4 hours seems excessive, yeah
<JoBbZ> but it was replicating during that time too
<JoBbZ> May  1 15:00:00 zm-mmr01 slapd[43243]: syncrepl_message_to_op: 
rid=102 be_modify uid=gibbonsj,ou=people,dc=bevcomm,dc=net (0)
<JoBbZ> for example
<JoBbZ> but thousands and thousands
<hyc> that's rid=102
<hyc> you were looking at rid=101
<JoBbZ> May  1 18:42:18 zm-mmr01 slapd[43243]: do_syncrep2: rid=102 
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> May  1 18:42:20 zm-mmr01 slapd[43243]: do_syncrep2: rid=101 
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> they both finished around the same time
<hyc> this is 4way so where's the 3rd rid
<JoBbZ> May  1 14:46:02 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> hm
<JoBbZ> this is interesting..
<JoBbZ> May  1 14:18:45 zm-mmr01 slapd[36133]: do_syncrep2: rid=100 
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> May  1 14:19:59 zm-mmr01 slapd[36133]: do_syncrep2: rid=101 
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> May  1 14:19:59 zm-mmr01 slapd[36133]: do_syncrep2: rid=102 
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> so it did one sync then
<JoBbZ> May  1 14:46:02 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> May  1 14:46:02 zm-mmr01 slapd[43243]: do_syncrep2: rid=102 
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> May  1 14:46:02 zm-mmr01 slapd[43243]: do_syncrep2: rid=101 
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> another sync then
<JoBbZ> a sync here
<JoBbZ> May  1 18:37:34 zm-mmr01 slapd[43243]: do_syncrep2: rid=102 
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> May  1 18:37:35 zm-mmr01 slapd[43243]: do_syncrep2: rid=101 
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> and one at
<JoBbZ> May  1 18:42:18 zm-mmr01 slapd[43243]: do_syncrep2: rid=102 
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> May  1 18:42:20 zm-mmr01 slapd[43243]: do_syncrep2: rid=101 
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> ugh
<hyc> no rid=100 from the last 2?
<JoBbZ> nope
<JoBbZ> let me go look at the rid assignments rq
<hyc> that's suspicious
<JoBbZ> hm.. i bet it is losing connections
<JoBbZ> 101/102 are in nashville
<JoBbZ> 100 is in greenville, along with this server
<JoBbZ> yeah, full fallback triggered at 18:37:04
<JoBbZ> sigh
<JoBbZ> but this was a modrdn and an account add
<JoBbZ> no deletes
<hyc> was the Add the missing entry?
<JoBbZ> yeah
<hyc> yeah, that's the known issue then
<JoBbZ> ok, what about the modrdn?
<hyc> which isn't in an ITS so we should probably use the one you just filed
<hyc> haven't thought about the modrdn but it's probably the same thing.
<JoBbZ> ok
<hyc> in refresh without syncprov sessionlog, the provider has to send the 
consumer a list of entryUUIDs of every entry that exists on the provider
<hyc> the consumer then searches every entry in the consumer, compares to 
the list of entryUUIDs from the provider, and deletes any entry that's not 
in the provider's list
<hyc> the problem is that an entry can be missing from this particular 
provider's list because its update was coming from a different provider
<hyc> so it gets deleted even though it was properly received from the 
other provider
<JoBbZ> ok, makes sense
<hyc> if you have a syncprov sessionlog that's large enough to cover the 
span of disconnected time
<hyc> then the provider sends a list of deleted entryUUIDs instead of 
present UUIDs
<hyc> and this bug can't happen then
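
In configuration terms, the sessionlog mentioned at the end is the syncprov
overlay's in-memory record of recent writes on the provider. A sketch of sizing
it, with an arbitrary number standing in for "large enough to cover the span of
disconnected time":

overlay syncprov
syncprov-checkpoint 100 10
# number of write operations remembered; if a reconnecting consumer's cookie
# still falls inside this window, the provider can send a list of deleted
# entryUUIDs instead of a full present list
syncprov-sessionlog 100000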


--Quanah

--

Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
<http://www.symas.com>


Comment 10 Quanah Gibson-Mount 2017-04-12 16:42:22 UTC
moved from Incoming to Software Bugs
Comment 11 Quanah Gibson-Mount 2017-04-20 16:58:41 UTC
changed notes
Comment 12 Ondřej Kuzník 2018-10-15 11:31:31 UTC
This is my understanding of the above discussion:
- deltasync consumer has just switched to full refresh (but is ahead
  from this provider in some ways)
- provider sends the present list
- consumer deletes extra entries, builds a new cookie
- problem is that the new cookie is built to reflect the union of both
  the local and received cookies even though we may have undone some of
  the changes which we then ignore

If that's accurate, there are some approaches that could fix it:

1. Simple one is to remember the actual cookie we got from the server
   and refuse to delete entries with entryCSN ahead of the provided CSN
   set. Problem is that we get even further from being able to replicate
   from a generic RFC4533 provider.

2. Instead, when present phase is initiated, we might terminate all
   other sessions, adopt the complete CSN set and restart them only once
   the new CSN set has been fully established.

   Also, whenever we fall back from deltasync into plain syncrepl, we
   should make sure that the accesslog entries we generate from this are
   never used for further replication which might be thought to be a
   separate issue. Maybe the ITS#8486 work might be useful for this if
   we have a way of signalling to accesslog to reset minCSN accordingly
   to the new CSN set.

The former is simpler, but the latter feels like the only one that
actually addresses these problems in full.

-- 
Ondřej Kuzník
Senior Software Engineer
Symas Corporation                       http://www.symas.com
Packaged, certified, and supported LDAP solutions powered by OpenLDAP

Comment 13 Howard Chu 2018-10-15 13:53:18 UTC
ondra@mistotebe.net wrote:
> This is my understanding of the above discussion:
> - deltasync consumer has just switched to full refresh (but is ahead
>   from this provider in some ways)
> - provider sends the present list
> - consumer deletes extra entries, builds a new cookie
> - problem is that the new cookie is built to reflect the union of both
>   the local and received cookies even though we may have undone some of
>   the changes which we then ignore
> 
> If that's accurate, there are some approaches that could fix it:
> 
> 1. Simple one is to remember the actual cookie we got from the server
>    and refuse to delete entries with entryCSN ahead of the provided CSN
>    set. Problem is that we get even further from being able to replicate
>    from a generic RFC4533 provider.
> 
> 2. Instead, when present phase is initiated, we might terminate all
>    other sessions, adopt the complete CSN set and restart them only once
>    the new CSN set has been fully established.

(2) makes sense.
> 
>    Also, whenever we fall back from deltasync into plain syncrepl, we
>    should make sure that the accesslog entries we generate from this are
>    never used for further replication which might be thought to be a
>    separate issue.

That should already be the case, since none of these ops will have a valid CSN.

> Maybe the ITS#8486 work might be useful for this if
>    we have a way of signalling to accesslog to reset minCSN accordingly
>    to the new CSN set.
> 
> The former is simpler, but the latter feels like the only one that
> actually addresses these problems in full.
> 


-- 
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/

Comment 14 Ondřej Kuzník 2018-10-15 14:46:37 UTC
On Mon, Oct 15, 2018 at 01:53:30PM +0000, hyc@symas.com wrote:
> ondra@mistotebe.net wrote:
>> This is my understanding of the above discussion:
>> - deltasync consumer has just switched to full refresh (but is ahead
>>   from this provider in some ways)
>> - provider sends the present list
>> - consumer deletes extra entries, builds a new cookie
>> - problem is that the new cookie is built to reflect the union of both
>>   the local and received cookies even though we may have undone some of
>>   the changes which we then ignore
>> 
>> If that's accurate, there are some approaches that could fix it:
>> 
>> 1. Simple one is to remember the actual cookie we got from the server
>>    and refuse to delete entries with entryCSN ahead of the provided CSN
>>    set. Problem is that we get even further from being able to replicate
>>    from a generic RFC4533 provider.
>> 
>> 2. Instead, when present phase is initiated, we might terminate all
>>    other sessions, adopt the complete CSN set and restart them only once
>>    the new CSN set has been fully established.
> 
> (2) makes sense.
>> 
>>    Also, whenever we fall back from deltasync into plain syncrepl, we
>>    should make sure that the accesslog entries we generate from this are
>>    never used for further replication which might be thought to be a
>>    separate issue.
> 
> That should already be the case, since none of these ops will have a valid CSN.

I faintly remember Quanah seeing these accesslog entries used by
consumers at some point, but I might be mistaken.

The more general point is making sure a potential syncrepl consumer does
not even try to use the accesslog entries we added before these - the
refresh has created a strange gap in the middle (or worse, duplicated
ops if a contextCSN element jumped backwards). But if we enforced that,
the question is how to get modifications originating from this replica
replicated elsewhere - unless we decide they can't be salvaged?

And should the contextCSN reset terminate not just all inbound syncrepl
sessions, but the outbound ones as well?

-- 
Ondřej Kuzník
Senior Software Engineer
Symas Corporation                       http://www.symas.com
Packaged, certified, and supported LDAP solutions powered by OpenLDAP

Comment 15 Howard Chu 2018-10-15 14:54:45 UTC
Ondřej Kuzník wrote:
> On Mon, Oct 15, 2018 at 01:53:30PM +0000, hyc@symas.com wrote:
>> ondra@mistotebe.net wrote:
>>>    Also, whenever we fall back from deltasync into plain syncrepl, we
>>>    should make sure that the accesslog entries we generate from this are
>>>    never used for further replication which might be thought to be a
>>>    separate issue.
>>
>> That should already be the case, since none of these ops will have a valid CSN.
> 
> I faintly remember Quanah seeing these accesslog entries used by
> consumers at some point, but I might be mistaken.
> 
> The more general point is making sure its potential syncrepl consumer
> not even try and use the accesslog entries we added before these - the
> refresh has created a strange gap in the middle (or worse, duplicated
> ops if a contextCSN element jumped backwards). But if we enforced that,
> the question is how to get modifications originating from this replica
> replicated elsewhere - unless we decide they can't be salvaged?

We could set the replica to reject user mods while in refresh phase. Not sure
how friendly that is, or whether apps would be smart enough to retry somewhere else.

> And should the contextCSN reset terminate not just all inbound syncrepl
> sessions, but the outbound ones as well?

Need to be careful about race conditions here, or you could end up with all
nodes just terminating each other and everything halting.

-- 
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/

Comment 16 Ondřej Kuzník 2018-10-16 09:33:49 UTC
On Mon, Oct 15, 2018 at 02:54:56PM +0000, hyc@symas.com wrote:
> Ondřej Kuzník wrote:
>> On Mon, Oct 15, 2018 at 01:53:30PM +0000, hyc@symas.com wrote:
>>> ondra@mistotebe.net wrote:
>>>>    Also, whenever we fall back from deltasync into plain syncrepl,
>>>>    we should make sure that the accesslog entries we generate from
>>>>    this are never used for further replication which might be
>>>>    thought to be a separate issue.
>>>
>>> That should already be the case, since none of these ops will have a
>>> valid CSN.
>>
>> I faintly remember Quanah seeing these accesslog entries used by
>> consumers at some point, but I might be mistaken.
>>
>> The more general point is making sure its potential syncrepl consumer
>> not even try and use the accesslog entries we added before these - the
>> refresh has created a strange gap in the middle (or worse, duplicated
>> ops if a contextCSN element jumped backwards). But if we enforced that,
>> the question is how to get modifications originating from this replica
>> replicated elsewhere - unless we decide they can't be salvaged?
> 
> We could set the replica to reject user mods while in refresh phase.
> Not sure how friendly that is, whether apps would be smart enough to
> retry somewhere else.

The concern here is about changes that happened before we found out we
couldn't replicate from another server. And it is likely that some of these
changes are the reason we couldn't reconcile with our provider, and that they
would cause the same problem if we decided to push them.

>> And should the contextCSN reset terminate not just all inbound syncrepl
>> sessions, but the outbound ones as well?
> 
> Need to be careful about race conditions here, or you could end up
> with all nodes just terminating each other and everything halting.

Yes, that would actually happen... The existing state seems quite
destructive though: if you have that same situation now (two masters in
present phase from each other at the same time), you lose data.

The question is what the priority is here. Currently it seems we want
replication to continue at the expense of losing modifications on
conflict. We might at least log that this happened and allow someone to
revert the decision later.

-- 
Ondřej Kuzník
Senior Software Engineer
Symas Corporation                       http://www.symas.com
Packaged, certified, and supported LDAP solutions powered by OpenLDAP

Comment 17 OpenLDAP project 2019-04-04 16:29:12 UTC
See also ITS#6467
Comment 18 Quanah Gibson-Mount 2019-04-04 16:29:12 UTC
changed notes
Comment 19 Ondřej Kuzník 2020-11-06 12:02:52 UTC
(In reply to Ondřej Kuzník from comment #12)
> This is my understanding of the above discussion:
> - deltasync consumer has just switched to full refresh (but is ahead
>   from this provider in some ways)
> - provider sends the present list
> - consumer deletes extra entries, builds a new cookie
> - problem is that the new cookie is built to reflect the union of both
>   the local and received cookies even though we may have undone some of
>   the changes which we then ignore
> 
> If that's accurate, there are some approaches that could fix it:
> 
> 1. Simple one is to remember the actual cookie we got from the server
>    and refuse to delete entries with entryCSN ahead of the provided CSN
>    set. Problem is that we get even further from being able to replicate
>    from a generic RFC4533 provider.

This has actually been done in ITS#9282.

> 2. Instead, when present phase is initiated, we might terminate all
>    other sessions, adopt the complete CSN set and restart them only once
>    the new CSN set has been fully established.
> 
>    Also, whenever we fall back from deltasync into plain syncrepl, we
>    should make sure that the accesslog entries we generate from this are
>    never used for further replication which might be thought to be a
>    separate issue. Maybe the ITS#8486 work might be useful for this if
>    we have a way of signalling to accesslog to reset minCSN accordingly
>    to the new CSN set.
> 
> The former is simpler, but the latter feels like the only one that
> actually addresses these problems in full.

I have some code to do this: terminate only the persist sessions when we detect we are getting into a present refresh.

We still need a way to reproduce this in current master, since a lot of the issues would have been fixed by ITS#9282; it might only be diverging in relayed deltasync now, possibly when refreshing from two other providers at the same time.
Comment 20 Quanah Gibson-Mount 2021-01-14 18:26:11 UTC
Believed fixed by recent MMR work.