
Re: (ITS#8490) changes not written to accesslog, causing replicas to loop syncing



--On Thursday, September 01, 2016 8:05 AM +0000 quanah@zimbra.com wrote:

> --On Thursday, September 01, 2016 7:52 AM +0000 quanah@openldap.org wrote:
>
>> Full_Name: Quanah Gibson-Mount
>> Version: OpenLDAP 2.4.44
>> OS: Linux 2.6
>> URL: ftp://ftp.openldap.org/incoming/
>> Submission from: (NULL) (75.111.52.177)
>>
>>
>> In a 2-node MMR setup.  Node 1 is getting a lot of write traffic.  Both
>> node 1 and node 2 have 3 replicas each.  At some point, a change is
>> received by node 1, which writes the change to its accesslog DB and its
>> primary DB.  Its 3 replicas are all correctly updated.  MMR node 2
>> receives the change, updates its primary DB, but *fails* to write the
>> change to the accesslog DB.  However, it *does* write the CSN update to
>> the accesslog DB successfully.  This causes all of its replicas to also
>> update their CSN.  Then a change comes in triggering a constraint
>> violation on the replicas, but fully accepted by their master.
>
> So the above summary is incorrect.  While 3 replicas did go out of
> sync...  2 belonged to the primary master (node1), and 1 belonged to the
> secondary  master (node 2).  So really, 4 systems didn't log the change
> (MMR node 2,  ldap05, ldap07, ldap09).

Ok, so that's not correct either.  I now have the correct topology:

ldap01 has the following replicas: ldap02, ldap05, ldap07, ldap09
ldap02 has the following replicas: ldap01, ldap06, ldap08, ldap10

So the replicas of ldap01 received the change and rejected it.  ldap02
simply skipped writing the entry to its accesslog.  As a result, none of
its replicas ever received the change: they never hit the err 19
(constraint violation) failure, but they all now lack this modification
entirely.
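
To make the resulting state easier to follow, here is a rough Python sketch
of the propagation described above.  It is only a model of this incident,
not of how slapd works internally: the function name propagate(), the state
labels, and the "skips_accesslog" and "rejects" parameters are all made up
for illustration.  It assumes a server forwards a change to its consumers
only if it logged that change in its accesslog.

```python
# Topology as listed above: provider -> consumers.
TOPOLOGY = {
    "ldap01": ["ldap02", "ldap05", "ldap07", "ldap09"],
    "ldap02": ["ldap01", "ldap06", "ldap08", "ldap10"],
}


def propagate(origin, skips_accesslog, rejects):
    """Return {server: state} for one change first written at `origin`.

    States are hypothetical labels: "applied", "rejected", "missing".
    skips_accesslog: servers that apply the change but do not log it.
    rejects: servers that receive the change and refuse it (err 19 here).
    """
    state = {origin: "applied"}
    queue = [origin]
    while queue:
        provider = queue.pop(0)
        if provider in skips_accesslog:
            # Nothing was logged, so consumers are never sent the change.
            continue
        for consumer in TOPOLOGY.get(provider, []):
            if consumer in state:
                continue
            if consumer in rejects:
                state[consumer] = "rejected"
            else:
                state[consumer] = "applied"
                queue.append(consumer)
    # Any server that was never reached is missing the change entirely.
    everyone = set(TOPOLOGY) | {c for cs in TOPOLOGY.values() for c in cs}
    for server in everyone:
        state.setdefault(server, "missing")
    return state


result = propagate(
    "ldap01",
    skips_accesslog={"ldap02"},
    rejects={"ldap05", "ldap07", "ldap09"},
)
for server in sorted(result):
    print(server, result[server])
```

With the inputs above, the model puts ldap05, ldap07 and ldap09 in the
"rejected" group and ldap06, ldap08 and ldap10 in the "missing" group, which
matches what I described.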

I would note that every server was loaded today from the same LDAP backup,
so they all started out perfectly in sync.

Looking at the LDAP accesslog, I see that what should have been a modRDN
op was stored in the accesslog as a MOD op (the one I noted before).  This
seems particularly bizarre, because ldap01 should have rejected this
change as well.  It appears we may have a problem where the accesslog DB
is updated, but the change is then rejected by the unique overlay.


--Quanah

--

Quanah Gibson-Mount