5661 – contextCSN gets corrupted on the stand by mirror

Status:	VERIFIED FIXED

Alias:	None

Product:	OpenLDAP
Classification:	Unclassified
Component:	documentation
Version:	2.4.11
Hardware:	All All

Importance:	--- normal
Target Milestone:	---
Assignee:	OpenLDAP project

URL:
Keywords:

Depends on:
Blocks:

Reported:	2008-08-19 09:48 UTC by ali.pouya@free.fr
Modified:	2014-08-01 21:04 UTC
CC List:	0 users

See Also:

Attachments
conf.tar.gz (1.28 KB, application/x-gzip) 2008-08-19 12:54 UTC, Gavin Henry

Description ali.pouya@free.fr 2008-08-19 09:48:05 UTC

Full_Name: Ali Pouya
Version: 2.4.11
OS: Linux 2.6
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (145.242.11.4)


I think there is a documentation issue in OpenLDAP 2.4.11:
Chapter 17.4.4 of the Admin Guide recommends configuring TWO syncrepl
directives for each mirror side. If I do so, the contextCSN of the standby
mirror gets corrupted very easily. But if I configure the mirrors with only ONE
syncrepl directive, it works fine.

The test environment:
I have a test directory with two mirrors A (sid=1) and B (sid=2) configured as
recommended in the Admin Guide, and a replica C connected to A.
The directory contains 10 million objects, and I use server A to write
500 000 new ones.

Very often, and without any apparent reason, the contextCSN in the memory of B
suddenly gets corrupted while those of A and C are OK.
In this situation the contextCSN of B gets stuck, but B continues to receive data
from A.

The contextCSN values (the second one base64-encoded) are:

contextCSN: 20080727021429.070493Z#000000#000#000000
contextCSN:: +HYDCTA4MDIwMzM3MTguMzAwMTExWiMwMDAwMDAjMDAxIzAwMDAwMA==

I note that only the part indicating the year (2008) is garbled. Maybe this
part is handled differently?
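
The double colon in the second `contextCSN::` line is how LDIF marks a base64-encoded value, which is used when a value contains bytes that are not safe to print. A minimal Python sketch that flags such values in LDIF output follows. It assumes the CSN layout visible in the well-formed first value; the function name and the regular expression are illustrative and not part of OpenLDAP.

```python
import base64
import re

# Assumed CSN layout, taken from the well-formed value in the report:
# YYYYmmddHHMMSS.uuuuuuZ#cccccc#sss#mmmmmm (the three counters in hex).
CSN_RE = re.compile(rb"\d{14}\.\d{6}Z#[0-9a-fA-F]{6}#[0-9a-fA-F]{3}#[0-9a-fA-F]{6}")

def find_bad_context_csns(ldif_text):
    """Return the raw bytes of every contextCSN value that is not a well-formed CSN."""
    bad = []
    for line in ldif_text.splitlines():
        if line.startswith("contextCSN:: "):      # base64-encoded value
            raw = base64.b64decode(line.split(":: ", 1)[1])
        elif line.startswith("contextCSN: "):     # plain value
            raw = line.split(": ", 1)[1].encode("utf-8")
        else:
            continue
        if not CSN_RE.fullmatch(raw):
            bad.append(raw)
    return bad

sample = """contextCSN: 20080727021429.070493Z#000000#000#000000
contextCSN:: +HYDCTA4MDIwMzM3MTguMzAwMTExWiMwMDAwMDAjMDAxIzAwMDAwMA==
"""
for raw in find_bad_context_csns(sample):
    print(raw)
```

Run against the two values above, only the base64-encoded one is reported.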

At service shutdown B writes the corrupt contextCSN to disk.
At service startup B reads the corrupt contextCSN from disk and begins to
scan the ENTIRE database.

It also sends a sync request to A (a persistent search containing the corrupt
contextCSN in the control field), causing A to scan the WHOLE database.
The replica C remains safe.

If I reverse the roles of A and B, the corruption occurs on A (always on the
standby mirror).

I already encountered the contextCSN corruption problem in OpenLDAP 2.3, and
this was one of my reasons for migrating to 2.4.11.

Thanks for your HELP
Best Regards
Ali Pouya

Comment 1 Gavin Henry 2008-08-19 10:48:51 UTC

> I think there is a documentation issue in OpenLDAP 2.4.11:
> Chapter 17.4.4 of the Admin Guide recommends configuring TWO syncrepl
> directives for each mirror side. If I do so, the contextCSN of the standby
> mirror gets corrupted very easily. But if I configure the mirrors with only ONE
> syncrepl directive, it works fine.

The documentation is correct.
 
> The test environment:
> I have a test directory with two mirrors A (sid=1) and B (sid=2) configured as
> recommended in the Admin Guide, and a replica C connected to A.
> The directory contains 10 million objects, and I use server A to write
> 500 000 new ones.
>
> Very often, and without any apparent reason, the contextCSN in the memory of B
> suddenly gets corrupted while those of A and C are OK.
> In this situation the contextCSN of B gets stuck, but B continues to receive data
> from A.
> 
> The value of contextCSN in base 64 is  :
> 
> contextCSN: 20080727021429.070493Z#000000#000#000000
> contextCSN:: +HYDCTA4MDIwMzM3MTguMzAwMTExWiMwMDAwMDAjMDAxIzAwMDAwMA==

perl -MMIME::Base64 -e 'print decode_base64("+HYDCTA4MDIwMzM3MTguMzAwMTExWiMwMDAwMDAjMDAxIzAwMDAwMA=="), "\n";'

does look very funny :-(

Can we get your BDB version, your config, and the logs of an empty mirrormode
node B pulling in the data loaded into mirrormode node A (posted or hosted online somewhere)?

Also, has this always happened on the same machine? What are the specs of the servers?

Is this a fresh install?

-- 
Kind Regards,

Gavin Henry.

T +44 (0) 1224 279484
M +44 (0) 7930 323266
F +44 (0) 1224 824887
E ghenry@suretecsystems.com

Open Source. Open Solutions(tm).

http://www.suretecsystems.com/

Comment 2 ali.pouya@free.fr 2008-08-19 12:44:50 UTC


----- Forwarded message from ali.pouya@free.fr -----
    Date: Tue, 19 Aug 2008 14:48:53 +0200
    From: ali.pouya@free.fr
Reply-To: ali.pouya@free.fr
 Subject: Re: (ITS#5661) contextCSN gets corrupted on the stand by mirror
      To: ghenry@OpenLDAP.org

Hi Gavin,

Below are the answers to your questions:

> Can we get your bdb version, your config and the logs of an empty mirrormode
> node B pulling in the data loaded in mirrormode A (posted/hosted online
> somewhere).

The BDB version is 4.6.21.
Attached is the file conf.tar.gz containing the configuration of B.
The file syncrepl.conf.simple works well, but the file syncrepl.conf.double
garbles the contextCSN (I write more than 1000 entries per minute).
Do you want a log for the 10 million entries? At which loglevel?
The problem only happens if there are write operations on A, not if server A
is idle.

> Also, has this always happened on the same machine? What are the specs of the
> servers?

The problem happens on the standby server: if I write on B, the contextCSN of
A gets corrupted (I have already tested this).

My servers are quad-processor Xeon 2.2 GHz machines.
I think this is not related to the hardware; rather, the "year" part of the
contextCSN may not be well protected against concurrent operations (?).

> Is this a fresh install?

Yes for 2.4.11, but I have been using OpenLDAP for 5 years on various projects.

Best Regards
Ali

----- End of forwarded message -----

Comment 3 Gavin Henry 2008-08-19 12:54:37 UTC

For Ticket records. Please keep to openldap-its

-- 
Kind Regards,

Gavin Henry.
OpenLDAP Engineering Team.

E ghenry@OpenLDAP.org

Community developed LDAP software.

http://www.openldap.org/project/

Comment 4 Gavin Henry 2008-08-19 13:21:16 UTC

----- "ali pouya" <ali.pouya@free.fr> wrote:

> Hi Gavin;
>
> Below you find the answers to your questions :
>
> > Can we get your bdb version, your config and the logs of an empty mirrormode
> > node B pulling in the data loaded in mirrormode A (posted/hosted online
> > somewhere).
>
> The BDB version is 4.6.21.
> You find here attached the file conf.tar.gz containing the configuration of B.

Thanks.

> The file syncrepl.conf.simple works well, but the file syncrepl.conf.double
> garbles the contextCSN (I write more than 1000 entries per minute).
> Do you want a log for the 10 million entries ? Which loglevel ?

Nope, not yet. When we do need one, use loglevel sync.

> The problem only happens if there are write operations on A, not if the
> server A is stationary.

Also note that serverID is a *global* directive not per database. Move 
that out of "database bdb".
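
As a rough illustration of that point, the following Python sketch scans slapd.conf text and reports any serverID directive that appears after the first database directive. It is only a sketch under stated assumptions: `misplaced_server_ids` is a hypothetical helper, the sample suffix is invented, and the parsing is deliberately simplified (lines starting with whitespace are treated as continuations of the previous line).

```python
def misplaced_server_ids(conf_text):
    """Return 1-based line numbers of serverID directives found after the first database directive."""
    seen_database = False
    hits = []
    for number, line in enumerate(conf_text.splitlines(), start=1):
        # Skip blanks, comments, and continuation lines (which start with whitespace).
        if not line.strip() or line[0].isspace() or line.startswith("#"):
            continue
        keyword = line.split()[0].lower()  # compared case-insensitively in this sketch
        if keyword == "database":
            seen_database = True
        elif keyword == "serverid" and seen_database:
            hits.append(number)
    return hits

# Hypothetical configuration with serverID wrongly placed inside the database section.
conf = """\
database bdb
suffix "dc=example,dc=com"
serverID 2
"""
print(misplaced_server_ids(conf))
```

For the sample above the helper reports line 3; moving the serverID line above "database bdb" yields an empty list.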

> > Also, has this always happened on the same machine? What are the specs of
> > the servers?
>
> The problem happens on the stand by server : If I write on B the contextCSN of
> A gets corrupted (I have already tested this).
>
> My servers are quadri-processor Xeon 2.2 GHz.
> I think this is not related to the hardware but the "year" part of contextCSN is
> not well protected against concurrent operations (?).
>
> > Is this a fresh install?
> Yes for 2.4.11, but I use OpenLdap since 5 years for my different projects.

OK, then you should know that
"rootdn		cn=admin,ou=ressources-dgi,ou=mefi,o=gouv,c=fr"

bypasses all ACLs, so you don't need:

access to *
    by dn.base="cn=admin,ou=ressources-dgi,ou=mefi,o=gouv,c=fr" write

-- 
Kind Regards,

Gavin Henry.

Comment 5 ando@openldap.org 2008-08-21 21:08:30 UTC

ali.pouya@free.fr wrote:
> Full_Name: Ali Pouya
> Version: 2.4.11
> OS: Linux 2.6
> URL: ftp://ftp.openldap.org/incoming/
> Submission from: (NULL) (145.242.11.4)
> 
> 
> I think there is a documentation issue in OpenLDAP 2.4.11:
> Chapter 17.4.4 of the Admin Guide recommends configuring TWO syncrepl
> directives for each mirror side. If I do so, the contextCSN of the standby
> mirror gets corrupted very easily. But if I configure the mirrors with only ONE
> syncrepl directive, it works fine.
> 
> The test environment :
> I have a test directory with two mirrors A (sid=1) and B (sid=2) configured as
> recommended in the Admin's Guide, and a replica C connected to A.
> The directory contains 10 million objects, and I use the server A for writing
> 500 000 new ones. 
> 
> Very often and without any apparent reason the contextCSN in the memory of B
> gets suddenly corrupted while those of A and C are OK.
> In this situation the contextCSN of B gets stuck but B continues to receive data
> from A.
> 
> The value of contextCSN in base 64 is  :
> 
> contextCSN: 20080727021429.070493Z#000000#000#000000
> contextCSN:: +HYDCTA4MDIwMzM3MTguMzAwMTExWiMwMDAwMDAjMDAxIzAwMDAwMA==

which looks like

4 bytes of garbage + "0802033718.300111Z#000000#001#000000"

I note that, according to the sid values you assigned to servers A and 
B, the first contextCSN should not appear, since it has sid == 0, while 
the second one, apart from the corruption, is plausible (as you're 
writing to server A, with sid == 1).
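
The split can be reproduced with a short Python sketch that decodes the value and separates the non-printable prefix from the printable tail. The helper name `split_garbage` is made up for this illustration.

```python
import base64

def split_garbage(b64_value):
    """Split a decoded value into (non-printable prefix, printable ASCII tail)."""
    raw = base64.b64decode(b64_value)
    start = len(raw)
    # Walk backwards from the end while the bytes are printable ASCII (0x20..0x7e).
    while start > 0 and 0x20 <= raw[start - 1] <= 0x7E:
        start -= 1
    return raw[:start], raw[start:].decode("ascii")

prefix, tail = split_garbage("+HYDCTA4MDIwMzM3MTguMzAwMTExWiMwMDAwMDAjMDAxIzAwMDAwMA==")
print(prefix.hex(" "), "+", tail)
# prints: f8 76 03 09 + 0802033718.300111Z#000000#001#000000
```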

> I note that only the part indicating the year (2008) is garbled. Maybe this
> part is handled differently?

No.

> At service shutdown B writes the corrupt contextCSN to disk.
> At service startup B reads the corrupt contextCSN from disk and begins to
> scan the ENTIRE database.
> 
> It also sends a sync request to A (a persistent search containing the corrupt
> contextCSN in the control field), causing A to scan the WHOLE database.
> The replica C remains safe.

The fact that the two servers scan the whole database is a side effect 
of the incorrect contextCSN; I wouldn't worry about it, since it will go 
away once the corruption is tracked down and fixed.

> If I reverse the roles of A and B the corruption occurs on A (always on the
> stand by mirror).
> 
> I have already encountered the contextCSN corruption problem in OpenLdap 2.3 and
> this was one of my reasons to migrate to 2.4.11.

p.


Ing. Pierangelo Masarati
OpenLDAP Core Team

SysNet s.r.l.
via Dossi, 8 - 27100 Pavia - ITALIA
http://www.sys-net.it
-----------------------------------
Office:  +39 02 23998309
Mobile:  +39 333 4963172
Fax:     +39 0382 476497
Email:   ando@sys-net.it
-----------------------------------

Comment 6 ali.pouya@free.fr 2008-08-21 22:25:50 UTC

Hi Pierangelo,
>> contextCSN: 20080727021429.070493Z#000000#000#000000
>> contextCSN:: +HYDCTA4MDIwMzM3MTguMzAwMTExWiMwMDAwMDAjMDAxIzAwMDAwMA==
>
> which looks like
>
> 4 bytes of garbage + "0802033718.300111Z#000000#001#000000"
>
Yes, but I would like to add a clarification:
in vi the 4 bytes are displayed as only 2 characters. In fact, each time 
the problem occurs I repair my database using a BDB C program which reads 
the first key from id2entry.bdb and writes it to disk.
Then I use vi to fix the contextCSN before writing the key back to the 
database.
In vi I do not delete any characters. I only replace them with 20, then 
I fix the rest of the fields.

Another clarification: when the first two characters get corrupted, the rest of 
the contextCSN gets stuck and no longer follows write operations.

> I note that, according to the sid values you assigned to servers A and 
> B, the first contextCSN should not appear, since it has sid == 0, 
> while the second one, apart from the corruption, is plausible (as 
> you're writing to server A, with sid == 1).
>
Yes.
The contextCSN with sid=0 is there because I initially set up 
my directory without a SID (it defaults to 0), and then set two different SIDs 
for A and B.


Best Regards
Ali

Comment 7 Gavin Henry 2008-08-22 10:57:48 UTC

> The fact that the two servers scan the whole database is a side effect
> 
> of the incorrect contextCSN; I wouldn't bother, as soon as the 
> corruption gets tracked and fixed.

Is there anything that should be updated for the MirrorMode docs here?

-- 
Kind Regards,

Gavin Henry.

Comment 8 ando@openldap.org 2008-08-29 14:58:00 UTC

Ali Pouya wrote:
> Hi Pierangelo,
>>> contextCSN: 20080727021429.070493Z#000000#000#000000
>>> contextCSN:: +HYDCTA4MDIwMzM3MTguMzAwMTExWiMwMDAwMDAjMDAxIzAwMDAwMA==
>>
>> which looks like
>>
>> 4 bytes of garbage + "0802033718.300111Z#000000#001#000000"
>>
> Yes, but I would like to bring a precision :
> under VI the 4 bytes are handled as 2 characters only.

That's probably because vi incorrectly interprets the garbage as a multi-byte 
encoding.  The value is supposed to be a string restricted to the characters 
allowed by generalized time, so you shouldn't rely on vi's guesses about 
its actual, erroneous content.

> In fact, each time 
> the problem occurs I repair my database using a BDB C program which reads 
> the first key from id2entry.bdb and writes it to disk.
> Then I use vi to fix the contextCSN before writing the key back to the 
> database.
> In vi I do not delete any characters. I only replace them with 20, then 
> I fix the rest of the fields.

Then you'd get year 20 AD!  The 08 you see in your broken entryCSN is 
the month, not the last two digits of the year.
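
To make the field boundaries concrete, here is a small Python sketch that parses a CSN into its parts. It assumes the layout `YYYYmmddHHMMSS.uuuuuuZ#count#sid#mod` seen in the well-formed value quoted earlier, and it assumes the four garbage bytes overwrote the year `2008` (inferred from the date of the report, not confirmed in this thread). `parse_csn` is a hypothetical helper.

```python
from datetime import datetime

def parse_csn(csn):
    """Split a CSN string into (timestamp, count, sid, mod)."""
    stamp, count, sid, mod = csn.split("#")
    when = datetime.strptime(stamp, "%Y%m%d%H%M%S.%fZ")
    # The three counters are hexadecimal strings inside the CSN.
    return when, int(count, 16), int(sid, 16), int(mod, 16)

surviving_tail = "0802033718.300111Z#000000#001#000000"
# Assumption: the four overwritten bytes were the year "2008".
when, count, sid, mod = parse_csn("2008" + surviving_tail)
print(when.isoformat(), "sid =", sid)
```

With a four-digit year restored, the leading `08` of the surviving tail lands in the month field, so a repair has to put back four characters, not two.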

> Another precision : when the first two chars take corrupted, the rest of 
> the contextCSN gets stuck and does not follow write operations.
> 
>> I note that, according to the sid values you assigned to servers A and 
>> B, the first contextCSN should not appear, since it has sid == 0, 
>> while the second one, apart from the corruption, is plausible (as 
>> you're writing to server A, with sid == 1).
>>
> Yes.
> The contextCSN with sid=0 is there because I initially set up 
> my directory without a SID (it defaults to 0), and then set two different SIDs 
> for A and B.

Can you try a fresh reload of the database(s) stripping out the entryCSN 
and letting slapadd generate them, using the -S <SID> switch (along with 
the -w switch), in order to enforce a SID of 001 (or 002, as you like)?
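
One way to do the stripping step is sketched below in Python. It drops every entryCSN attribute from an LDIF export, including folded continuation lines (which start with a single space), so that slapadd can generate fresh values on reload. The helper name and the sample entry are invented for this illustration.

```python
def strip_entry_csn(ldif_lines):
    """Yield LDIF lines with every entryCSN attribute (and its folded continuations) removed."""
    skipping = False
    for line in ldif_lines:
        if skipping and line.startswith(" "):
            continue  # continuation of a folded entryCSN value
        skipping = line.lower().startswith("entrycsn:")
        if not skipping:
            yield line

# Hypothetical entry used only to exercise the helper.
ldif = [
    "dn: cn=test,dc=example,dc=com",
    "cn: test",
    "entryCSN: 20080727021429.070493Z#000000#000#000000",
    "objectClass: person",
]
print("\n".join(strip_entry_csn(ldif)))
```

The filtered file would then be loaded with something like `slapadd -S 2 -w -l stripped.ldif`, where `stripped.ldif` is a placeholder file name.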

p.

Comment 9 ali.pouya@free.fr 2008-09-02 15:33:06 UTC

Pierangelo Masarati wrote:

> Can you try a fresh reload of the database(s) stripping out the entryCSN
> and letting slapadd generate them, using the -S <SID> switch (along with
> the -w switch), in order to enforce a SID of 001 (or 002, as you like)?


Hi Pierangelo,

I made a new directory with only one contextCSN (SID=002) as you recommended,
and reproduced the contextCSN corruption problem several times.

Example1 :
contextCSN:: 0L0NojA5MDIxMjU5NDkuNzMwMjg1WiMwMDAwMDAjMDAyIzAwMDAwMA==

The four corrupted bytes at the beginning are : D0 BD 0D A2 (hex)

Example2 :
contextCSN:: 4I54oTA5MDIxNTE5MTYuMjYzNDIxWiMwMDAwMDAjMDAyIzAwMDAwMA==

The four corrupted bytes at the beginning are : E0 8E 78 A1 (hex)


I want to stress that the problem happens ONLY if I use TWO syncrepl
directives as recommended in the Admin Guide.
If I use only ONE syncrepl directive, I cannot reproduce the problem and the
mirrors synchronize correctly (whichever mirror side I use for writing).
Also, the problem happens on the standby mirror only when there are write
operations on the active mirror (> 1000 writes per minute).

I do not understand the point of using TWO syncrepl directives for
mirrormode.

Thanks for your help
Best Regards
Ali

Comment 10 ando@openldap.org 2008-09-02 19:26:37 UTC

ali.pouya@free.fr wrote:

> I made a new directory with only one contextCSN (SID=002) as you recommended,
> and reproduced the contextCSN corruption problem several times.
> 
> Example1 :
> contextCSN:: 0L0NojA5MDIxMjU5NDkuNzMwMjg1WiMwMDAwMDAjMDAyIzAwMDAwMA==
> 
> The four corrupted bytes at the beginning are : D0 BD 0D A2 (hex)
> 
> Example2 :
> contextCSN:: 4I54oTA5MDIxNTE5MTYuMjYzNDIxWiMwMDAwMDAjMDAyIzAwMDAwMA==
> 
> The four corrupted bytes at the beginning are : E0 8E 78 A1 (hex)
> 
> 
> I insist on the fact that the problem happens ONLY if I use TWO syncrepl
> directives as recommended in the Admin Guide.
> If I use only ONE syncrepl directive, I don't reproduce the problem and the
> mirrors get synchronized correctly (whichever mirror side I use for writing).
> Also the problem happens on the standby mirror only when there are write
> operations on the active mirror (> 1000 writes per minute).
> 
> I do not understand the interest of using TWO syncrepl directives for
> mirrormode.

Well, going back to your initial posting, I think you are somewhat 
correct.  Rather than merely not seeing the point of having two syncrepl 
statements (of which only one is supposed to be active), I see it as an 
inconsistent and potentially dangerous configuration.  In fact, the only 
advantage of having two syncrepl statements is being able to 
share the same configuration between two symmetric servers (mirror mode, 
multimaster, ...), using the serverID directive to determine which is the 
"right" one.  But in that case, you'd need to have multiple serverID 
directives as well, with the URI field set.  I set up a test system with 
your configuration and loaded it very heavily, while running the server 
that's supposed to screw up under valgrind.  I haven't seen any issue 
yet, though.

p.

Comment 11 ali.pouya@free.fr 2008-09-05 12:22:59 UTC

Pierangelo Masarati wrote :

> Well, going back to your initial posting, I think you are somehow 
> correct. ...

So I will use the simple configuration (only one syncrepl directive) for my production site. 

.....

> I set up a test system with 
> your configuration, and loaded it very heavily, while running the server 
> that's supposed to screw up under valgrind.  I haven't seen any issue 
> yet, though.

Neither have I: when I run slapd under valgrind I cannot reproduce the problem!
Likewise, if I run slapd with detailed logging I cannot reproduce it.
Both cases slow slapd down!

Couldn't this be a problem of concurrent access to the contextCSN memory area?

I will be on vacation for two weeks.
Thanks for your help.
Best Regards
Ali

Comment 12 ando@openldap.org 2008-09-29 17:36:04 UTC

changed notes
changed state Open to Suspended

Comment 13 ando@openldap.org 2008-10-11 12:54:03 UTC

moved from Incoming to Documentation

Comment 14 Carl Johnstone 2009-01-27 15:58:29 UTC

I'm seeing the same thing on a 3-way multi-master setup here. Two servers 
(#001 & #002) are currently sat next to each other. The third (#003) is at a 
remote location. We're currently doing all amends through #001, although in 
the long term we'll be doing amends through all the servers.

When I checked them yesterday I spotted that the remote server had a corrupt 
contextCSN for server #001. I dropped the DB and synced both the config and 
data from the main server again overnight. On checking again today the 
contextCSN is once again corrupt.

In my case it's the first 8 bytes rather than the first 4.

I'm running 2.4.13 with bdb 4.7.25 (first 3 patches applied).

build:

./configure --enable-dynamic --enable-crypt --enable-modules=yes --enable-backends=mod \
    --enable-overlays=mod --enable-sql=no --enable-ndb=no


Carl

Comment 15 Quanah Gibson-Mount 2009-03-05 20:58:07 UTC

changed notes
changed state Suspended to Closed

Comment 16 OpenLDAP project 2014-08-01 21:04:08 UTC

corrupt CSNs fixed, can't help bad configs