
Re: Testing the state of replicates

To: openldap-software@openldap.org
Subject: Re: Testing the state of replicates
From: Buchan Milne <bgmilne@staff.telkomsa.net>
Date: Wed, 5 Mar 2008 20:38:36 +0200
Content-disposition: inline
In-reply-to: <Pine.SOC.4.64.0803041744370.18918@toolbox.rutgers.edu>
References: <5D7F3B4EB73FCB42A46515AC30B02B1C0B4E3F@mailnyc2.nyc.deshaw.com> <5D7F3B4EB73FCB42A46515AC30B02B1C0B4E96@mailnyc2.nyc.deshaw.com> <Pine.SOC.4.64.0803041744370.18918@toolbox.rutgers.edu>
User-agent: KMail/1.9.7

On Wednesday 05 March 2008 00:49:21 Aaron Richton wrote:
> [Gavin says]
>
> > Dig the main source. servers/slapd/syncrepl.c and
> > servers/slapd/overlays/syncprov.c
>
> Hmm, wrong source files. Try libraries/liblutil/csn.c, which sayeth:
>
>   * These routines are (loosly) based upon draft-ietf-ldup-model-03.txt,
>   * A WORK IN PROGRESS.  The format will likely change.
>   *
>   * The format of a CSN string is: yyyymmddhhmmssz#s#r#c
>   * where s is a counter of operations within a timeslice, r is
>   * the replica id (normally zero), and c is a counter of
>   * modifications within this operation.  s, r, and c are
>   * represented in hex and zero padded to lengths of 6, 3, and
>   * 6, respectively. (In previous implementations r was only 2 digits.)
>
>
> We use
> http://www.openldap.org/lists/openldap-software/200602/msg00158.html,
> maybe with a small mod or two (I forget), to check that contextCSN isn't
> wedged.
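
For anyone scripting against contextCSN, the layout quoted above from csn.c can be pulled apart as follows. This is only a minimal sketch: the names parse_csn and CSN are made up for illustration, and accepting an optional fractional-seconds part in the timestamp is an assumption about newer slapd releases rather than something the quoted comment states.

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple

# Layout per the quoted comment: yyyymmddhhmmssz#s#r#c, with s, r, c in hex.
# Assumption: an optional ".ffffff" fraction may precede the Z; r may be
# 2 or 3 hex digits, since older implementations used 2.
CSN_RE = re.compile(
    r"^(\d{14})(?:\.(\d{1,6}))?Z"
    r"#([0-9a-fA-F]{6})#([0-9a-fA-F]{2,3})#([0-9a-fA-F]{6})$"
)


class CSN(NamedTuple):
    timestamp: datetime   # time of the change, UTC
    op_count: int         # s: counter of operations within a timeslice
    replica_id: int       # r: replica id (normally zero)
    mod_count: int        # c: counter of modifications within the operation


def parse_csn(text: str) -> CSN:
    """Split a CSN string into its four fields, or raise ValueError."""
    m = CSN_RE.match(text.strip())
    if m is None:
        raise ValueError(f"not a CSN: {text!r}")
    stamp, frac, s, r, c = m.groups()
    ts = datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    if frac:
        # Right-pad so that ".5" means 500000 microseconds.
        ts = ts.replace(microsecond=int(frac.ljust(6, "0")))
    return CSN(ts, int(s, 16), int(r, 16), int(c, 16))
```

Because the result is a tuple that starts with the timestamp, two parsed values can be compared with the ordinary < and > operators.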

I use: http://staff.telkomsa.net/~bgmilne/hobbit/ . However, I don't have a 
reliable algorithm for the case where the replica is only marginally out of 
sync (e.g. the replica is refreshOnly, one change hasn't replicated yet, and 
the change before it is older than the threshold for "critical replication 
delay"). Since some databases have high rates of change (4 mods/sec average) 
and others don't (1/week average), I get false positives on the more idle 
databases ...
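
One possible way around the idle-database case, offered only as a sketch and not as what the hobbit scripts above do: when the two contextCSN values differ, measure how long the provider's newest change has been outstanding instead of the gap between the two CSNs. The unreplicated changes all lie after the replica's CSN and no later than the provider's, so the age of the provider's CSN is a lower bound on how long the replica has been behind. The names csn_time and replica_state, the three state strings, and the assumption that both servers emit CSNs in the same fixed-width format (so that plain string comparison orders them) are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def csn_time(csn: str) -> datetime:
    """Extract the UTC timestamp at the start of a CSN string."""
    stamp = csn.split("#", 1)[0]                   # e.g. 20080305183836Z
    stamp = stamp.rstrip("Zz").split(".", 1)[0]    # drop Z and any fraction
    return datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def replica_state(provider_csn: str, replica_csn: str,
                  now: datetime, critical: timedelta) -> str:
    """Classify a replica as 'ok', 'pending' or 'critical'.

    Assumption: both CSNs use the same fixed-width format, so string
    comparison agrees with chronological order.
    """
    if replica_csn >= provider_csn:
        return "ok"
    # The newest provider change has been waiting this long. The gap
    # between the two CSNs is deliberately ignored, since on an idle
    # database it mostly measures how rarely the data changes.
    outstanding = now - csn_time(provider_csn)
    return "critical" if outstanding >= critical else "pending"


if __name__ == "__main__":
    now = datetime(2008, 3, 5, 18, 40, 0, tzinfo=timezone.utc)
    print(replica_state("20080305183836Z#000000#000#000000",
                        "20080227100000Z#000000#000#000000",
                        now, timedelta(minutes=5)))
```

The trade-off is that this check stays quiet for up to the threshold after every change, and on a busy database the provider's CSN keeps moving, so a genuinely stuck replica there would still need a second test such as the CSN gap.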

> This only works when the syncrepl thread is completely borked. A 
> better check would be something along the lines of the Net::LDAP ldifdiff
> to make sure that nothing's different.

How often would you want to run such a thing, and how long would it take to 
run? Piping ldapsearch -z0 through grep/wc/awk usually takes a significant 
amount of CPU time here (orders of magnitude more than slapd needs to provide 
the entire data set).

> Of course this has race condition 
> issues (not that we make writes all that often, but on paper at least).

Some of which could be solved by an appropriate search filter?
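
To illustrate the kind of filter meant here, a rough sketch: restrict the comparison on both servers to entries whose entryCSN is no newer than the older of the two contextCSN values, then compare DN to entryCSN maps. The helper names stable_filter and diff_entry_csns are invented for this example, the maps are assumed to come from whatever search tool is in use, and picking the cutoff with min() assumes both CSNs share one fixed-width format.

```python
def stable_filter(provider_context_csn: str, replica_context_csn: str,
                  base_filter: str = "(objectClass=*)") -> str:
    """Build a filter that skips entries newer than the older contextCSN.

    Assumption: the two CSN strings have the same fixed-width format,
    so min() picks the chronologically older one.
    """
    cutoff = min(provider_context_csn, replica_context_csn)
    return f"(&{base_filter}(entryCSN<={cutoff}))"


def diff_entry_csns(provider: dict, replica: dict) -> dict:
    """Compare two {dn: entryCSN} maps taken with the same filter."""
    return {
        # Entries the provider returned but the replica did not.
        "missing_on_replica": sorted(set(provider) - set(replica)),
        # Entries the replica returned but the provider did not.
        "only_on_replica": sorted(set(replica) - set(provider)),
        # Entries present on both sides with different entryCSN values.
        "csn_mismatch": sorted(
            dn for dn in set(provider) & set(replica)
            if provider[dn] != replica[dn]
        ),
    }
```

This only narrows the race. An entry modified on the provider after the cutoff drops out of the provider's result while the replica's older copy still matches, so a hit under only_on_replica is a reason to recheck that entry, not proof of a replication bug.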

> If 
> anybody has something like that as a monitoring plugin, you'd erase one
> line off my perpetual todo list...
>
> (Yes, that would be of great interest to me. ~93% of syncrepl bugs we've
> seen involve very very very slight errors that only result in an entry or
> two being wrong. contextCSN being wrong...we pretty much only see that in
> the field when tcp keepalives fail to indicate the need for a
> reconnection.)

There are other possible causes ...

Regards,
Buchan

References:
- Testing the state of replicates
  - From: "Marantz, Roy" <Roy.Marantz@deshaw.com>
- RE: Testing the state of replicates
  - From: "Marantz, Roy" <Roy.Marantz@deshaw.com>
- RE: Testing the state of replicates
  - From: Aaron Richton <richton@nbcs.rutgers.edu>
