Re: (delta-)syncrepl and nagios

On Thursday 09 February 2006 19:57, Samuel Tran wrote:
> On Mon, 2006-02-06 at 14:41 -0500, Aaron Richton wrote:
> > That's been on my todo list for over a year now. (So I'll join in the
> > request for a copy if there is such a script!)
> >
> > If anybody does write this, it's important to note that something that
> > strictly compares contextcsns is likely useless (I think it would just be
> > a false positive disaster). Replication doesn't happen instantly; there
> > should be some sort of configurable threshold for "csns should be within
> > <time>".
> >
> >
> > I've been meaning to ask the list: how many of you check up on your
> > slaves from a consistency perspective? What do you do? (contextcsn is the
> > approach I've wanted to take. Every time I get annoyed enough to write a
> > nagios plugin, I notice that everything is in sync and defer it...)
> I wrote a very generic python script with exhaustive comments/debugging.
> It can be modified to be used as a Nagios script plugin.
> To view a description of the script:
> $ pydoc ldapSynchCheck
> To view the help:
> $ ./ldapSynchCheck.py -h

I guess you didn't look at the Perl extension script for BigBrother/Hobbit
that I posted. It assumes that it can do both of the following anonymously:
1) read sufficient configuration information from cn=config to determine
   all the databases using syncrepl, and the master for each database, on
   any server
2) read the contextCSN for any database on any server
In return, it requires absolutely no configuration. For use with Hobbit, it
just needs to be run on the Hobbit server, and any host in the bb-hosts file
just needs 'ol'. Of course, the Hobbit server needs to be able to access all
the LDAP servers involved.
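
For anyone wanting to do the anonymous contextCSN read from Python rather
than Perl, here is a minimal sketch. It is not the posted script: the
function names, the example URI and the example suffix are all made up for
illustration, and it assumes the stock ldapsearch client is on the PATH. It
only builds the command line and parses LDIF text; it does not handle folded
or base64-encoded ("::") values.

```python
def ldapsearch_argv(uri, suffix):
    """Build an anonymous base-scope search for contextCSN on one suffix.

    -x    simple (here anonymous) bind
    -LLL  plain LDIF output without comments or version line
    """
    return ["ldapsearch", "-x", "-LLL", "-H", uri,
            "-s", "base", "-b", suffix, "contextCSN"]


def parse_context_csns(ldif_text):
    """Return every contextCSN value found in ldapsearch LDIF output."""
    values = []
    for line in ldif_text.splitlines():
        name, sep, value = line.partition(":")
        # Attribute names are case-insensitive in LDAP.
        if sep and name.strip().lower() == "contextcsn":
            values.append(value.strip())
    return values


if __name__ == "__main__":
    # Hypothetical server and suffix, for illustration only.
    print(" ".join(ldapsearch_argv("ldap://ldap1.example.com",
                                   "dc=example,dc=com")))
```

The argv list could be handed to subprocess.run(..., capture_output=True,
text=True) and the resulting stdout passed to parse_context_csns().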

You may want to take a look, so a user of your script doesn't need to provide 
the URIs, but instead can just provide the server to check.


At present, it only goes yellow (not red), since there's no real way to
determine whether the server being 3 months behind (i.e. you catch the
30-second period it takes to replicate the first change to one database in
3 months) is severe enough for an error. But it does show how far ahead
(which could indicate checkpointing/recovery problems on the master) or
behind the slave is (so you don't have to compare contextCSNs in your head).
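
To make the ahead/behind comparison concrete, here is a rough Python sketch
of the idea, not the logic of the posted Perl script. It assumes the
contextCSN starts with a UTC timestamp (YYYYmmddHHMMSS, optionally with
fractional seconds, followed by "Z") before the first "#". The function
names and the 60-second default threshold are hypothetical. Like the
behaviour described above, it never goes past a warning.

```python
from datetime import datetime, timezone

OK, WARNING = 0, 1  # Nagios-style exit codes; this sketch never returns 2


def csn_time(csn):
    """Extract the leading timestamp of a CSN as an aware UTC datetime."""
    stamp = csn.split("#", 1)[0].rstrip("Z")
    fmt = "%Y%m%d%H%M%S.%f" if "." in stamp else "%Y%m%d%H%M%S"
    return datetime.strptime(stamp, fmt).replace(tzinfo=timezone.utc)


def sync_status(master_csn, slave_csn, warn_seconds=60):
    """Compare two contextCSNs and return (status, message).

    A positive lag means the slave is behind the master; a negative lag
    means the slave is ahead, which is always reported as a warning.
    """
    lag = (csn_time(master_csn) - csn_time(slave_csn)).total_seconds()
    if lag < 0:
        return WARNING, "slave is %.0f seconds ahead of master" % -lag
    if lag > warn_seconds:
        return WARNING, "slave is %.0f seconds behind master" % lag
    return OK, "slave is %.0f seconds behind master" % lag


if __name__ == "__main__":
    print(sync_status("20060209195700Z#000001#00#000000",
                      "20060209195630Z#000001#00#000000"))
```

The threshold only decides when "behind" becomes a warning; it cannot tell
a slow slave from one that is mid-way through replicating a rare change,
which is the reason for stopping at yellow.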

I could take a look at making it work for nagios, but we're phasing nagios 
out, and the only LDAP servers monitored for anything by nagios don't use 


Buchan Milne
ISP Systems Specialist
