Re: Testing for replication failures

Today at 4:27pm, Buchan Milne wrote:

> adp wrote:
> | While the list is discussing replication, I'd like to bring up the issue
> | of determining when replication has failed.
> |
> | Currently, I can only see there being one case I can monitor for: A master
> | and slave get out of sync and so the master begins producing an error log
> | of what entries it can't replicate (for example, if the master sends an
> | 'add' but the slave already has that entry). I monitor this by examining if
> | the error log file mtime has changed, and if so, emailing an error.
> |
> | Recently, however, I found that a mistake had been made and port 389/tcp
> | on the slave had been firewalled. So the replog was growing and no
> | replication was taking place. Is it possible for me to detect this type of
> | error? I'd like to see a timeout error from slurpd in syslog like "Unable
> | to replicate for X hours." Alternatively, is there a file I can monitor
> | that would indicate something is wrong?
> |
> | (Yes, monitoring that we can connect to port 389/tcp would solve *this*
> | problem, but I'm more concerned with the general case.)
> |
> | Basically, I want to be able to easily answer at all times the question
> | "Is replication up and working properly?"
> |
> Look in slurpd.status

Specifically, compare the timestamps you find there with the newest
timestamp listed in the slurpd.replica file in the same directory.
That's what I'm doing (every 10 minutes).
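
For anyone who wants to script the comparison described above, here is a
minimal Python sketch. It is not the actual check I run, and it rests on
assumptions you should verify against your own files: that each line of
slurpd.status looks like "host:port:timestamp:sequence", and that the log
file holds one "time: <seconds>" line per pending change. The function
names and the 600-second threshold are made up for illustration.

```python
import re


def parse_status(text):
    """Map "host:port" to the last replicated timestamp.

    Assumes each status line looks like host:port:timestamp:sequence.
    Lines that do not fit that shape are skipped.
    """
    result = {}
    for line in text.splitlines():
        parts = line.strip().split(":")
        if len(parts) < 3:
            continue
        try:
            result[parts[0] + ":" + parts[1]] = int(parts[2])
        except ValueError:
            continue
    return result


def newest_log_time(text):
    """Return the largest "time:" value in the log text, or None if empty.

    Assumes each pending change carries a line such as "time: 1100000000".
    """
    times = [int(t) for t in re.findall(r"^time:\s*(\d+)", text, flags=re.MULTILINE)]
    return max(times) if times else None


def lagging_replicas(status_text, log_text, max_lag_seconds=600):
    """List replicas whose status timestamp trails the newest logged change."""
    newest = newest_log_time(log_text)
    if newest is None:
        return []  # nothing pending, so nothing can be behind
    status = parse_status(status_text)
    return sorted(r for r, t in status.items() if newest - t > max_lag_seconds)
```

Read the two files from the slurpd working directory, pass their contents
to lagging_replicas(), and send mail if the returned list is not empty. A
replica that never shows up in slurpd.status will not be reported by this
sketch, so you may want a separate check for that case.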

Frank Swasey                    | http://www.uvm.edu/~fcs
Systems Programmer              | Always remember: You are UNIQUE,
University of Vermont           |    just like everyone else.
        === God bless all inhabitants of your planet ===