Re: Testing for replication failures

adp wrote:
| While the list is discussing replication, I'd like to bring up the
| issue of determining when replication has failed.
| Currently, I can see only one case I can monitor for: a master and
| slave get out of sync, so the master begins producing an error log of
| the entries it can't replicate (for example, if the master sends an
| 'add' but the slave already has that entry). I monitor this by checking
| whether the error log file's mtime has changed, and if so, emailing an
| alert.
| Recently, however, I found that by mistake port 389/tcp on the slave
| had been firewalled, so the replog was growing and no replication was
| taking place. Is it possible for me to detect this type of error? I'd
| like to see a timeout error from slurpd in syslog, such as "Unable to
| replicate for X hours." Alternatively, is there a file I can monitor
| that would indicate something is wrong?
| (Yes, monitoring that we can connect to port 389/tcp would solve *this*
| problem, but I'm more concerned with the general case.)
| Basically, I want to be able to easily answer at all times the
| question "Is replication up and working properly?"

Look in slurpd.status
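A minimal sketch of how that check could be automated from cron. It assumes, without confirmation from the reply above, that each line of slurpd.status has the form host:port:timestamp:seq, where timestamp is the time of the last change slurpd successfully sent to that replica, and that every replog entry carries a `time:` line. A replica is treated as stuck when some change newer than its recorded timestamp has been waiting longer than a threshold. The function names, the two-hour threshold and the sample data are hypothetical; in practice you would read the status file and the pending replog from slurpd's working directory.

```python
def parse_status(text):
    """Parse slurpd.status text into {"host:port": last_replicated_time}.

    ASSUMED line format: host:port:timestamp:seq (hypothetical, check your version).
    """
    status = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split from the right so the hostname part is kept intact.
        host, port, ts, _seq = line.rsplit(":", 3)
        status[f"{host}:{port}"] = float(ts)
    return status


def replog_times(text):
    """Collect the value of every 'time:' line in a replog."""
    times = []
    for line in text.splitlines():
        if line.startswith("time:"):
            times.append(float(line.split(":", 1)[1]))
    return times


def lagging_replicas(status, times, now, max_lag):
    """Return replicas that have a pending change older than max_lag seconds."""
    lagging = []
    for replica, last_done in status.items():
        # Changes logged after the replica's last successful update are pending.
        pending = [t for t in times if t > last_done]
        if pending and now - min(pending) > max_lag:
            lagging.append(replica)
    return sorted(lagging)


if __name__ == "__main__":
    # Hypothetical sample data instead of real files.
    status_text = "ldap2.example.com:389:1000:0\n"
    replog_text = (
        "replica: ldap2.example.com:389\n"
        "time: 1500\n"
        "dn: cn=test,dc=example,dc=com\n"
        "changetype: delete\n"
    )
    stuck = lagging_replicas(
        parse_status(status_text),
        replog_times(replog_text),
        now=1500 + 3 * 3600,  # three hours after the pending change
        max_lag=2 * 3600,     # hypothetical two-hour threshold
    )
    print(stuck)  # ['ldap2.example.com:389']
```

Measuring the age of the oldest pending change, rather than the age of the last replicated one, avoids false alarms on a directory that simply has not changed for a while.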


