
Testing for replication failures



While the list is discussing replication, I'd like to bring up the issue of
determining when replication has failed.

Currently, I can see only one case I can monitor for: a master and slave
get out of sync, so the master begins writing an error log of the entries
it can't replicate (for example, the master sends an 'add' but the slave
already has that entry). I monitor this by checking whether the error log
file's mtime has changed and, if so, emailing an alert.
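
For reference, the mtime check described above can be sketched roughly as follows. This is only an illustration in Python, not the script I actually run: the function name `reject_log_changed` is hypothetical, the path of the error log is site-specific and has to be supplied by the caller, and the caller is assumed to store the last seen mtime between runs and to send the email itself.

```python
import os


def reject_log_changed(path, last_mtime):
    """Return (changed, current_mtime) for the error log at `path`.

    `last_mtime` is the value returned by the previous run (None on the
    first run). A missing file is reported as unchanged, since no error
    log usually means nothing has been rejected yet.
    """
    try:
        current = os.stat(path).st_mtime
    except FileNotFoundError:
        return False, None
    # Any difference counts as a change, including a file seen for the
    # first time, so a freshly created error log also raises an alert.
    return current != last_mtime, current
```

A cron job would call this once per run, send mail when the first element is true, and save the second element for the next run.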

Recently, however, I found that a mistake had been made and port 389/tcp
on the slave had been firewalled. So the replog was growing and no
replication was taking place. Is it possible for me to detect this type of
error? I'd like to see a timeout error from slurpd in syslog like "Unable to
replicate for X hours." Alternatively, is there a file I can monitor that
would indicate something is wrong?
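
To make the "unable to replicate for X hours" idea concrete, here is a rough Python sketch of the kind of check I have in mind, applied to a backlog file such as the replog. It rests on assumptions I have not verified: that a healthy setup periodically empties or shrinks the file, and that polling its size is a fair proxy for replication progress. The function name `backlog_stalled`, the `state` dictionary and the threshold are all hypothetical.

```python
import os
import time


def backlog_stalled(path, state, max_age_seconds, now=None):
    """Return True if the file at `path` has been non-empty and has not
    shrunk for at least `max_age_seconds`.

    `state` is a dict the caller keeps between polls; it holds the size
    seen at the previous poll and the time the current backlog began.
    """
    now = time.time() if now is None else now
    try:
        size = os.stat(path).st_size
    except FileNotFoundError:
        size = 0  # assumption: no file means no backlog

    previous = state.get("size")
    if size == 0:
        # Nothing is waiting, so there is no backlog to time.
        state["since"] = None
    elif previous is not None and size < previous:
        # The file shrank, so something is draining it: restart the clock.
        state["since"] = now
    elif state.get("since") is None:
        # First poll at which a backlog is seen.
        state["since"] = now
    state["size"] = size

    since = state["since"]
    return since is not None and now - since >= max_age_seconds
```

Polled every few minutes with, say, `max_age_seconds=4 * 3600`, this would flag the firewall case. It would also raise a false alarm on a master that is written to faster than the poll can ever observe the file shrinking, so the threshold needs tuning per site.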

(Yes, monitoring that we can connect to port 389/tcp would solve *this*
problem, but I'm more concerned with the general case.)

Basically, I want to be able to easily answer at all times the question "Is
replication up and working properly?"