We have five servers running OpenLDAP, 001 through 005. Each runs CentOS 6.4 with openldap-servers-2.4.23-32.el6_4.1.x86_64. The current replication topology is:
001 <=> 002
001 <=> 003
001 <=> 004
001 <=> 005
001 is where the phpLDAPAdmin GUI runs. 002 - 005 are behind a load balancer, and 001 is never accessed directly by clients. I understand this makes 001 a single point of failure for replication, but we would like to fix the current issues before exploring further changes.
The issue we are running into is intermittent replication failure. Replication is configured as multi-master with mirror mode. It always works from 001 to the other servers, but sometimes fails in the other direction. This is particularly bad when a user changes their password and the change is not replicated back to 001, because then it never reaches the remaining servers either. We sometimes see the following error message in the log, but at other times replication fails with nothing logged at all:
Error Log: Jan 21 10:56:42 001 slapd: do_syncrepl: rid=004 rc -2 retrying (4 retries left)
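In case it helps anyone tally these messages across servers, here is a minimal Python sketch that pulls the rid, return code, and remaining retry count out of such lines. The regular expression is my own and is based only on the single line quoted above, so other slapd message variants will not match; the function names are purely illustrative.

```python
import re
from collections import Counter

# Pattern derived from the single log line quoted above; an assumption,
# not an exhaustive description of slapd's syncrepl messages.
SYNCREPL_RETRY = re.compile(
    r"do_syncrepl: rid=(?P<rid>\d+) rc (?P<rc>-?\d+) retrying"
    r"(?: \((?P<left>\d+) retries left\))?"
)

def parse_syncrepl_retry(line):
    """Return (rid, rc, retries_left) for a retry message, else None."""
    m = SYNCREPL_RETRY.search(line)
    if m is None:
        return None
    left = int(m.group("left")) if m.group("left") is not None else None
    return m.group("rid"), int(m.group("rc")), left

def count_by_rid(lines):
    """Count retry messages per (rid, rc) pair."""
    counts = Counter()
    for line in lines:
        parsed = parse_syncrepl_retry(line)
        if parsed is not None:
            rid, rc, _ = parsed
            counts[(rid, rc)] += 1
    return counts
```

Feeding the log file's lines to `count_by_rid` should show which consumer rid is retrying most often and with which return code.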
Another issue is that the slapd service occasionally dies. Each server has a cron job that dumps the database with slapcat once an hour. Roughly once every two weeks we find slapd dead at right around the time slapcat ran. There is no obvious error in the LDAP log, the system log, or dmesg. According to the documentation it is safe to run slapcat while slapd is running; is this not true?
Below is the replication section of the configuration on 001 and 004. If someone could advise on this it would be very much appreciated.
interval=00:00:00:10 retry="5 5 300 5" timeout=1
* repeat for 003, 004, and 005 *
syncprov-checkpoint 1000 60
index entryCSN,entryUUID eq
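To spell out how I read the timing values in the syncrepl line, based on my understanding of slapd.conf(5): interval is in dd:hh:mm:ss form, and retry is a list of <seconds between attempts> <number of attempts> pairs, where + as the count means retry indefinitely. The Python sketch below expands those values; the helper names are my own, not anything from OpenLDAP.

```python
def interval_seconds(value):
    """Convert a syncrepl interval in dd:hh:mm:ss form to seconds."""
    days, hours, minutes, seconds = (int(part) for part in value.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

def retry_schedule(value):
    """Parse a retry string into (wait_seconds, attempts) pairs.

    attempts is None when the count is '+', meaning retry indefinitely.
    """
    fields = value.split()
    if len(fields) % 2 != 0:
        raise ValueError("retry must be pairs of <interval> <count>")
    pairs = []
    for wait, count in zip(fields[0::2], fields[1::2]):
        pairs.append((int(wait), None if count == "+" else int(count)))
    return pairs

def total_retry_window(pairs):
    """Seconds until retries are exhausted, or None if they never are."""
    if any(attempts is None for _, attempts in pairs):
        return None
    return sum(wait * attempts for wait, attempts in pairs)
```

With our settings, `interval_seconds("00:00:00:10")` is 10 and `retry_schedule("5 5 300 5")` is five attempts 5 seconds apart followed by five attempts 300 seconds apart, about 1525 seconds in total. If I read the man page correctly, the consumer stops retrying after that window, which may be relevant to the failures described above.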