
syncrepl broke, connection loss



Hi,

I've loaded my mirror mode setup with data and let it run for a few days.
Both cn=config and the application database are mirrored.
Only server1 is receiving writes from the application.

OpenLDAP 2.4.20, BDB 4.8

After about 6 hours the mirror partly broke, and I experienced 3 symptoms:

1)
The syncrepl connection from server1 to server2 for the application database is missing, so data only flows from server1 to server2, not the other way. The cn=config connections exist.

$ netstat -tna # shows
tcp    0  0 192.168.0.102:636    0.0.0.0:*            LISTEN
tcp 8125  0 192.168.0.102:45535  192.168.0.101:636    ESTABLISHED
tcp    0  0 192.168.0.102:636    192.168.0.101:34954  ESTABLISHED
tcp    0  0 192.168.0.102:45537  192.168.0.101:636    ESTABLISHED

Whereas it should show something like:
tcp    0  0 192.168.0.101:636    0.0.0.0:*            LISTEN
tcp    0  0 192.168.0.101:34954  192.168.0.102:636    ESTABLISHED
tcp  261  0 192.168.0.101:33409  192.168.0.102:636    ESTABLISHED
tcp    0  0 192.168.0.101:636    192.168.0.102:45537  ESTABLISHED
tcp    0  0 192.168.0.101:636    192.168.0.102:33226  ESTABLISHED
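
One detail worth checking in listings like these is the Recv-Q column: on an ESTABLISHED socket, a non-zero value means the kernel has received bytes that the application has not yet read. The following is a minimal sketch, not part of the original setup, of how such connections could be picked out of `netstat -tna` output. The names `parse_netstat` and `stalled` are hypothetical, and the sample text is the first listing above.

```python
from typing import List, NamedTuple


class Conn(NamedTuple):
    recv_q: int
    send_q: int
    local: str
    remote: str
    state: str


def parse_netstat(text: str) -> List[Conn]:
    """Parse 'netstat -tna' style lines: proto recv-q send-q local remote state."""
    conns = []
    for line in text.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a six-column tcp line.
        if len(fields) != 6 or not fields[0].startswith("tcp"):
            continue
        conns.append(
            Conn(int(fields[1]), int(fields[2]), fields[3], fields[4], fields[5])
        )
    return conns


def stalled(conns: List[Conn], threshold: int = 0) -> List[Conn]:
    """Return ESTABLISHED connections with more than `threshold` unread bytes."""
    return [c for c in conns if c.state == "ESTABLISHED" and c.recv_q > threshold]


SAMPLE = """\
tcp    0  0 192.168.0.102:636    0.0.0.0:*            LISTEN
tcp 8125  0 192.168.0.102:45535  192.168.0.101:636    ESTABLISHED
tcp    0  0 192.168.0.102:636    192.168.0.101:34954  ESTABLISHED
tcp    0  0 192.168.0.102:45537  192.168.0.101:636    ESTABLISHED
"""

if __name__ == "__main__":
    for c in stalled(parse_netstat(SAMPLE)):
        print(f"{c.local} -> {c.remote}: {c.recv_q} bytes unread")
```

Run against the first listing, this flags only the 192.168.0.102:45535 connection. Whether that unread data is related to the hang is a guess, not something the output proves.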

2)
Meanwhile the log on server1 says:
Dec  8 02:04:03 server1 slapd[6863]: do_syncrepl: rid=004 rc -1 retrying
Dec  8 02:05:03 server1 slapd[6863]: do_syncrepl: rid=004 rc -2 retrying
Dec  8 02:06:03 server1 slapd[6863]: do_syncrepl: rid=004 rc -2 retrying
etc...

The first such entry appeared around 6 hours after the start of the mirror.
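
To see when the retries started and stopped for each rid, the syslog lines can be summarized with a short script. This is only a sketch that assumes the line format shown above; `summarize_retries` is a hypothetical helper name, and the sample lines are the three log entries quoted in this message.

```python
import re

# Matches lines such as:
# Dec  8 02:04:03 server1 slapd[6863]: do_syncrepl: rid=004 rc -1 retrying
RETRY_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+slapd\[\d+\]:\s+"
    r"do_syncrepl: rid=(?P<rid>\d+) rc (?P<rc>-?\d+) retrying"
)


def summarize_retries(lines):
    """Group retry lines by (rid, rc) with first/last timestamp and a count."""
    summary = {}
    for line in lines:
        m = RETRY_RE.match(line)
        if not m:
            continue
        key = (m["rid"], int(m["rc"]))
        ts = " ".join(m["ts"].split())  # collapse the double space in "Dec  8"
        entry = summary.setdefault(key, {"first": ts, "last": ts, "count": 0})
        entry["last"] = ts
        entry["count"] += 1
    return summary


SAMPLE_LINES = [
    "Dec  8 02:04:03 server1 slapd[6863]: do_syncrepl: rid=004 rc -1 retrying",
    "Dec  8 02:05:03 server1 slapd[6863]: do_syncrepl: rid=004 rc -2 retrying",
    "Dec  8 02:06:03 server1 slapd[6863]: do_syncrepl: rid=004 rc -2 retrying",
]

if __name__ == "__main__":
    for (rid, rc), info in summarize_retries(SAMPLE_LINES).items():
        print(f"rid={rid} rc={rc}: {info['count']}x, "
              f"{info['first']} .. {info['last']}")
```

In practice the lines would come from the syslog file rather than a hard-coded list, for example by passing an open file object to `summarize_retries`.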

3)
If I try to change cn=config with ldapmodify on either server, server1 will hang, not answering queries until I restart it.
For instance, if I do:
----------
dn: cn=config
changetype: modify
replace: olcLogLevel
olcLogLevel: None sync
-----------
... it'll hang.

I was able to connect to and search the database from both servers, against both servers, using client certs. For example, on server1:
ldapsearch -H ldaps://server2/ -YEXTERNAL -b cn=data,dc=example,dc=com -s sub -D cn=config '(cn=*)' + \*

So it's not that the TCP connection can't be established.
This makes me suspect that this is related to this thread:
http://www.mail-archive.com/openldap-software@openldap.org/msg16028.html

Now, after 27 hours, the connection finally came back by itself, and replication works both ways.
The "rc -2 retrying" in the log on server1 stopped and was replaced by:

Dec  8 15:39:34 server1 slapd[11177]: do_syncrepl: rid=004 rc -2 retrying
Dec  8 15:40:34 server1 slapd[11177]: do_syncrepl: rid=004 rc -2 retrying
Dec  8 15:42:15 server1 slapd[11177]: => bdb_idl_insert_key: c_put id failed: DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock (-30994)
Dec  8 15:47:05 server1 slapd[11177]: => bdb_idl_delete_key: c_del id failed: DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock (-30994)
Dec  8 15:47:05 server1 slapd[11177]: conn=15694 op=16: attribute "entryCSN" index delete failure
Dec  8 15:47:06 server1 slapd[11177]: => bdb_idl_delete_key: c_del id failed: DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock (-30994)
Dec  8 15:47:06 server1 slapd[11177]: conn=15569 op=36: attribute "entryCSN" index delete failure
... and a bit more of the same.
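
To get a quick count of how many deadlock failures each BDB index routine logged, something like the following could be run over the log text. It is a rough sketch that assumes the message wording shown above; `count_deadlocks` is a hypothetical name, and the sample contains only the three deadlock entries quoted here.

```python
import re
from collections import Counter

# Captures the routine name from entries such as:
# => bdb_idl_delete_key: c_del id failed: DB_LOCK_DEADLOCK: ...
DEADLOCK_RE = re.compile(r"=> (bdb_idl_\w+): \w+ id failed: DB_LOCK_DEADLOCK")


def count_deadlocks(text: str) -> Counter:
    """Count DB_LOCK_DEADLOCK failures per bdb_idl_* routine in a log excerpt."""
    # findall scans the whole text, so entries need not be on separate lines.
    return Counter(DEADLOCK_RE.findall(text))


SAMPLE_LOG = (
    "Dec  8 15:42:15 server1 slapd[11177]: => bdb_idl_insert_key: c_put id failed: "
    "DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock (-30994)\n"
    "Dec  8 15:47:05 server1 slapd[11177]: => bdb_idl_delete_key: c_del id failed: "
    "DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock (-30994)\n"
    "Dec  8 15:47:06 server1 slapd[11177]: => bdb_idl_delete_key: c_del id failed: "
    "DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock (-30994)\n"
)

if __name__ == "__main__":
    for routine, n in count_deadlocks(SAMPLE_LOG).most_common():
        print(f"{routine}: {n}")
```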

Trying to modify cn=config with ldapmodify still makes server1 (and ldapmodify) hang though.

/Peter