(ITS#5454) syncrepl refreshAndPersist stops receiving
Full_Name: Rein Tollevik
Version: CVS head
Submission from: (NULL) (188.8.131.52)
Our persistent syncrepl consumers stop receiving data after a while, with no
indication of why :-( They don't recognize restarts of the producer, so the
only way to get the replication running again is to restart the consumers.
The consumers have a single bdb backend database that is replicated from the
producer, and use the auditlog overlay on this backend. There are 4 of them,
running in pairs as load-balanced search servers on two sites. The two servers
in each pair are identically configured, and they are all running 64-bit
Solaris 8, if that matters.
There are two master servers, one on each site, with a more complicated
configuration. They replicate subordinate backend databases between each other,
and the consumers replicate the glue suffix of these backends. These servers are
running Linux in 32- and 64-bit mode, and I have not seen the same type of
replication stops between these master servers.
netstat shows that the send queue is full on the producer side, and the
receive window is empty on the consumer, which to me looks as if the consumer
has stopped reading from the provider.
I have used a debugger to look at the servers after they have stopped receiving,
and the syncrepl task is sitting at the end of slapd_rq.task_list, with
next_sched.tv_sec==0 (which means it will not be scheduled normally?). The
syncrepl is configured with "retry=60 +". The si_conn of the syncinfo_t looks
normal, and its c_sd socket is on slap_daemon.sd_actives fd_set, but not on
sd_readers nor sd_writers. And none of the threads are running any syncrepl
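The socket state described above can be expressed as a small check. This is only a sketch: `struct daemon_sets` and `fd_is_orphaned` are hypothetical names made up for illustration, and the struct is a simplified stand-in for the three fd_sets mentioned, not the real slap_daemon layout.

```c
#include <sys/select.h>

/* Hypothetical stand-in for the three fd_sets named in the report.
 * The real slap_daemon structure in slapd is assumed to hold more
 * state than this. */
struct daemon_sets {
    fd_set actives;  /* descriptors the daemon knows about */
    fd_set readers;  /* descriptors watched for readability */
    fd_set writers;  /* descriptors watched for writability */
};

/* Returns 1 if fd is registered as active but is watched for neither
 * reading nor writing, which is the state seen in the debugger.
 * Returns 0 otherwise. */
static int fd_is_orphaned(struct daemon_sets *d, int fd)
{
    return FD_ISSET(fd, &d->actives)
        && !FD_ISSET(fd, &d->readers)
        && !FD_ISSET(fd, &d->writers);
}
```

A descriptor in this state will never wake the select loop, so data queued by the peer stays unread, which would be consistent with the full send queue seen on the producer.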
Looking at "slapd -d sync" output and the auditlogs it seems to stop receiving in
the middle of a burst of updates (although that could be a coincidence). The
other slave in the pair continues to receive updates, so I assume this is a
consumer side problem.
So far it looks to me as if the syncrepl thread has managed to return without
adding the connection socket back to slap_daemon.sd_readers. But whether that
is correct or not, and how it has managed to do it if so, I cannot tell.
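To make the suspected failure mode concrete, here is a minimal model of it. It is not OpenLDAP code: `struct listener`, `dispatch`, `worker` and the `early_exit` flag are all invented for illustration, and which real code path (if any) skips the re-arm is exactly what is unknown.

```c
#include <sys/select.h>

/* Hypothetical model of a select-based listener. */
struct listener {
    fd_set readers;  /* descriptors currently watched for input */
};

/* Listener side: stop watching fd while a worker handles it, so the
 * select loop does not hand the same descriptor to two threads. */
static void dispatch(struct listener *l, int fd)
{
    FD_CLR(fd, &l->readers);
}

/* Worker side: every exit path is supposed to put fd back into the
 * read set. The early_exit flag models an (assumed) return path that
 * forgets to do so. */
static void worker(struct listener *l, int fd, int early_exit)
{
    if (early_exit) {
        return;              /* fd is never watched again */
    }
    FD_SET(fd, &l->readers); /* normal path: re-arm the descriptor */
}
```

After `dispatch` followed by `worker(..., 1)`, the descriptor stays out of the read set while the connection itself remains open, which matches what the debugger shows.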
Any pointers as to what can be wrong are highly appreciated. I have a core file
from a server that has stopped receiving if there is anything I should look for