Re: (ITS#5133) Synchronous replication on slave doesn't notice lost network connection

On Thursday 13 September 2007 23:05:29 ando@sys-net.it wrote:
> audrius.valunas@teo.lt wrote:
> > There is synchronous replication between master and slave. When network
> > connectivity problems occur master closes tcp connection but slave
> > doesn't notice those problems; it still has the tcp connection open, but in
> > reality it is not receiving updates any more.
> > I think that can be solved adding some ack from slave because sending on
> > such a socket would fail and force slave to retry connection.
> Well, this should already be taken into consideration by SO_KEEPALIVE,
> which is always set when available on all connections.  I concur that it
> usually requires quite a long time before a connection is actually
> checked (usually more than 2 hours), so some better policy could be put
> in place.

I think I filed a previous ITS on this, but the servers exhibiting this 
behaviour in a remote site were lost (power supplies died) so I couldn't test 
Howard's fix at the time. We have recently installed some QA servers, which 
now also need to traverse a PIX firewall to get to the production master 
(from which they replicate one database), and I have seen the behaviour again 
(they go out of sync on most of the rare changes to this database until I 
restart them or the check kicks in).

I note that a keepalive probably needs to be sent at least once an hour for a 
PIX not to drop the connection. I haven't looked up any relevant RFCs on this 
though ...
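
For illustration only, here is a minimal sketch of what a shorter per-socket
keepalive policy could look like. It is not the OpenLDAP code and not
Howard's fix. The function name enable_keepalive and the values (600 s
idle, 60 s between probes, 5 probes) are my own assumptions, picked only to
stay well under the one-hour figure above. TCP_KEEPIDLE, TCP_KEEPINTVL and
TCP_KEEPCNT are Linux socket options and may be missing elsewhere, so the
sketch checks for them before use.

```python
import socket


def enable_keepalive(sock, idle=600, interval=60, count=5):
    """Turn on TCP keepalive and, where supported, shorten its timers.

    idle, interval and count are illustrative values (assumptions),
    not defaults taken from any existing implementation.
    """
    # Portable part: ask the kernel to send keepalive probes at all.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Linux-specific tuning; each option is skipped if the platform
    # does not expose it, leaving the system-wide default in effect.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idle time before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between successive unanswered probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes before the connection is reported dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return sock


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

With these example values a dead peer would be noticed after roughly
600 + 5 * 60 = 900 seconds of silence, and the probes themselves should
count as traffic that keeps the firewall's connection entry alive.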

I can now test a fix a lot more easily (since I can upgrade one of these 
servers at will, as opposed to the previous slaves, which were in production).