
Re: syncrepl with hardware content switch



On Friday 26 September 2008 03:54:10 Brett @Google wrote:
> Hello,
>
> I was wondering if anybody is using syncrepl in the context of a
> hardware content switch or redundant environment.
>
> That is, having a "hidden" syncrepl master, and 1-N syncrepl clients
> which receive updates from the master, and those client nodes (and
> only those nodes) are visible via a content switch, for the purposes
> of load sharing and redundancy (more predominantly the latter).
>
> I am considering the edge case where a connection is redirected to a
> client, and :
>
> a) client has no current data (new node introduced)
> b) client decides it needs to do a full refresh - perhaps it was down
> and missed a large number of updates

Both these cases can exist to some degree with slurpd, and in the end, the 
requirements (IMHO) are the same. Don't keep an out-of-sync slave in service. 
However, with syncrepl, at least you can much more easily monitor "in-
syncness".
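
For illustration only (the hostnames, base DN and choice of Python below are
placeholders, not anyone's actual setup), "in-syncness" can be checked by
comparing the contextCSN values on the provider and on a consumer:

#!/usr/bin/env python3
# Rough sketch: compare contextCSN between a provider and a consumer.
# Hostnames and the base DN are placeholders.
import subprocess
import sys

BASE_DN = "dc=example,dc=com"   # suffix of the replicated database

def context_csns(host):
    """Read the contextCSN value(s) from the database suffix entry."""
    out = subprocess.run(
        ["ldapsearch", "-x", "-LLL", "-H", "ldap://%s" % host,
         "-s", "base", "-b", BASE_DN, "contextCSN"],
        capture_output=True, text=True, timeout=10, check=True).stdout
    return {line.split(":", 1)[1].strip()
            for line in out.splitlines()
            if line.startswith("contextCSN:")}

provider = context_csns("master.example.com")    # hidden syncrepl master
consumer = context_csns("replica1.example.com")  # one visible replica

if provider == consumer:
    print("in sync")
    sys.exit(0)
print("NOT in sync: provider=%s consumer=%s" % (provider, consumer))
sys.exit(1)

With slurpd you would have to dig through replication logs or compare
database contents to get the same answer.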

>
> The problem is that while a replica is b) significantly incomplete or
> a) has no data at all, it should not be given any ldap requests by the
> content switch.
>
> A standard content switch either blindly sends connections round-robin
> to nodes 1-N, or first determines that a server is "listening" (say by
> sending a SYN probe) before it sends through the ldap request. Few
> content switches are smart enough to examine the ldap failure code;
> most just operate on tcp streams and don't do content inspection, so
> doing ldap content inspection is even less likely.

I don't see how the LDAP result code would help in any case, as there is no 
result code for "Not here, but should be".

>
> So this means that during the time a replica is initializing, ldap
> requests are going to incorrectly get "no results" where the answer
> should be "not applicable", and the content switch or ldap client
> should have tried again, getting another (already initialized) server.
>
> Ideally (in a content switch environment at least), the ldap server
> should not listen for requests while it is re-synchronising,

An option to start slapd and only have it answer requests once it is in 
sync has been discussed before ...

> but in
> the case of syncrepl push replication, replication can happen over
> the same port as ldap client requests.
>
> One answer would be if syncrepl could happen over its own port, as
> there could then be the option of not accepting (not listening?)
> or refusing connections on the client port, whilst syncrepl is
> (re)building on the syncrepl port.
>
> Alternatively, there could be a "health" port, which only accepted a
> connection and maybe returned "OK" if the replica was "healthy", this
> port could be specified as a "probe" port on the content switch, to
> determine the health of a syncrepl client.
>
> I was just wondering how other people are dealing with this issue and
> their content switches.

We monitor the replication status of our slaves with network monitoring 
software, which alarms if the slave is more than an hour out of sync. If a 
slave is out of sync for more than an hour, and doesn't recover, we take it 
out of service.
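
As an illustration only (the parsing below is an assumption on my part: an
OpenLDAP CSN starts with a YYYYMMDDHHMMSS UTC timestamp, and the contextCSN
values would be fetched as in the earlier sketch), the "how far behind is
this slave" part of such a check is basically:

# Sketch of the lag calculation only: given the newest contextCSN on the
# provider and on a consumer, report how far behind the consumer is and
# flag it above a threshold (one hour here, as in our monitoring).
# CSN values look like "20080926035410.123456Z#000000#000#000000"; only
# the leading UTC timestamp is compared.
from datetime import datetime

ALARM_SECONDS = 3600

def csn_time(csn):
    return datetime.strptime(csn[:14], "%Y%m%d%H%M%S")

def replication_lag(provider_csn, consumer_csn):
    return (csn_time(provider_csn) - csn_time(consumer_csn)).total_seconds()

def in_sync_enough(provider_csn, consumer_csn, threshold=ALARM_SECONDS):
    lag = replication_lag(provider_csn, consumer_csn)
    if lag > threshold:
        print("ALARM: replica is %d seconds behind the provider" % lag)
        return False
    return True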

However, we do see circumstances (e.g. when some application pushes 50 000 
deletes) where slaves (syncrepl, not delta-syncrepl) take more than an hour 
to catch up. If the load balancer took servers out of service automatically 
based on replication state, that would cause an unnecessary outage.

In my opinion, leave application monitoring to application/network monitoring 
software, and only have the load balancer do basic "is this service usable" 
monitoring (IOW, at most, do I see the right banner on SMTP/POP3/IMAP). Ensure 
your processes are able to connect those two dots.
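
The kind of probe I mean is, at most, something like this (a sketch; the
host is a placeholder, and an anonymous read of the root DSE assumes your
servers allow anonymous binds), roughly the LDAP equivalent of checking for
the right SMTP banner:

#!/usr/bin/env python3
# Minimal "is this service usable" probe: anonymous base search of the
# root DSE. Exit 0 if the server answers, 1 otherwise. The host is a
# placeholder; a load balancer or cron job could run this as the check.
import subprocess
import sys

HOST = "replica1.example.com"

try:
    subprocess.run(
        ["ldapsearch", "-x", "-LLL", "-H", "ldap://%s" % HOST,
         "-s", "base", "-b", "", "namingContexts"],
        capture_output=True, timeout=5, check=True)
except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
    sys.exit(1)
sys.exit(0)

Deliberately, it says nothing about replication state; that stays with the 
monitoring system.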

I have also seen outages caused by complex probes (e.g. ones that do pop3 
authentication) combined with removal/suspension of the account that was 
used in the probe.

Regards,
Buchan