
Re: persistent search and keepalives

The "problem" (I use the term lightly; it's just the situation we have to work with) as I see it is that a persistent search may legitimately have no traffic for quite some time. At Rutgers, we first saw issues with keepalives on slaves that refreshAndPersisted a portion of the DIT reflecting network configuration, which is to say a portion that didn't change much. The idletimeout on the provider was under two hours (two hours being the default TCP keepalive time), and libldap wasn't requesting SO_KEEPALIVE in the first place, so nothing accounted for idle connections at either the protocol or the application level. Adding keepalives at the protocol level was easy: the simple patch in ITS#4708 was sufficient.

In our case, we then tuned TCP keepalives (on the client) below the provider's idletimeout value, and we got the desired behavior -- the persistent search connection remains available, except in the case of a gross failure.
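As a rough illustration of the client-side tuning described above, here is a minimal Python sketch (not the libldap patch from ITS#4708). The helper names `pick_keepalive_idle` and `enable_keepalive` and the 60-second margin are assumptions made for the example; the per-socket `TCP_KEEP*` options are platform-specific (available on Linux) and are skipped where the platform lacks them.

```python
import socket


def pick_keepalive_idle(provider_idletimeout_s, margin_s=60):
    """Return a keepalive idle time (seconds) below the provider's idletimeout.

    Hypothetical helper: the margin is an arbitrary safety gap, not a
    recommended value.
    """
    if provider_idletimeout_s <= margin_s:
        raise ValueError("idletimeout too small for the chosen margin")
    return provider_idletimeout_s - margin_s


def enable_keepalive(sock, idle_s, interval_s=75, probes=9):
    """Request SO_KEEPALIVE and, where supported, tune it for this socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These option names are not defined on every platform, so look each
    # one up and silently skip any that are missing.
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between unanswered probes
        ("TCP_KEEPCNT", probes),        # unanswered probes before giving up
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
```

For a provider idletimeout of one hour, `enable_keepalive(sock, pick_keepalive_idle(3600))` would request a first probe after 3540 idle seconds. Many systems also expose system-wide keepalive settings, which is the other place this tuning can be done.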

Admittedly, per your message, these clients are slightly aggressive in the case of failure, and this may not be desirable behavior. But moving forward in the post-ITS#4440 world, I'm not sure how serious this will be. Now that we can give replication DNs their own idletimeout, we should be able to keep the SO_KEEPALIVE connection at the system defaults (two hours), reducing the load in the server-down case, and we should no longer lose the connection in the server-up case.

...of course, this could all be swept under the carpet with some application layer keepalive, as you discuss. I guess my point is that I'm not sure what we gain. If there is to be any application layer keepalive, I'd be more interested in the refreshAndPersist provider occasionally sending it, since that's the flow that we need working for the next time a MOD hits. But it's kind of wrong to think this only affects syncrepl: it affects many LDAP clients, and is probably a deeply ingrained issue (cf. http://www.openldap.org/lists/openldap-software/200504/msg00445.html). Should application layer keepalives be published as "The Way To Do It"? Would it make sense for this method to be in (an OpenLDAP extension to) libldap for other affected applications to use? Or does it make more sense to just say "OpenLDAP Software depends on the OS/network/firewalls to do their job, make sure they are configured to detect networking failures and pass them upwards, we will reconnect when told to"?

On Mon, 17 Sep 2007, Howard Chu wrote:

Following on from ITS#5133, there are a couple different scenarios to deal with...

1) the remote network segment has disappeared (or the remote server has crashed)
2) an intervening firewall has killed the connection

Neither case is really distinguishable from the consumer side. In the case of a hardware failure, where either the remote host or the network to the host has failed, there's little to be gained by setting an aggressive retry policy. Failures of that sort tend to take a non-trivial amount of time to repair. I've seen some app guides recommending keepalives be sent once a minute or so; to me that is way overdoing things.

In the case of a firewall closing an idle connection, you really have to ask yourself what you're trying to accomplish - are you trying to send probes frequently enough to prevent the connection from closing, or are you just trying to detect that it has closed? This may be giving too much credit to the firewall admins, but I'd guess that they've set an idle timeout that is appropriate for the load that their networks see. Artificially inflating traffic on a connection to prevent it from appearing idle would just be an abuse of network resources. It's also possible that a stateful firewall will start dropping connections because it's been overwhelmed by traffic, and simply doesn't have the memory to track all the live connections. Keeping the connection open in these circumstances would just make a bad situation worse.

As such, it seems to me that you don't really want to be setting very short keepalive timeouts anywhere. The default of 2 hours that most systems use seems pretty reasonable.
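To put rough numbers on the trade-off, the following sketch does the arithmetic for two idle settings. The default values used (7200 s idle, 75 s probe interval, 9 probes) are the usual Linux defaults and are an assumption here, since other systems differ; the function names are made up for the example.

```python
def dead_peer_detection_seconds(idle_s=7200, interval_s=75, probes=9):
    """Approximate time for TCP keepalive to declare a silent peer dead:
    the idle period, then `probes` unanswered probes `interval_s` apart."""
    return idle_s + interval_s * probes


def probes_per_day(idle_s):
    """Approximate probes sent per day on a healthy but otherwise idle
    connection, where each answered probe restarts the idle timer."""
    return 86400 // idle_s
```

With the assumed defaults, a dead peer is noticed after about 7875 seconds (roughly 2 hours 11 minutes) at a cost of about 12 probes per day per idle connection. Dropping the idle time to one minute brings detection down to about 735 seconds but raises the cost to about 1440 probes per day per connection.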

On the other hand, it would probably be useful to be able to prod the consumer and have it kick the connection on request. In the past I've implemented this sort of thing using Search requests with magic filters. I.e., treat the Search operation as an RPC: the target object is simply an embedded method, and the AVAs in the filter comprise a named parameter list.

So e.g. one might do a search on "cn=Sync Consumers,cn=monitor" with filter (|(objectclass=*)(kick=TRUE)) to cause every active consumer to probe its connections.
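The filter-as-parameter-list idea can be sketched in a few lines of Python. This is a toy illustration only, not how slapd handles filters: the regular expression picks up only simple `(attr=value)` items rather than implementing the full LDAP filter grammar, and the function names `filter_params` and `wants_kick` are hypothetical.

```python
import re

# Matches simple "(attr=value)" items only. Operators such as >=, <= and ~=
# and the enclosing &, | and ! nodes are deliberately ignored.
_AVA = re.compile(r"\(([A-Za-z][A-Za-z0-9;-]*)=([^()]*)\)")


def filter_params(ldap_filter):
    """Read the simple items of a filter string as a named parameter list."""
    return {attr.lower(): value for attr, value in _AVA.findall(ldap_filter)}


def wants_kick(ldap_filter):
    """True if the filter carries the 'kick=TRUE' parameter from the example."""
    return filter_params(ldap_filter).get("kick", "").upper() == "TRUE"
```

Given the filter above, `filter_params("(|(objectclass=*)(kick=TRUE))")` yields the two parameters `objectclass` and `kick`, and `wants_kick` returns True. A backend using this pattern would then act on the parameters and report the outcome as attributes of the entries it returns.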

I like this approach a lot better than Modifying an object, because you can hit many objects at once with a Search request, and receive all of their execution results as attributes of the returned entries.
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/