
Re: Socket-level timeouts?



I think you might be confusing LDAP_OPT_NETWORK_TIMEOUT and LDAP_OPT_TIMEOUT. (Or maybe I am...) But as I recall, NETWORK_TIMEOUT applies to the initial connect(), whereas you're referring to ongoing conversations.

For that matter, I'm having a hard time envisioning the situation you describe playing out. Let's say your server dies hard and you reboot it. Then your client, blissfully unaware of this, sends some packets over its open connection. The rebooted server sees the packets, but doesn't have a matching TCP flow, so it's going to tell you to bug off: I'd expect a "typical OS" to send a TCP reset in response. At that point, libldap should produce LDAP_SERVER_DOWN or something of that flavor, and the client will of course have no bugs and handle this with perfect grace.

Finally, libldap does use TCP keepalive nowadays. If an intermediate network path dies hard (which can't be relied upon to nicely produce TCP resets), the underlying keepalive mechanism should pick that up.
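
For anyone unfamiliar with the mechanism, here is a minimal sketch of turning on TCP keepalive for a plain socket. It is not libldap's actual code; the function name enable_keepalive and its parameters are made up for illustration, and the per-socket tuning knobs are only applied where the platform defines them (they are not part of POSIX).

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: enable TCP keepalive on an existing TCP socket.
 * idle_secs, interval_secs and probe_count are illustrative tuning values;
 * each is applied only if the platform exposes the matching option. */
int enable_keepalive(int fd, int idle_secs, int interval_secs, int probe_count)
{
    int on = 1;

    /* Ask the kernel to probe the peer when the connection sits idle. */
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0)
        return -1;

#ifdef TCP_KEEPIDLE
    /* Idle time in seconds before the first probe is sent. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_secs, sizeof(idle_secs)) != 0)
        return -1;
#endif
#ifdef TCP_KEEPINTVL
    /* Seconds between successive probes. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_secs, sizeof(interval_secs)) != 0)
        return -1;
#endif
#ifdef TCP_KEEPCNT
    /* Number of unanswered probes before the connection is dropped. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probe_count, sizeof(probe_count)) != 0)
        return -1;
#endif

    /* Silence unused-parameter warnings where the options are missing. */
    (void)idle_secs;
    (void)interval_secs;
    (void)probe_count;
    return 0;
}
```

Note that keepalive only detects a dead peer or path. A server that is alive at the TCP level but has stopped answering requests will still acknowledge the probes.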

On Tue, 8 Apr 2008, Chris Adams wrote:

We've noticed hard failures on both our Linux and Mac workstations when an LDAP server fails in a way which causes it to stop responding but leave a connection open (e.g. lock contention, disk failure). This usually ends up requiring the system to be rebooted because a key system process will probably have made a call which is waiting on a read() which might take days to fail.

I've created a patch which simply calls setsockopt() to set SO_SNDTIMEO and SO_RCVTIMEO when LDAP_OPT_NETWORK_TIMEOUT has been set. This appears to produce the desired result on Linux (both with pam_ldap and the ldap utilities) and OS X (within the DirectoryService plugin).
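
The general technique described above can be sketched roughly as follows. This is not the patch itself, only an illustration of socket-level timeouts on a plain file descriptor; the helper name set_io_timeouts and its seconds parameter are assumptions made for the example.

```c
#include <sys/socket.h>
#include <sys/time.h>

/* Hypothetical helper: apply the same timeout to blocking reads and writes
 * on a socket. Returns 0 on success, -1 if either setsockopt() call fails. */
int set_io_timeouts(int fd, long seconds)
{
    struct timeval tv;
    tv.tv_sec = seconds;
    tv.tv_usec = 0;

    /* A blocked read()/recv() gives up after the timeout instead of waiting forever. */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) != 0)
        return -1;

    /* Likewise for write()/send() stalled on a full send buffer. */
    if (setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv)) != 0)
        return -1;

    return 0;
}
```

With these options set, a read that receives no data within the timeout returns -1 with errno set to EAGAIN or EWOULDBLOCK, so the caller has to treat that result as a timeout rather than as a non-blocking "try again".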

Is there a drawback to this approach which I've missed? It appears that the issue has come up in the past but there's no solution that I can see (certainly nothing else uses socket-level timeouts). I'd like to find a solution for this as it's by far the biggest source of Linux downtime in our environment.

Thanks,
Chris