
(ITS#6276) paused pool can deadlock if writers are waiting



Full_Name: Howard Chu
Version: RE24/HEAD
OS: Solaris 10
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (76.91.220.157)
Submitted by: hyc


test050 hung on me after some number of iterations. Unfortunately I didn't save
the stack traces, but basically there was one thread waiting in send_ldap_ber()
on the write2 cv, and another thread in config_back_add() waiting for a pool
pause to succeed. netstat showed that no connections had queued data, so there
should have been no reason for the writer to still be waiting.

I believe what happened here is that while the writer was waiting (it was a
syncprov qtask replaying events for a psearch) the psearch connection got
closed. On Solaris slapd uses select(), and select() doesn't distinguish
socket close events - they're reported as read events. The deadlock arises because
we queue read events into the thread pool, and we don't discover they're
actually closed sockets until the read thread gets to run and tries to read from
the socket (and gets zero bytes back). But since the pool is entering a pause,
the reader thread cannot run, so it can't detect the hangup and dispose of the
waiting writer.
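
To illustrate the select() behavior described above, here is a minimal sketch (not slapd code). The helper names select_says_readable() and closed_peer_demo() are made up for this example, and it uses an AF_UNIX socketpair in place of a real client connection. After the peer closes, select() simply marks the descriptor readable; only the subsequent read() returning zero reveals the close.

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 1 if select() marks fd readable within one second, else 0. */
static int select_says_readable(int fd)
{
    fd_set rfds;
    struct timeval tv = { 1, 0 };

    assert(fd >= 0 && fd < FD_SETSIZE);
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    return select(fd + 1, &rfds, NULL, NULL, &tv) == 1 && FD_ISSET(fd, &rfds);
}

/* Hypothetical demo: create a connected pair, close one end, and report
 * what select() and read() say about the surviving end. *readable gets
 * select()'s verdict; the return value is read()'s byte count, or -1 if
 * the pair could not be created. */
static long closed_peer_demo(int *readable)
{
    int sv[2];
    char buf[16];
    long n;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    close(sv[1]);                            /* the peer hangs up */
    *readable = select_says_readable(sv[0]); /* looks like a plain read event */
    n = (long)read(sv[0], buf, sizeof buf);  /* 0 bytes: this is the hangup */
    close(sv[0]);
    return n;
}
```

In the scenario above, the read() step is the part that is stuck behind the pool pause.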

The ideal fix for this is to process hangup events inline in the listener thread
instead of pushing them into the thread pool. But that requires being able to
cheaply determine that a hangup actually occurred, and select() doesn't give us
this information.

We could get this info using poll() instead. Since nowadays any POSIX platform
that implements select() also implements poll(), we can probably just switch to
poll() and drop select(). One exception is Windows; Winsock only supports poll()
(as WSAPoll()) on Windows Vista and newer.
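
A rough sketch of what the listener could do with poll(), assuming a hypothetical classify_fd() helper and conn_state enum (neither exists in slapd). It treats POLLHUP, POLLERR and POLLNVAL as a hangup that can be handled inline, and plain POLLIN as a read to queue. Exactly which flags a given platform sets on a peer close varies; a TCP half-close may still show up only as POLLIN, so this is not a complete answer on its own.

```c
#include <assert.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical classification used only in this sketch. */
enum conn_state { CONN_IDLE, CONN_READABLE, CONN_HANGUP };

/* Poll a single descriptor and classify the result. A timeout or a
 * poll() failure is reported as CONN_IDLE. */
static enum conn_state classify_fd(int fd, int timeout_ms)
{
    struct pollfd pfd;

    assert(fd >= 0);
    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    if (poll(&pfd, 1, timeout_ms) <= 0)
        return CONN_IDLE;
    if (pfd.revents & (POLLHUP | POLLERR | POLLNVAL))
        return CONN_HANGUP;     /* could be handled in the listener thread */
    if (pfd.revents & POLLIN)
        return CONN_READABLE;   /* would be queued to the thread pool */
    return CONN_IDLE;
}

/* Hypothetical demo: close one end of a socketpair and classify the
 * other end. Returns CONN_IDLE if the pair could not be created. */
static enum conn_state hangup_demo(void)
{
    int sv[2];
    enum conn_state st;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return CONN_IDLE;
    close(sv[1]);
    st = classify_fd(sv[0], 1000);
    close(sv[0]);
    return st;
}
```

Note that a hangup can be reported while unread data is still buffered, so an inline hangup handler would have to decide whether that data still matters.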

(Note, we had a patch that added a connection_hangup() handler for Linux epoll()
at one point, but I dropped it later because it seemed to have strange
interactions with Samba. Should look into resurrecting it again.)

I don't think we can really fix this issue without knowing for certain when
hangup events occur. If we're forced to keep using select, that implies that the
main listener thread must attempt a read on the socket before deciding how to
dispatch the connection. Any thoughts?
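
For the select() fallback, the extra read in the listener might look something like the sketch below. peer_has_closed() is a hypothetical name, and the code assumes MSG_PEEK and MSG_DONTWAIT are available for recv(); a platform without MSG_DONTWAIT would need the socket set non-blocking instead. The peek leaves any pending data in place for the worker thread.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>

/* Returns 1 if the peer has closed the connection or the socket is in a
 * hard error state, 0 if data is pending or nothing is available yet. */
static int peer_has_closed(int fd)
{
    char c;
    ssize_t n;

    assert(fd >= 0);
    do {
        /* Peek one byte without consuming it and without blocking. */
        n = recv(fd, &c, 1, MSG_PEEK | MSG_DONTWAIT);
    } while (n < 0 && errno == EINTR);

    if (n == 0)
        return 1;   /* orderly shutdown by the peer */
    if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
        return 1;   /* hard error, e.g. ECONNRESET */
    return 0;       /* data pending, or no event at all */
}
```

One limitation: if the client sent data and then closed, the peek sees the data first, so the close is only detected once that data has been consumed. It also costs an extra system call per read event in the listener.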