
Re: RE24 connection code reworking

Pierangelo Masarati wrote:
No more failures of this kind; however, now I intermittently get
replication failures:

The problem persists (only once in a while). It might still be connection-related, since the logs of server #3, the proxy that pushes replication to the consumer, are stuffed with tons of "connection_read(...): no connection!"

What kind of system are you running on? Linux / multiprocessor?

One of the problems with epoll() on Linux is that it wakes up for HANGUP events all the time (they are not selectable in the input options; they're delivered regardless of whether you choose to wait for them or not). This also means we can't shut the notifications off when we acknowledge/act on them. So you'll get lots of repeated wakeups for the same hangup event. The new connection_hangup() function processes these inline for normal connections, but it still falls into the connection_read thread handling for client connections, so their normal cleanup handlers can be invoked. If your server is too busy, it will take a while for the submitted thread to execute, and then you'll get a lot of these spurious messages.
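The always-delivered hangup behaviour described above can be reproduced outside slapd. The following is a minimal sketch, not slapd code: `count_hup_wakeups()` is a hypothetical helper that registers the read end of a pipe for EPOLLIN only, closes the write end, and then polls twice in the default level-triggered mode. On Linux both polls should report EPOLLHUP, even though it was never requested and nothing has changed between the two calls.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper (not part of slapd): returns how many of two
 * consecutive epoll_wait() calls reported EPOLLHUP for a descriptor
 * that was registered for EPOLLIN only, or -1 on setup failure. */
int count_hup_wakeups(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    int epfd = epoll_create1(0);
    if (epfd < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    struct epoll_event ev = {0};
    ev.events = EPOLLIN;          /* EPOLLHUP deliberately not requested */
    ev.data.fd = fds[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) != 0) {
        close(epfd);
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    close(fds[1]);                /* last writer goes away: hangup */

    int hups = 0;
    for (int i = 0; i < 2; i++) {
        struct epoll_event out;
        int n = epoll_wait(epfd, &out, 1, 100);
        if (n == 1 && (out.events & EPOLLHUP))
            hups++;               /* same hangup, reported again */
    }

    close(epfd);
    close(fds[0]);
    return hups;
}
```

A caller that gets 2 back has seen the same hangup twice. The repeats stop only once the descriptor is closed or removed with EPOLL_CTL_DEL, which is why a cleanup that is deferred to a busy thread pool leaves room for many redundant wakeups.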

I've been experimenting with epoll's edge-triggered and oneshot modes, which would prevent multiple wakeups for the same event. Unfortunately, when I set them it seems that the events can't be *re-enabled* when we want them, and so slapd hangs. Still looking at this.
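For reference, epoll(7) documents that a descriptor registered with EPOLLONESHOT is disabled after one event is delivered and has to be re-armed with epoll_ctl(EPOLL_CTL_MOD). The sketch below shows only that documented mechanism on a pipe; `oneshot_rearm_demo()` is a hypothetical helper, and it neither reproduces slapd's daemon loop nor explains the hang described above.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper (not part of slapd): stores the return values of
 * three epoll_wait() calls in results[] and returns 0 on success, -1 on
 * setup failure. The pipe stays readable throughout because the byte
 * written to it is never consumed. */
int oneshot_rearm_demo(int results[3])
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    int epfd = epoll_create1(0);
    if (epfd < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    int rc = -1;
    struct epoll_event ev = {0};
    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = fds[0];

    if (write(fds[1], "x", 1) == 1 &&
        epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) == 0) {
        struct epoll_event out;

        /* First wait: the pending byte triggers the one shot. */
        results[0] = epoll_wait(epfd, &out, 1, 100);

        /* Second wait: still readable, but the registration is disabled. */
        results[1] = epoll_wait(epfd, &out, 1, 0);

        /* Re-arm with EPOLL_CTL_MOD, then wait again. */
        if (epoll_ctl(epfd, EPOLL_CTL_MOD, fds[0], &ev) == 0) {
            results[2] = epoll_wait(epfd, &out, 1, 100);
            rc = 0;
        }
    }

    close(epfd);
    close(fds[0]);
    close(fds[1]);
    return rc;
}
```

If the re-arm step is skipped, or happens on a path that is never reached, the descriptor stays silent indefinitely, which from the outside would look like a hang.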

But that's beside the point - you shouldn't be seeing any replication failures at all, regardless of connection close handling. What else are you seeing now?

  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/