Re: RE24 connection code reworking

To: Pierangelo Masarati <ando@sys-net.it>
Subject: Re: RE24 connection code reworking
From: Howard Chu <hyc@symas.com>
Date: Mon, 26 Jan 2009 13:58:53 -0800
Cc: openldap-devel@openldap.org
In-reply-to: <497CEC32.9070302@sys-net.it>
References: <5A4E878BFC21CE6C4913A395@[192.168.1.199]> <497AE927.1030900@sys-net.it> <497AEA8C.9040604@sys-net.it> <497AF211.6070207@sys-net.it> <497B0ACB.3070605@sys-net.it> <497B0CE1.3080705@sys-net.it> <497CD83E.1060806@sys-net.it> <497CEC32.9070302@sys-net.it>
User-agent: Mozilla/5.0 (X11; U; Linux i686; rv:1.9.1b3pre) Gecko/20090115 SeaMonkey/2.0a1pre Firefox/3.0.3

Pierangelo Masarati wrote:

No more failures of this kind; however, now I intermittently get
replication failures:


The problem persists (only once in a while).  It might still be
connection-related, since the logs of server #3, the proxy that pushes
replication to the consumer, are stuffed with tons of
"connection_read(...): no connection!"


What kind of system are you running on? Linux / multiprocessor?

One of the problems with epoll() on Linux is that it wakes up for HANGUP events all the time (they are not selectable in the input options; they're delivered regardless of whether you choose to wait for them or not). This also means we can't shut the notifications off when we acknowledge/act on them. So you'll get lots of repeated wakeups for the same hangup event. The new connection_hangup() function processes these inline for normal connections, but it still falls into the connection_read thread handling for client connections, so their normal cleanup handlers can be invoked. If your server is too busy, it will take a while for the submitted thread to execute, and then you'll get a lot of these spurious messages.
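
To illustrate the always-delivered hangup behaviour described above, here is a minimal sketch in C. It is not taken from slapd: the function name count_hup_wakeups() is hypothetical, and it uses a pipe rather than a socket to keep the example short. The read end is registered with EPOLLIN only, the writer is closed, and epoll_wait() is then polled several times in the default level-triggered mode.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper, not part of OpenLDAP.
 * Registers a pipe's read end for EPOLLIN only, closes the write end,
 * then polls `tries` times. Returns how many of those polls reported
 * EPOLLHUP, or -1 on setup failure. */
int count_hup_wakeups(int tries)
{
    int fds[2];
    int hups = 0;
    struct epoll_event ev = { 0 }, out;

    if (pipe(fds) < 0)
        return -1;

    int ep = epoll_create1(0);
    if (ep < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    ev.events = EPOLLIN;            /* EPOLLHUP is deliberately not requested */
    ev.data.fd = fds[0];
    if (epoll_ctl(ep, EPOLL_CTL_ADD, fds[0], &ev) < 0) {
        close(ep);
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    close(fds[1]);                  /* writer goes away: hangup on the read end */

    for (int i = 0; i < tries; i++) {
        /* Zero timeout: just ask what is ready right now. */
        int n = epoll_wait(ep, &out, 1, 0);
        if (n == 1 && (out.events & EPOLLHUP))
            hups++;                 /* same hangup, reported again */
    }

    close(fds[0]);
    close(ep);
    return hups;
}
```

As long as the descriptor stays registered, every poll should report the same hangup again, which is the repeated wakeup pattern mentioned above.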

I've been experimenting with epoll's edge-triggered and oneshot modes, which would prevent multiple wakeups from occurring for the same event. Unfortunately, when I set them, it seems that the events can't be *re-enabled* when we want them, and so slapd hangs. Still looking at this.
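
For reference, the re-arm mechanism documented in epoll_ctl(2) for EPOLLONESHOT is a call with EPOLL_CTL_MOD on the same descriptor. The sketch below shows only that documented mechanism in isolation, on a pipe; it does not reproduce or explain the slapd hang described above, and oneshot_demo() is a hypothetical name, not an OpenLDAP function.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper, not part of OpenLDAP.
 * Registers a readable pipe end with EPOLLIN | EPOLLONESHOT and records
 * the epoll_wait() return value at three points:
 *   seen[0]  first poll after the data arrives
 *   seen[1]  second poll, before re-arming
 *   seen[2]  poll after re-arming with EPOLL_CTL_MOD
 * Returns 0 on success, -1 on setup failure. */
int oneshot_demo(int seen[3])
{
    int fds[2], rc = -1;
    struct epoll_event ev = { 0 }, out;

    if (pipe(fds) < 0)
        return -1;

    int ep = epoll_create1(0);
    if (ep < 0)
        goto done_pipe;

    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = fds[0];
    if (epoll_ctl(ep, EPOLL_CTL_ADD, fds[0], &ev) < 0)
        goto done;

    /* Make the read end readable and leave the byte unread. */
    if (write(fds[1], "x", 1) != 1)
        goto done;

    seen[0] = epoll_wait(ep, &out, 1, 0);   /* event delivered once */
    seen[1] = epoll_wait(ep, &out, 1, 0);   /* now disabled by ONESHOT */

    /* Re-arm: same descriptor, same event mask. */
    if (epoll_ctl(ep, EPOLL_CTL_MOD, fds[0], &ev) < 0)
        goto done;

    seen[2] = epoll_wait(ep, &out, 1, 0);   /* data is still pending */
    rc = 0;

done:
    close(ep);
done_pipe:
    close(fds[0]);
    close(fds[1]);
    return rc;
}
```

In a multithreaded server the re-arm call would have to be issued by whichever thread finishes handling the event, so the ordering between that thread and the listener matters; the sketch sidesteps this by running everything in one thread.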

But that's beside the point - you shouldn't be seeing any replication failures at all, regardless of connection close handling. What else are you seeing now?

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
