[Date Prev][Date Next] [Chronological] [Thread] [Top]

Re: (ITS#7113) broken read-only replica results in assertion failure and crash on master slapd process



I've investigated this issue a little bit more since my initial bug
report.

With loglevel conns sync in slapd.conf, the following are the last
messages written to the log before slapd crashes:

  Jan 12 09:33:19 shelley slapd[2482]:  
  Jan 12 09:33:19 shelley slapd[2482]: daemon: write active on 15 
  Jan 12 09:33:19 shelley slapd[2482]: daemon: epoll: listen=9 active_threads=0 tvp=zero 
  Jan 12 09:33:19 shelley slapd[2482]: daemon: epoll: listen=10 active_threads=0 tvp=zero 
  Jan 12 09:33:19 shelley slapd[2482]: daemon: activity on 1 descriptor 
  Jan 12 09:33:19 shelley slapd[2482]: daemon: activity on: 
  Jan 12 09:33:19 shelley slapd[2482]:  15rw 
 
  Jan 12 09:33:19 shelley slapd[2482]: ber_flush2 failed errno=32 reason="Broken pipe" 
  Jan 12 09:33:19 shelley slapd[2482]: connection_closing: readying conn=1377 sd=15 for close 
  Jan 12 09:33:19 shelley slapd[2482]: send_search_entry: conn 1377  ber write failed. 
  Jan 12 09:33:19 shelley slapd[2482]: connection_resched: attempting closing conn=1377 sd=15 
  Jan 12 09:33:19 shelley slapd[2482]: daemon: removing 15 
  Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used 
  Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection! 
  Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error 
  Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used 
  Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection! 
  Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error 
  Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used 
  Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection! 
  Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error 
  Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used 
  Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection! 
  Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error 
  Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used 
  Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection! 
  Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error 
  Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used 
  Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection! 
  Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error 
  Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used 
  Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection! 
  Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error 
  Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used 
  Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection! 
  Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error 
  Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used 
  Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection! 
  Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error 
  Jan 12 09:33:19 shelley slapd[2482]:  
  Jan 12 09:33:19 shelley slapd[2482]: daemon: write active on 15

"daemon: write active on 15" comes from line 2660 or so of daemon.c.
Here's that with some context:

    for ( i = 0; nwfds > 0; i++ ) {
        ber_socket_t wd;
        if ( ! SLAP_EVENT_IS_WRITE( i ) ) continue;
        wd = i;

        SLAP_EVENT_CLR_WRITE( wd );
        nwfds--;

        Debug(LDAP_DEBUG_CONNS,
            "daemon: write active on %d\n",
            wd, 0, 0);

        /*
         * NOTE: it is possible that the connection was closed and that
         * the stream is now inactive.  connection_write() must validate
         * the stream is still active.
         *
         * ITS#4338: if the stream is invalid, there is no need to close it
         * here.  It has already been closed in connection.c.
         */
        if ( connection_write( wd ) < 0 ) {
            if ( SLAP_EVENT_IS_READ( wd)) {
                SLAP_EVENT_CLR_READ( (unsigned) wd);
                nrfds--;
            }
        }
    }

We do some housekeeping on the descriptor, then call connection_write to
take care of the write. Here's connection_write from connection.c:

  int connection_write(ber_socket_t s)
  {
      Connection *c;
      Operation *op;
  
      assert( connections != NULL );
  
      slapd_clr_write( s, 0 );

      c = connection_get( s );
      if( c == NULL ) {
          Debug( LDAP_DEBUG_ANY,
              "connection_write(%ld): no connection!\n",
              (long)s, 0, 0 );
          return -1;
      }
      [...]

connection_get seems to perform the validation that is mentioned by the
comment in daemon.c, but connection_get is called after slapd_clr_write,
which is the source of the assertion implicated by the backtrace in the
initial bug report:

  void
  slapd_clr_write( ber_socket_t s, int wake )
  {
    int id = DAEMON_ID(s);
    ldap_pvt_thread_mutex_lock( &slap_daemon[id].sd_mutex );
  
    if ( SLAP_SOCK_IS_WRITE( id, s )) {
        assert( SLAP_SOCK_IS_ACTIVE( id, s ));
  
        SLAP_SOCK_CLR_WRITE( id, s );
        slap_daemon[id].sd_nwriters--;
    }
  
    ldap_pvt_thread_mutex_unlock( &slap_daemon[id].sd_mutex );
    WAKE_LISTENER(id,wake);
  }

I'm not sure if connection_write is supposed to validate that a stream
is active before or after calling slapd_clr_write, but it seems like the
assertion wouldn't be an issue if that validation were performed before
calling slapd_clr_write. To test this thought, I rebuilt openldap 2.4.28
with the following patch:

  --- openldap-2.4.28/servers/slapd/connection.c    2011-11-25 10:52:29.000000000 -0800
  +++ openldap-2.4.28-new/servers/slapd/connection.c    2012-01-12 13:35:45.000000000 -0800
  @@ -1893,8 +1893,6 @@
   
    assert( connections != NULL );
   
  - slapd_clr_write( s, 0 );
  -
    c = connection_get( s );
    if( c == NULL ) {
        Debug( LDAP_DEBUG_ANY,
  @@ -1903,6 +1901,8 @@
        return -1;
    }
   
  + slapd_clr_write( s, 0 );
  +
   #ifdef HAVE_TLS
    if ( c->c_is_tls && c->c_needs_tls_accept ) {
        connection_return( c );

and tried to reproduce the problem under the same circumstances as
reported in my initial bug report. The master slapd tolerated the
misconfigured replicas for 5 days without crashing; before, it would
crash reliably within a half hour or so. I didn't notice any regressions
due to the patch, though the master slapd wasn't exposed to a typical
workload during the experiment.

Any thoughts on this patch?
-- 
Kevan Carstensen                        <kacarstensen@csupomona.edu>
Operating Systems Analyst, I&IT Systems, Cal Poly Pomona