
(ITS#5835) master slapd dying on lost writes



Full_Name: Quanah Gibson-Mount
Version: 2.3.43
OS: Linux 2.6
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (75.111.29.239)


Multiple clients have reported their master servers dying.
Examination of the logs showed frequent lost connections.  Today I finally
isolated a server where the crash occurred frequently, ran it under GDB, and
found the following in the backtrace:

Thread 3 (Thread 1140881760 (LWP 8942)):
#0  0x0000003fca02e21d in raise () from /lib64/tls/libc.so.6
#1  0x0000003fca02fa1e in abort () from /lib64/tls/libc.so.6
#2  0x0000003fca027ae1 in __assert_fail () from /lib64/tls/libc.so.6
#3  0x000000000042b55e in connection_close (c=0x3670030) at connection.c:877
#4  0x000000000042ca18 in connection_read (s=26, cri=0x44006e10) at connection.c:1458
#5  0x000000000042c1f8 in connection_read_thread (ctx=0x44006e90, argv=0x1a) at connection.c:1254
#6  0x0000002a956c7c77 in ldap_int_thread_pool_wrapper (xpool=0x8a1f00) at tpool.c:478
#7  0x0000003fca90610a in start_thread () from /lib64/tls/libpthread.so.0
#8  0x0000003fca0c68c3 in clone () from /lib64/tls/libc.so.6
#9  0x0000000000000000 in ?? ()


The connection.c code in question is:

static void
connection_close( Connection *c )
{
        ber_socket_t    sd = AC_SOCKET_INVALID;

        assert( connections != NULL );
        assert( c != NULL );

        /* ITS#4667 we may have gotten here twice */
        if ( c->c_conn_state == SLAP_C_INVALID )
                return;

        assert( c->c_struct_state == SLAP_C_USED );
        assert( c->c_conn_state == SLAP_C_CLOSING );
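
To make the three possible outcomes of the checks above explicit, here is a
minimal standalone sketch. It is not slapd code: the demo_* types, the
DEMO_* constants (including DEMO_C_ACTIVE) and classify_close() are
hypothetical stand-ins that only model the two state fields named in the
excerpt. It does not determine which of the asserts actually fired at
connection.c:877.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the state constants in the excerpt.
 * The numeric values are arbitrary and do not match slapd's. */
typedef enum { DEMO_C_INVALID, DEMO_C_ACTIVE, DEMO_C_CLOSING } demo_conn_state;
typedef enum { DEMO_S_UNUSED, DEMO_S_USED } demo_struct_state;

/* Only the two fields that the asserts inspect. */
typedef struct {
    demo_struct_state struct_state;
    demo_conn_state   conn_state;
} demo_conn;

typedef enum {
    CLOSE_SKIPPED,     /* early return: connection already invalid (ITS#4667) */
    CLOSE_OK,          /* both state asserts would pass */
    CLOSE_WOULD_ABORT  /* one of the state asserts would fail */
} close_outcome;

/* Mirror the order of the checks in the excerpt without aborting. */
close_outcome classify_close(const demo_conn *c)
{
    assert(c != NULL);

    if (c->conn_state == DEMO_C_INVALID)
        return CLOSE_SKIPPED;

    if (c->struct_state != DEMO_S_USED || c->conn_state != DEMO_C_CLOSING)
        return CLOSE_WOULD_ABORT;

    return CLOSE_OK;
}
```

In this model, any caller that reaches the close path while the connection
is neither invalid nor marked as closing lands in CLOSE_WOULD_ABORT, which
corresponds to the abort() seen in frame #1 of the backtrace.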



Example from stats level logging has:

Nov 29 00:29:41 new slapd[8930]: conn=397 fd=40 closed (connection lost on write)
Nov 29 00:29:41 new slapd[8930]: conn=335 fd=26 closed (connection lost)
Nov 29 00:29:41 new slapd[8930]: conn=496 op=0 BIND dn="uid=zimbra,cn=admins,cn=zimbra" method=128
Nov 29 00:29:41 new slapd[8930]: conn=496 op=0 BIND dn="uid=zimbra,cn=admins,cn=zimbra" mech=SIMPLE ssf=0
Nov 29 00:29:41 new slapd[8930]: conn=496 op=0 RESULT tag=97 err=0 text=
Nov 29 00:29:41 new slapd[8930]: conn=498 fd=26 ACCEPT from IP=192.168.58.231:45575 (IP=192.168.58.179:389)
Nov 29 00:29:41 new slapd[8930]: connection_read(40): no connection!
Nov 29 00:29:41 new slapd[8930]: conn=498 op=0 BIND dn="uid=zimbra,cn=admins,cn=zimbra" method=128
Nov 29 00:29:41 new slapd[8930]: conn=498 op=0 BIND dn="uid=zimbra,cn=admins,cn=zimbra" mech=SIMPLE ssf=0
Nov 29 00:29:41 new slapd[8930]: conn=498 op=0 RESULT tag=97 err=0 text=

As you can see, we lose a connection on write (conn=397, fd=40) and then try to
read from that same descriptor ("connection_read(40): no connection!").  This
is where the log ends because the assert triggered.
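
For anyone wanting to check their own stats-level logs for the same
sequence, a rough sketch of a scanner follows. It is only an illustration:
find_read_after_close() and MAX_FD are hypothetical names, it assumes each
log entry has been rejoined onto a single line, and it only recognizes the
three message shapes seen in the log above ("fd=N closed", "fd=N ACCEPT",
and "connection_read(N): no connection!").

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_FD 1024  /* arbitrary bound for this sketch */

/* Scan log lines in order. Return the first fd for which a
 * "connection_read(fd): no connection!" message appears after an
 * "fd=<fd> closed" message with no ACCEPT on that fd in between.
 * Return -1 if no such fd is found. */
int find_read_after_close(const char *const *lines, size_t n)
{
    unsigned char closed[MAX_FD] = {0};
    size_t i;

    assert(lines != NULL);

    for (i = 0; i < n; i++) {
        const char *p;
        int fd;

        /* "conn=397 fd=40 closed (...)" or "conn=498 fd=26 ACCEPT ..." */
        p = strstr(lines[i], " fd=");
        if (p != NULL && sscanf(p, " fd=%d", &fd) == 1
                && fd >= 0 && fd < MAX_FD) {
            if (strstr(p, " closed") != NULL)
                closed[fd] = 1;
            else if (strstr(p, " ACCEPT") != NULL)
                closed[fd] = 0;  /* descriptor reused by a new connection */
            continue;
        }

        /* "connection_read(40): no connection!" */
        p = strstr(lines[i], "connection_read(");
        if (p != NULL && sscanf(p, "connection_read(%d)", &fd) == 1
                && fd >= 0 && fd < MAX_FD
                && strstr(p, "no connection!") != NULL
                && closed[fd])
            return fd;
    }
    return -1;
}
```

Fed the entries above in order, this reports fd 40; fd 26 is not reported
because it was accepted again before any read was logged against it.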

OpenLDAP 2.4 is likely affected as well, as the code hasn't really changed
much there.

--Quanah