Full_Name: Kevan Carstensen Version: 2.4.28 OS: gentoo linux URL: ftp://ftp.openldap.org/incoming/ Submission from: (NULL) (134.71.248.23) During a heavy update load, replication from our master to one of our read-only ldap replicas started failing due to locking errors on the replica. Dec 14 04:01:08 pip-dev slapd[12645]: bdb(dc=csupomona,dc=edu): Lock table is out of available lock entries Dec 14 04:01:08 pip-dev slapd[12645]: => bdb_idl_delete_key: c_get failed: Cannot allocate memory (12) Dec 14 04:01:08 pip-dev slapd[12645]: conn=-1 op=0: attribute "memberUid" index delete failure Dec 14 04:01:08 pip-dev slapd[12645]: null_callback : error code 0x50 Dec 14 04:01:08 pip-dev slapd[12645]: syncrepl_entry: rid=001 be_modify failed (80) Dec 14 04:01:08 pip-dev slapd[12645]: do_syncrepl: rid=001 rc 80 retrying (these are due to an incorrect DB_CONFIG on the replica and are not related to this issue; I include them in case their impact on how/when the replica connects to the master are important) After about ten minutes of the replica trying and failing to start its syncrepl again, the master slapd crashed. Nothing in the master slapd logs specifically references the crash or indicates why it happened. Shortly after the replica broke, some unusual messages started showing up in the logs on the master: Dec 14 03:50:11 fosse slapd[22596]: connection_read(65): no connection! Dec 14 03:50:22 fosse slapd[22596]: send_search_entry: conn 7535053 ber write failed. Dec 14 03:50:32 fosse slapd[22596]: send_search_entry: conn 7535061 ber write failed. Dec 14 03:50:42 fosse slapd[22596]: send_search_entry: conn 7535065 ber write failed. Dec 14 03:50:56 fosse slapd[22596]: send_search_entry: conn 7535068 ber write failed. Dec 14 03:51:06 fosse slapd[22596]: connection_read(65): no connection! Dec 14 03:51:06 fosse slapd[22596]: connection_read(65): no connection! These continued until the crash, persisted after the master slapd was successfully restarted, and went away when I turned the broken replica off. I found similar behavior in a development setup similar to our production setup. There is one master, and there are two read-only replicas configured to replicate from the master with the following syncrepl stanza: syncrepl rid=1 provider=ldaps://master-2.ldap.csupomona.edu/ type=refreshAndPersist retry="10 10 60 +" searchbase="dc=csupomona,dc=edu" bindmethod=simple binddn=cn=slapd-syncrepl,ou=user,ou=service,dc=csupomona,dc=edu credentials=__PASSWORD__ (there are two replicas because I couldn't reliably produce a crash with one replica) The replicas have a DB_CONFIG that causes their replication attempts to fail with the same error as we saw in our production setup. The master database is initialized so that the replicas have a lot of changes to catch up on, as was the case when our production setup failed. Even with loglevel -1, there's nothing specifically referencing the crash in the master slapd's logs. There's an assertion failure in the backtrace: slapd: daemon.c:942: slapd_clr_write: Assertion `((slap_daemon[id].sd_index[(s)]) != -1)' failed. Program received signal SIGABRT, Aborted. [Switching to Thread 0x7fffea863700 (LWP 27623)] 0x00007ffff6a95175 in raise () from /lib64/libc.so.6 (gdb) bt full #0 0x00007ffff6a95175 in raise () from /lib64/libc.so.6 No symbol table info available. #1 0x00007ffff6a96590 in abort () from /lib64/libc.so.6 No symbol table info available. #2 0x00007ffff6a8e201 in __assert_fail () from /lib64/libc.so.6 No symbol table info available. #3 0x000000000042ec9b in slapd_clr_write (s=<value optimized out>, wake=0) at daemon.c:942 id = 0 __PRETTY_FUNCTION__ = "slapd_clr_write" #4 0x000000000043419b in connection_write (s=17) at connection.c:1896 c = <value optimized out> op = <value optimized out> __PRETTY_FUNCTION__ = "connection_write" #5 0x000000000043006a in slapd_daemon_task (ptr=<value optimized out>) at daemon.c:2782 rc = 1 fd = 17 w = 6 ns = 1 at = 0 revents = <value optimized out> tvp = <value optimized out> cat = {tv_sec = 1324320980, tv_usec = 0} i = 0 now = <value optimized out> tv = {tv_sec = 1135, tv_usec = 0} rtask = <value optimized out> l = <value optimized out> last_idle_check = 1324317380 ebadf = 0 tid = <value optimized out> #6 0x00007ffff71ed914 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #7 0x00007ffff6b347dd in clone () from /lib64/libc.so.6 No symbol table info available. The master and replicas in my development setup all run 2.4.28. The production master runs 2.4.26. I successfully reproduced the issue on both 2.4.26 and 2.4.28, but only have a backtrace for 2.4.28. If you want a backtrace for 2.4.26, I can get one. Thanks.
I've investigated this issue a little bit more since my initial bug report. With loglevel conns sync in slapd.conf, the following are the last messages written to the log before slapd crashes: Jan 12 09:33:19 shelley slapd[2482]: Jan 12 09:33:19 shelley slapd[2482]: daemon: write active on 15 Jan 12 09:33:19 shelley slapd[2482]: daemon: epoll: listen=9 active_threads=0 tvp=zero Jan 12 09:33:19 shelley slapd[2482]: daemon: epoll: listen=10 active_threads=0 tvp=zero Jan 12 09:33:19 shelley slapd[2482]: daemon: activity on 1 descriptor Jan 12 09:33:19 shelley slapd[2482]: daemon: activity on: Jan 12 09:33:19 shelley slapd[2482]: 15rw Jan 12 09:33:19 shelley slapd[2482]: ber_flush2 failed errno=32 reason="Broken pipe" Jan 12 09:33:19 shelley slapd[2482]: connection_closing: readying conn=1377 sd=15 for close Jan 12 09:33:19 shelley slapd[2482]: send_search_entry: conn 1377 ber write failed. Jan 12 09:33:19 shelley slapd[2482]: connection_resched: attempting closing conn=1377 sd=15 Jan 12 09:33:19 shelley slapd[2482]: daemon: removing 15 Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection! Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection! Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection! Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection! Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection! Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection! Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection! Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection! Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection! Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error Jan 12 09:33:19 shelley slapd[2482]: Jan 12 09:33:19 shelley slapd[2482]: daemon: write active on 15 "daemon: write active on 15" comes from line 2660 or so of daemon.c. Here's that with some context: for ( i = 0; nwfds > 0; i++ ) { ber_socket_t wd; if ( ! SLAP_EVENT_IS_WRITE( i ) ) continue; wd = i; SLAP_EVENT_CLR_WRITE( wd ); nwfds--; Debug(LDAP_DEBUG_CONNS, "daemon: write active on %d\n", wd, 0, 0); /* * NOTE: it is possible that the connection was closed and that * the stream is now inactive. connection_write() must validate * the stream is still active. * * ITS#4338: if the stream is invalid, there is no need to close it * here. It has already been closed in connection.c. */ if ( connection_write( wd ) < 0 ) { if ( SLAP_EVENT_IS_READ( wd)) { SLAP_EVENT_CLR_READ( (unsigned) wd); nrfds--; } } } We do some housekeeping on the descriptor, then call connection_write to take care of the write. Here's connection_write from connection.c: int connection_write(ber_socket_t s) { Connection *c; Operation *op; assert( connections != NULL ); slapd_clr_write( s, 0 ); c = connection_get( s ); if( c == NULL ) { Debug( LDAP_DEBUG_ANY, "connection_write(%ld): no connection!\n", (long)s, 0, 0 ); return -1; } [...] connection_get seems to perform the validation that is mentioned by the comment in daemon.c, but connection_get is called after slapd_clr_write, which is the source of the assertion implicated by the backtrace in the initial bug report: void slapd_clr_write( ber_socket_t s, int wake ) { int id = DAEMON_ID(s); ldap_pvt_thread_mutex_lock( &slap_daemon[id].sd_mutex ); if ( SLAP_SOCK_IS_WRITE( id, s )) { assert( SLAP_SOCK_IS_ACTIVE( id, s )); SLAP_SOCK_CLR_WRITE( id, s ); slap_daemon[id].sd_nwriters--; } ldap_pvt_thread_mutex_unlock( &slap_daemon[id].sd_mutex ); WAKE_LISTENER(id,wake); } I'm not sure if connection_write is supposed to validate that a stream is active before or after calling slapd_clr_write, but it seems like the assertion wouldn't be an issue if that validation were performed before calling slapd_clr_write. To test this thought, I rebuilt openldap 2.4.28 with the following patch: --- openldap-2.4.28/servers/slapd/connection.c 2011-11-25 10:52:29.000000000 -0800 +++ openldap-2.4.28-new/servers/slapd/connection.c 2012-01-12 13:35:45.000000000 -0800 @@ -1893,8 +1893,6 @@ assert( connections != NULL ); - slapd_clr_write( s, 0 ); - c = connection_get( s ); if( c == NULL ) { Debug( LDAP_DEBUG_ANY, @@ -1903,6 +1901,8 @@ return -1; } + slapd_clr_write( s, 0 ); + #ifdef HAVE_TLS if ( c->c_is_tls && c->c_needs_tls_accept ) { connection_return( c ); and tried to reproduce the problem under the same circumstances as reported in my initial bug report. The master slapd tolerated the misconfigured replicas for 5 days without crashing; before, it would crash reliably within a half hour or so. I didn't notice any regressions due to the patch, though the master slapd wasn't exposed to a typical workload during the experiment. Any thoughts on this patch? -- Kevan Carstensen <kacarstensen@csupomona.edu> Operating Systems Analyst, I&IT Systems, Cal Poly Pomona
kacarstensen@csupomona.edu wrote: > I've investigated this issue a little bit more since my initial bug > report. > I'm not sure if connection_write is supposed to validate that a stream > is active before or after calling slapd_clr_write, but it seems like the > assertion wouldn't be an issue if that validation were performed before > calling slapd_clr_write. To test this thought, I rebuilt openldap 2.4.28 > with the following patch: > > --- openldap-2.4.28/servers/slapd/connection.c 2011-11-25 10:52:29.000000000 -0800 > +++ openldap-2.4.28-new/servers/slapd/connection.c 2012-01-12 13:35:45.000000000 -0800 > @@ -1893,8 +1893,6 @@ > > assert( connections != NULL ); > > - slapd_clr_write( s, 0 ); > - > c = connection_get( s ); > if( c == NULL ) { > Debug( LDAP_DEBUG_ANY, > @@ -1903,6 +1901,8 @@ > return -1; > } > > + slapd_clr_write( s, 0 ); > + > #ifdef HAVE_TLS > if ( c->c_is_tls&& c->c_needs_tls_accept ) { > connection_return( c ); > > and tried to reproduce the problem under the same circumstances as > reported in my initial bug report. The master slapd tolerated the > misconfigured replicas for 5 days without crashing; before, it would > crash reliably within a half hour or so. I didn't notice any regressions > due to the patch, though the master slapd wasn't exposed to a typical > workload during the experiment. > > Any thoughts on this patch? Sounds OK to me, committed to git master. Thanks. -- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/
changed notes changed state Open to Test moved from Incoming to Software Bugs
changed notes changed state Test to Release
changed notes changed state Release to Closed
fixed in master fixed in RE24