
Re: (ITS#3546) Sync rep provider and server crash on SIGTERM



On Thu, 2005-02-17 at 04:56 -0800, Howard Chu wrote:
> The backtrace you provided was a bit inaccurate; you need to compile 
> with "-g" (debugging info present) and without optimization in order to 
> get a consistent trace.

Yes, they confused me a bit too... here are some new ones built with
CFLAGS="-g -O0":

provider:
#0  0x0057f7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x005bf955 in raise () from /lib/tls/libc.so.6
#2  0x005c1319 in abort () from /lib/tls/libc.so.6
#3  0x005b8f41 in __assert_fail () from /lib/tls/libc.so.6
#4  0x08068c65 in connection2anonymous ()
#5  0x080692ec in connection_closing ()
#6  0x0806a4b0 in connection_read ()
#7  0x0806753f in slapd_daemon_destroy ()
#8  0x007df3ae in start_thread () from /lib/tls/libpthread.so.0
#9  0x0065eb6e in clone () from /lib/tls/libc.so.6


consumer:
#0  0x0057f7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x005bf955 in raise () from /lib/tls/libc.so.6
#2  0x005c1319 in abort () from /lib/tls/libc.so.6
#3  0x005b8f41 in __assert_fail () from /lib/tls/libc.so.6
#4  0x080fb765 in ldap_next_message ()
#5  0x080adaa9 in init_syncrepl ()
#6  0x080adeb9 in do_syncrepl ()
#7  0x080f6c9a in ldap_pvt_thread_pool_destroy ()
#8  0x007df3ae in start_thread () from /lib/tls/libpthread.so.0
#9  0x0065eb6e in clone () from /lib/tls/libc.so.6


> I've reproduced part of the problem; the provider is not segfaulting,

Yes, now that you point it out, neither is mine. I had "ulimit -c unlimited"
set on my machine, which seems to be why core dumps are generated in this
situation. I also get "Program terminated with signal 6, Aborted." in my gdb
output for both core files.

> it is hitting an assert() at connection.c:687. Specifically, the connection 
> is being torn down while someone is still waiting to write on it. This 
> happens because there is a large search in progress, and data has piled 
> up faster than the network can send it. When you terminate the syncrepl 
> client, it sends an Unbind request and then closes its side of the 
> connection. (In my test, the syncrepl consumer shut down gracefully, 
> though; there was no crash.) The Unbind is received by the provider but 
> actually gets Deferred, because it's still waiting for its writes to 
> flush. Then the connection actually closes, and the problem occurs. This 
> provider-side assert() situation is not unique to syncrepl, it can 
> happen whenever any large search request is terminated in the middle. 
> We'll definitely have to fix that up.
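
To make the sequence above concrete, here is a toy model of it in Python. It is only a sketch under stated assumptions: ToyConnection, pending_writes, and deferred are hypothetical names invented for illustration, not slapd's actual structures or the code at connection.c:687. It shows a request being deferred while writes are outstanding, and a close in that state tripping the check.

```python
class ToyConnection:
    """Hypothetical stand-in for a server-side connection (not slapd code)."""

    def __init__(self, pending_writes=0):
        self.pending_writes = pending_writes  # responses queued, not yet flushed
        self.deferred = []                    # requests parked until writes drain
        self.closed = False

    def receive(self, request):
        # While a writer is still waiting, new requests are parked, not handled.
        if self.pending_writes > 0:
            self.deferred.append(request)
            return "deferred"
        return "handled"

    def close(self):
        # Stands in for the failing assert: teardown with a writer still waiting.
        if self.pending_writes > 0:
            raise AssertionError("connection closed while a writer is still waiting")
        self.closed = True


# A large search has left three responses unsent when the client gives up.
conn = ToyConnection(pending_writes=3)
print(conn.receive("unbind"))   # the Unbind is parked behind the writes
try:
    conn.close()                # the peer has gone away, so teardown starts
except AssertionError as exc:
    print("assert hit:", exc)
```

In this model, the same close() on a connection with no outstanding writes succeeds, which matches the observation that the problem only shows up when a large search is interrupted partway through.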


Thanks. My logs (level=256) if you need them...

Feb 17 15:00:21 mdte slapd[19649]: @(#) $OpenLDAP: slapd 2.2.23 (Feb 17 2005 14:58:42) $        martin@mdte:/home/martin/tasks/openldap/src/openldap-2.2.23/servers/slapd
Feb 17 15:00:21 mdte slapd[19649]: bdb_back_initialize: Sleepycat Software: Berkeley DB 4.2.52: (December  3, 2003)
Feb 17 15:00:21 mdte slapd[19649]: bdb_db_init: Initializing BDB database
Feb 17 15:00:21 mdte slapd[19650]: slapd starting
Feb 17 15:00:23 mdte slapd[19659]: @(#) $OpenLDAP: slapd 2.2.23 (Feb 17 2005 14:58:42) $        martin@mdte:/home/martin/tasks/openldap/src/openldap-2.2.23/servers/slapd
Feb 17 15:00:23 mdte slapd[19659]: bdb_back_initialize: Sleepycat Software: Berkeley DB 4.2.52: (December  3, 2003)
Feb 17 15:00:23 mdte slapd[19659]: bdb_db_init: Initializing BDB database
Feb 17 15:00:24 mdte slapd[19660]: slapd starting
Feb 17 15:00:24 mdte slapd[19650]: conn=0 fd=11 ACCEPT from IP=127.0.0.1:33091 (IP=127.0.0.1:11389)
Feb 17 15:00:24 mdte slapd[19650]: conn=0 op=0 BIND dn="uid=syncrepl,dc=qmul,dc=ac,dc=uk" method=128
Feb 17 15:00:24 mdte slapd[19650]: conn=0 op=0 BIND dn="uid=syncrepl,dc=qmul,dc=ac,dc=uk" mech=SIMPLE ssf=0
Feb 17 15:00:24 mdte slapd[19650]: conn=0 op=0 RESULT tag=97 err=0 text=
Feb 17 15:00:24 mdte slapd[19650]: conn=0 op=1 SRCH base="dc=qmul,dc=ac,dc=uk" scope=2 deref=0 filter="(objectClass=*)"
Feb 17 15:00:24 mdte slapd[19650]: conn=0 op=1 SRCH attr=* +
Feb 17 15:00:31 mdte slapd[19660]: slapd shutdown: waiting for 2 threads to terminate
Feb 17 15:00:31 mdte slapd[19650]: connection_input: conn=0 deferring operation: awaiting write
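
For anyone checking their own logs for the same symptom, a small Python sketch like the following can pull out the "deferring operation" lines. The regular expression is an assumption based only on the line format shown above; find_deferred and SAMPLE are hypothetical names, and other slapd versions or log levels may word the message differently.

```python
import re

# Pattern inferred from the syslog line above; an assumption, not an official format.
DEFER_RE = re.compile(
    r"slapd\[(?P<pid>\d+)\]: connection_input: "
    r"conn=(?P<conn>\d+) deferring operation: (?P<reason>.+)$"
)


def find_deferred(lines):
    """Return (pid, conn, reason) for each 'deferring operation' log line."""
    hits = []
    for line in lines:
        match = DEFER_RE.search(line)
        if match:
            hits.append(
                (int(match["pid"]), int(match["conn"]), match["reason"].strip())
            )
    return hits


# Two lines copied from the log excerpt above.
SAMPLE = [
    "Feb 17 15:00:31 mdte slapd[19660]: slapd shutdown: waiting for 2 threads to terminate",
    "Feb 17 15:00:31 mdte slapd[19650]: connection_input: conn=0 deferring operation: awaiting write",
]

print(find_deferred(SAMPLE))
```

Here the only hit is from pid 19650 (the provider) on conn=0, at the same second the consumer (pid 19660) logs its shutdown.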

> I'll play with this a bit more to see if I can reproduce the 
> consumer-side crash.

Thanks.
Martin.

> 
> m.d.t.evans@qmul.ac.uk wrote:
> 
> >Full_Name: Martin Evans
> >Version: 2.2.23
> >OS: Linux mdte 2.6.10-1.766_FC3.mdte30 #1 Tue Feb 15 13:50:26 GMT 2005 i686 i686 i386 GNU/Linux
> >URL: ftp://ftp.openldap.org/incoming/
> >Submission from: (NULL) (217.42.8.111)
> >
> >
> >While a syncrepl consumer is being populated, if it is sent a TERM signal, both
> >it and the provider segfault. This did not happen in 2.2.17 (I haven't checked
> >intermediate versions). This can be reproduced by removing the consumer's bdb
> >backend files, starting both the provider and consumer, then sending TERM while
> >the consumer is replicating.
> >
> >My provider has a bdb backend.
> >
> >My consumer is refreshAndPersist:
> >syncrepl rid=140
> >         provider=ldap://localhost:11389/
> >         type=refreshAndPersist
> >         searchbase="<hidden>"
> >         filter="(objectClass=*)"
> >         scope=sub
> >         schemachecking=off
> >         updatedn="<hidden>"
> >         bindmethod=simple
> >         binddn="<hidden>"
> >         credentials=<hidden>
> >
> >For the provider, gdb bt says:
> >#0  0x0057f7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
> >#1  0x005bf955 in raise () from /lib/tls/libc.so.6
> >#2  0x005c1319 in abort () from /lib/tls/libc.so.6
> >#3  0x005b8f41 in __assert_fail () from /lib/tls/libc.so.6
> >#4  0x08066ea4 in connection2anonymous ()
> >#5  0x08067913 in connection_read ()
> >#6  0x08064e67 in slapd_daemon_destroy ()
> >#7  0x007df3ae in start_thread () from /lib/tls/libpthread.so.0
> >#8  0x0065eb6e in clone () from /lib/tls/libc.so.6
> >
> >And for the consumer:
> >#0  0x0057f7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
> >#1  0x005bf955 in raise () from /lib/tls/libc.so.6
> >#2  0x005c1319 in abort () from /lib/tls/libc.so.6
> >#3  0x005b8f41 in __assert_fail () from /lib/tls/libc.so.6
> >#4  0x080db4e2 in ldap_next_message ()
> >#5  0x0809e8a4 in do_syncrepl ()
> >#6  0x080d79ef in ldap_int_thread_pool_shutdown ()
> >#7  0x007df3ae in start_thread () from /lib/tls/libpthread.so.0
> >#8  0x0065eb6e in clone () from /lib/tls/libc.so.6
> >
> >This might be related to #3534.
> >
> >Take care,
> >Martin.
> >
> 
-- 
-- Dr MDT Evans, Computing Services, Queen Mary, University of London