Re: (ITS#3546) Sync rep provider and server crash on SIGTERM
Since the consumer problem is fixed and the provider appears to be the
same as ITS#3534 I'd like to dedicate this ITS to just the consumer
side. We can continue the provider discussion on #3534. That way we can
close out the consumer issue.
Martin Evans wrote:
>On Thu, 2005-02-17 at 07:30 -0800, Howard Chu wrote:
>
>
>>I've committed fixes for the consumer-side problem to both CVS HEAD and
>>REL_ENG_2_2. If you get a chance, please try them and follow up to the ITS
>>with your results.
>>
>>
>
>I've applied & tested the changes you made in:
>config.c 1.349
>slap.h 1.644
>syncrepl.c 1.183
>
>Indeed, the consumer assert failure seems to be cured. Thanks!
>
>
>
>>I'm less optimistic about patching this provider-side problem. It does
>>not occur in HEAD/2.3. The 2.2 provider code is dead, the HEAD/2.3 code
>>is a complete rewrite. Having looked it over, it would require a bit
>>more restructuring to fix the 2.2 code and I'm hard pressed to get the
>>motivation to fix a dead version of the code. Perhaps your workaround is
>>best after all, despite the resource leak. A slow leak is better than a
>>crash, until you get the opportunity to migrate to 2.3...
>>
>>
>
>Ah, this is the main problem, as I use a provider as a failover ldap
>server. So I think I will attempt your fixes and my workaround together
>(if that sounds sensible).
>
>The 2.3 code is advertised as "alpha", but is it generally expected to
>be reasonably well behaved?
>
>Also, will the workaround be put back into the 2.2 releases so that
>others don't experience this problem (as the 2.2 releases are the ones
>advertised as stable)?
>
>Cheers,
>Martin.
>
>
>
>>Martin Evans wrote:
>>
>>
>>
>>>On Thu, 2005-02-17 at 04:56 -0800, Howard Chu wrote:
>>>
>>>
>>>
>>>
>>>>The backtrace you provided was a bit inaccurate; you need to compile
>>>>with "-g" (debugging info present) and without optimization in order to
>>>>get a consistent trace.
>>>>
>>>>
>>>>
>>>>
>>>Yes, they confused me a bit too... here are some new ones with CFLAGS="-g
>>>-O0":
>>>
>>>provider:
>>>#0 0x0057f7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
>>>#1 0x005bf955 in raise () from /lib/tls/libc.so.6
>>>#2 0x005c1319 in abort () from /lib/tls/libc.so.6
>>>#3 0x005b8f41 in __assert_fail () from /lib/tls/libc.so.6
>>>#4 0x08068c65 in connection2anonymous ()
>>>#5 0x080692ec in connection_closing ()
>>>#6 0x0806a4b0 in connection_read ()
>>>#7 0x0806753f in slapd_daemon_destroy ()
>>>#8 0x007df3ae in start_thread () from /lib/tls/libpthread.so.0
>>>#9 0x0065eb6e in clone () from /lib/tls/libc.so.6
>>>
>>>
>>>consumer:
>>>#0 0x0057f7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
>>>#1 0x005bf955 in raise () from /lib/tls/libc.so.6
>>>#2 0x005c1319 in abort () from /lib/tls/libc.so.6
>>>#3 0x005b8f41 in __assert_fail () from /lib/tls/libc.so.6
>>>#4 0x080fb765 in ldap_next_message ()
>>>#5 0x080adaa9 in init_syncrepl ()
>>>#6 0x080adeb9 in do_syncrepl ()
>>>#7 0x080f6c9a in ldap_pvt_thread_pool_destroy ()
>>>#8 0x007df3ae in start_thread () from /lib/tls/libpthread.so.0
>>>#9 0x0065eb6e in clone () from /lib/tls/libc.so.6
>>>
>>>
>>>
>>>
>>>
>>>
>>>>I've reproduced part of the problem; the provider is not segfaulting,
>>>>
>>>>
>>>>
>>>>
>>>Yes, now that you point it out, neither is mine. I had "ulimit -c unlimited" set
>>>on my machine which seems to generate core dumps in this situation. I
>>>also get: "Program terminated with signal 6, Aborted." in my gdb output
>>>for both core files.
>>>
>>>
>>>
>>>
>>>
>>>>it is hitting an assert() at connection.c:687. Specifically, the connection
>>>>is being torn down while someone is still waiting to write on it. This
>>>>happens because there is a large search in progress, and data has piled
>>>>up faster than the network can send it. When you terminate the syncrepl
>>>>client, it sends an Unbind request and then closes its side of the
>>>>connection. (In my test, the syncrepl consumer shut down gracefully,
>>>>though; there was no crash.) The Unbind is received by the provider but
>>>>actually gets Deferred, because it's still waiting for its writes to
>>>>flush. Then the connection actually closes, and the problem occurs. This
>>>>provider-side assert() situation is not unique to syncrepl, it can
>>>>happen whenever any large search request is terminated in the middle.
>>>>We'll definitely have to fix that up.
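
For anyone following along, the sequence described in the quoted paragraph
above can be sketched as a toy model in C. This is only an illustration under
stated assumptions: toy_conn, its fields, and the toy_* functions are
hypothetical names invented here, not the slapd Connection structure or any
function in connection.c, and the exact expression asserted at
connection.c:687 is not shown in this thread. The sketch only models the idea
that an Unbind is deferred while a writer is still waiting, and that tearing
the connection down in that state violates the invariant.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of a connection. NOT the slapd struct. */
typedef struct toy_conn {
    int  pending_writers;  /* threads still waiting to flush search results */
    bool unbind_deferred;  /* an Unbind arrived while writes were pending */
    bool closed;           /* connection has been torn down */
} toy_conn;

typedef enum { TOY_CLOSED, TOY_DEFERRED } toy_status;

/* Unbind handling: defer the operation if a writer is still waiting. */
toy_status toy_handle_unbind(toy_conn *c)
{
    assert(c != NULL);
    if (c->pending_writers > 0) {
        c->unbind_deferred = true;   /* "deferring operation: awaiting write" */
        return TOY_DEFERRED;
    }
    c->closed = true;
    return TOY_CLOSED;
}

/* The invariant a teardown path would want to hold: nobody is still
 * waiting to write. Checking it (rather than asserting it) lets the
 * caller postpone the teardown instead of aborting. */
bool toy_can_teardown(const toy_conn *c)
{
    assert(c != NULL);
    return c->pending_writers == 0;
}

/* A writer finishes, or gives up because the peer has gone away.
 * Once the last writer is done, a deferred Unbind can complete. */
void toy_writer_done(toy_conn *c)
{
    assert(c != NULL);
    if (c->pending_writers > 0)
        c->pending_writers--;
    if (c->pending_writers == 0 && c->unbind_deferred) {
        c->unbind_deferred = false;
        c->closed = true;
    }
}
```

In this model, a peer that closes its socket while pending_writers is still
nonzero is the case that trips the assert: toy_can_teardown() returns false
until toy_writer_done() has been called for every waiting writer.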
>>>>
>>>>
>>>>
>>>>
>>>Thanks. My logs (level=256) if you need them...
>>>
>>>Feb 17 15:00:21 mdte slapd[19649]: @(#) $OpenLDAP: slapd 2.2.23 (Feb 17 2005 14:58:42) $ martin@mdte:/home/martin/tasks/openldap/src/openldap-2.2.23/servers/slapd
>>>Feb 17 15:00:21 mdte slapd[19649]: bdb_back_initialize: Sleepycat Software: Berkeley DB 4.2.52: (December 3, 2003)
>>>Feb 17 15:00:21 mdte slapd[19649]: bdb_db_init: Initializing BDB database
>>>Feb 17 15:00:21 mdte slapd[19650]: slapd starting
>>>Feb 17 15:00:23 mdte slapd[19659]: @(#) $OpenLDAP: slapd 2.2.23 (Feb 17 2005 14:58:42) $ martin@mdte:/home/martin/tasks/openldap/src/openldap-2.2.23/servers/slapd
>>>Feb 17 15:00:23 mdte slapd[19659]: bdb_back_initialize: Sleepycat Software: Berkeley DB 4.2.52: (December 3, 2003)
>>>Feb 17 15:00:23 mdte slapd[19659]: bdb_db_init: Initializing BDB database
>>>Feb 17 15:00:24 mdte slapd[19660]: slapd starting
>>>Feb 17 15:00:24 mdte slapd[19650]: conn=0 fd=11 ACCEPT from IP=127.0.0.1:33091 (IP=127.0.0.1:11389)
>>>Feb 17 15:00:24 mdte slapd[19650]: conn=0 op=0 BIND dn="uid=syncrepl,dc=qmul,dc=ac,dc=uk" method=128
>>>Feb 17 15:00:24 mdte slapd[19650]: conn=0 op=0 BIND dn="uid=syncrepl,dc=qmul,dc=ac,dc=uk" mech=SIMPLE ssf=0
>>>Feb 17 15:00:24 mdte slapd[19650]: conn=0 op=0 RESULT tag=97 err=0 text=
>>>Feb 17 15:00:24 mdte slapd[19650]: conn=0 op=1 SRCH base="dc=qmul,dc=ac,dc=uk" scope=2 deref=0 filter="(objectClass=*)"
>>>Feb 17 15:00:24 mdte slapd[19650]: conn=0 op=1 SRCH attr=* +
>>>Feb 17 15:00:31 mdte slapd[19660]: slapd shutdown: waiting for 2 threads to terminate
>>>Feb 17 15:00:31 mdte slapd[19650]: connection_input: conn=0 deferring operation: awaiting write
>>>
>>>
>>>
>>>
>>>
>>>>I'll play with this a bit more to see if I can reproduce the
>>>>consumer-side crash.
>>>>
>>>>
>>>>
>>>>
>>>Thanks.
>>>Martin.
>>>
>>>
>>>
>>>
>>>
>>>>m.d.t.evans@qmul.ac.uk wrote:
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>Full_Name: Martin Evans
>>>>>Version: 2.2.23
>>>>>OS: Linux mdte 2.6.10-1.766_FC3.mdte30 #1 Tue Feb 15 13:50:26 GMT 2005 i686 i686 i386 GNU/Linux
>>>>>URL: ftp://ftp.openldap.org/incoming/
>>>>>Submission from: (NULL) (217.42.8.111)
>>>>>
>>>>>
>>>>>While a syncrepl consumer is being populated, if it is sent a TERM signal, both it and
>>>>>the provider segfault. This did not happen in 2.2.17 (I haven't checked
>>>>>intermediate versions). This can be reproduced by removing the consumer's bdb
>>>>>backend files, starting both the provider and consumer, then sending TERM while
>>>>>the consumer is replicating.
>>>>>
>>>>>My provider has a bdb backend.
>>>>>
>>>>>My consumer is refreshAndPersist:
>>>>>syncrepl rid=140
>>>>> provider=ldap://localhost:11389/
>>>>> type=refreshAndPersist
>>>>> searchbase="<hidden>"
>>>>> filter="(objectClass=*)"
>>>>> scope=sub
>>>>> schemachecking=off
>>>>> updatedn="<hidden>"
>>>>> bindmethod=simple
>>>>> binddn="<hidden>"
>>>>> credentials=<hidden>
>>>>>
>>>>>For the provider, gdb bt says:
>>>>>#0 0x0057f7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
>>>>>#1 0x005bf955 in raise () from /lib/tls/libc.so.6
>>>>>#2 0x005c1319 in abort () from /lib/tls/libc.so.6
>>>>>#3 0x005b8f41 in __assert_fail () from /lib/tls/libc.so.6
>>>>>#4 0x08066ea4 in connection2anonymous ()
>>>>>#5 0x08067913 in connection_read ()
>>>>>#6 0x08064e67 in slapd_daemon_destroy ()
>>>>>#7 0x007df3ae in start_thread () from /lib/tls/libpthread.so.0
>>>>>#8 0x0065eb6e in clone () from /lib/tls/libc.so.6
>>>>>
>>>>>And for the consumer:
>>>>>#0 0x0057f7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
>>>>>#1 0x005bf955 in raise () from /lib/tls/libc.so.6
>>>>>#2 0x005c1319 in abort () from /lib/tls/libc.so.6
>>>>>#3 0x005b8f41 in __assert_fail () from /lib/tls/libc.so.6
>>>>>#4 0x080db4e2 in ldap_next_message ()
>>>>>#5 0x0809e8a4 in do_syncrepl ()
>>>>>#6 0x080d79ef in ldap_int_thread_pool_shutdown ()
>>>>>#7 0x007df3ae in start_thread () from /lib/tls/libpthread.so.0
>>>>>#8 0x0065eb6e in clone () from /lib/tls/libc.so.6
>>>>>
>>>>>This might be related to #3534.
>>>>>
>>>>>Take care,
>>>>>Martin.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>
>>
--
-- Howard Chu
Chief Architect, Symas Corp. Director, Highland Sun
http://www.symas.com http://highlandsun.com/hyc
Symas: Premier OpenSource Development and Support