
Re: (ITS#3546) Sync rep provider and server crash on SIGTERM



On Thu, 2005-02-17 at 07:30 -0800, Howard Chu wrote:
> I've committed fixes for the consumer-side problem to both CVS HEAD and 
> REL_ENG_2_2. If you get a chance, please try them and follow up to the ITS 
> with your results.

I've applied & tested the changes you made in:
config.c 1.349
slap.h 1.644
syncrepl.c 1.183

Indeed, the consumer assert failure seems to be cured. Thanks!

> I'm less optimistic about patching this provider-side problem. It does 
> not occur in HEAD/2.3. The 2.2 provider code is dead, the HEAD/2.3 code 
> is a complete rewrite. Having looked it over, it would require a bit 
> more restructuring to fix the 2.2 code and I'm hard pressed to get the 
> motivation to fix a dead version of the code. Perhaps your workaround is 
> best after all, despite the resource leak. A slow leak is better than a 
> crash, until you get the opportunity to migrate to 2.3...

Ah, this is the main problem, as I use a provider as a failover ldap
server. So I think I will attempt your fixes and my workaround together
(if that sounds sensible).

The 2.3 code is advertised as "alpha", but is it generally expected to
be reasonably well behaved?

Also, will the workaround be put back into the 2.2 releases so that
others don't experience this problem (as the 2.2 releases are the ones
advertised as stable)?

Cheers,
Martin.

> Martin Evans wrote:
> 
> >On Thu, 2005-02-17 at 04:56 -0800, Howard Chu wrote:
> >  
> >
> >>The backtrace you provided was a bit inaccurate; you need to compile 
> >>with "-g" (debugging info present) and without optimization in order to 
> >>get a consistent trace.
> >>    
> >>
> >
> >Yes, they confused me a bit too... here are some new ones with
> >CFLAGS="-g -O0":
> >
> >provider:
> >#0  0x0057f7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
> >#1  0x005bf955 in raise () from /lib/tls/libc.so.6
> >#2  0x005c1319 in abort () from /lib/tls/libc.so.6
> >#3  0x005b8f41 in __assert_fail () from /lib/tls/libc.so.6
> >#4  0x08068c65 in connection2anonymous ()
> >#5  0x080692ec in connection_closing ()
> >#6  0x0806a4b0 in connection_read ()
> >#7  0x0806753f in slapd_daemon_destroy ()
> >#8  0x007df3ae in start_thread () from /lib/tls/libpthread.so.0
> >#9  0x0065eb6e in clone () from /lib/tls/libc.so.6
> >
> >
> >consumer:
> >#0  0x0057f7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
> >#1  0x005bf955 in raise () from /lib/tls/libc.so.6
> >#2  0x005c1319 in abort () from /lib/tls/libc.so.6
> >#3  0x005b8f41 in __assert_fail () from /lib/tls/libc.so.6
> >#4  0x080fb765 in ldap_next_message ()
> >#5  0x080adaa9 in init_syncrepl ()
> >#6  0x080adeb9 in do_syncrepl ()
> >#7  0x080f6c9a in ldap_pvt_thread_pool_destroy ()
> >#8  0x007df3ae in start_thread () from /lib/tls/libpthread.so.0
> >#9  0x0065eb6e in clone () from /lib/tls/libc.so.6
> >
> >
> >  
> >
> >>I've reproduced part of the problem; the provider is not segfaulting,
> >>    
> >>
> >
> >Yes, now that you point it out, nor is mine. I had "ulimit -c unlimited" set
> >on my machine which seems to generate core dumps in this situation. I
> >also get: "Program terminated with signal 6, Aborted." in my gdb output
> >for both core files.
> >
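The distinction drawn above, an abort from a failed assert() rather than a segfault, shows up in the signal that ends the process: gdb's "signal 6" is SIGABRT, whereas a segfault is SIGSEGV. A minimal sketch, assuming a POSIX system; the helper name exit_signal is made up for illustration and nothing here is slapd code:

```python
import signal
import subprocess
import sys


def exit_signal(child_code: str) -> str:
    """Run child_code in a fresh interpreter and name the signal that killed it."""
    proc = subprocess.run([sys.executable, "-c", child_code])
    if proc.returncode >= 0:
        return "exited normally"
    # On POSIX, a negative return code is the negated signal number.
    return signal.Signals(-proc.returncode).name


# os.abort() raises SIGABRT, the same signal a failed C assert() ends in.
aborted = exit_signal("import os; os.abort()")

# Sending SIGSEGV to ourselves stands in for a real invalid memory access.
segfaulted = exit_signal(
    "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"
)

print(aborted, segfaulted)
```

Both signals produce a core file by default when "ulimit -c unlimited" is set, which is why either kind of failure leaves a core behind.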
> >  
> >
> >>it is hitting an assert() at connection.c:687. Specifically, the connection 
> >>is being torn down while someone is still waiting to write on it. This 
> >>happens because there is a large search in progress, and data has piled 
> >>up faster than the network can send it. When you terminate the syncrepl 
> >>client, it sends an Unbind request and then closes its side of the 
> >>connection. (In my test, the syncrepl consumer shut down gracefully, 
> >>though; there was no crash.) The Unbind is received by the provider but 
> >>actually gets Deferred, because it's still waiting for its writes to 
> >>flush. Then the connection actually closes, and the problem occurs. This 
> >>provider-side assert() situation is not unique to syncrepl; it can 
> >>happen whenever any large search request is terminated in the middle. 
> >>We'll definitely have to fix that up.
> >>    
> >>
> >
> >
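The sequence described above (an Unbind deferred because writes are still pending, then the connection closing underneath it) can be modelled with a toy state machine. This is only a sketch under stated assumptions: ToyConnection, its fields and both close methods are hypothetical names for illustration, not slapd's data structures, and tolerant_close is just one conceivable way to avoid the assertion, not the fix used in HEAD.

```python
from dataclasses import dataclass


@dataclass
class ToyConnection:
    """Hypothetical stand-in for a server-side connection; not slapd's struct."""
    pending_writers: int = 0       # threads still blocked sending search results
    unbind_deferred: bool = False  # an Unbind arrived but could not run yet
    closed: bool = False

    def receive_unbind(self) -> None:
        # While writes are outstanding the request is parked, not executed.
        if self.pending_writers > 0:
            self.unbind_deferred = True
        else:
            self.closed = True

    def strict_close(self) -> None:
        # Mirrors the failing pattern: teardown assumes nobody is still writing.
        assert self.pending_writers == 0, "closing with a writer still waiting"
        self.closed = True

    def tolerant_close(self) -> None:
        # One possible alternative: abandon outstanding writes before teardown.
        self.pending_writers = 0
        self.unbind_deferred = False
        self.closed = True


def peer_disconnects(conn: ToyConnection, close) -> bool:
    """Return True if teardown succeeded, False if the assertion fired."""
    try:
        close(conn)
    except AssertionError:
        return False
    return True


# A large search is in flight: one writer is blocked on a slow network.
busy = ToyConnection(pending_writers=1)
busy.receive_unbind()  # deferred, as in "deferring operation: awaiting write"

strict_ok = peer_disconnects(busy, ToyConnection.strict_close)
tolerant_ok = peer_disconnects(busy, ToyConnection.tolerant_close)

# With no pending writes, the Unbind is handled immediately.
idle = ToyConnection()
idle.receive_unbind()
```

In the model, the strict teardown fails for the busy connection and succeeds only once the pending write has been dealt with, which matches the point that any large search terminated midway can hit the same assertion.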
> >Thanks. My logs (level=256) if you need them...
> >
> >Feb 17 15:00:21 mdte slapd[19649]: @(#) $OpenLDAP: slapd 2.2.23 (Feb 17 2005 14:58:42) $        martin@mdte:/home/martin/tasks/openldap/src/openldap-2.2.23/servers/slapd
> >Feb 17 15:00:21 mdte slapd[19649]: bdb_back_initialize: Sleepycat Software: Berkeley DB 4.2.52: (December  3, 2003)
> >Feb 17 15:00:21 mdte slapd[19649]: bdb_db_init: Initializing BDB database
> >Feb 17 15:00:21 mdte slapd[19650]: slapd starting
> >Feb 17 15:00:23 mdte slapd[19659]: @(#) $OpenLDAP: slapd 2.2.23 (Feb 17 2005 14:58:42) $        martin@mdte:/home/martin/tasks/openldap/src/openldap-2.2.23/servers/slapd
> >Feb 17 15:00:23 mdte slapd[19659]: bdb_back_initialize: Sleepycat Software: Berkeley DB 4.2.52: (December  3, 2003)
> >Feb 17 15:00:23 mdte slapd[19659]: bdb_db_init: Initializing BDB database
> >Feb 17 15:00:24 mdte slapd[19660]: slapd starting
> >Feb 17 15:00:24 mdte slapd[19650]: conn=0 fd=11 ACCEPT from IP=127.0.0.1:33091 (IP=127.0.0.1:11389)
> >Feb 17 15:00:24 mdte slapd[19650]: conn=0 op=0 BIND dn="uid=syncrepl,dc=qmul,dc=ac,dc=uk" method=128
> >Feb 17 15:00:24 mdte slapd[19650]: conn=0 op=0 BIND dn="uid=syncrepl,dc=qmul,dc=ac,dc=uk" mech=SIMPLE ssf=0
> >Feb 17 15:00:24 mdte slapd[19650]: conn=0 op=0 RESULT tag=97 err=0 text=
> >Feb 17 15:00:24 mdte slapd[19650]: conn=0 op=1 SRCH base="dc=qmul,dc=ac,dc=uk" scope=2 deref=0 filter="(objectClass=*)"
> >Feb 17 15:00:24 mdte slapd[19650]: conn=0 op=1 SRCH attr=* +
> >Feb 17 15:00:31 mdte slapd[19660]: slapd shutdown: waiting for 2 threads to terminate
> >Feb 17 15:00:31 mdte slapd[19650]: connection_input: conn=0 deferring operation: awaiting write
> >
> >  
> >
> >>I'll play with this a bit more to see if I can reproduce the 
> >>consumer-side crash.
> >>    
> >>
> >
> >Thanks.
> >Martin.
> >
> >  
> >
> >>m.d.t.evans@qmul.ac.uk wrote:
> >>
> >>    
> >>
> >>>Full_Name: Martin Evans
> >>>Version: 2.2.23
> >>>OS: Linux mdte 2.6.10-1.766_FC3.mdte30 #1 Tue Feb 15 13:50:26 GMT 2005 i686 i686 i386 GNU/Linux
> >>>URL: ftp://ftp.openldap.org/incoming/
> >>>Submission from: (NULL) (217.42.8.111)
> >>>
> >>>
> >>>While a syncrepl consumer is being populated, if it is sent a TERM signal, both
> >>>it and the provider segfault. This did not happen in 2.2.17 (I haven't checked
> >>>intermediate versions). This can be reproduced by removing the consumer's bdb
> >>>backend files, starting both the provider and consumer, then sending TERM while
> >>>the consumer is replicating.
> >>>
> >>>My provider has a bdb backend.
> >>>
> >>>My consumer is refreshAndPersist:
> >>>syncrepl rid=140
> >>>        provider=ldap://localhost:11389/
> >>>        type=refreshAndPersist
> >>>        searchbase="<hidden>"
> >>>        filter="(objectClass=*)"
> >>>        scope=sub
> >>>        schemachecking=off
> >>>        updatedn="<hidden>"
> >>>        bindmethod=simple
> >>>        binddn="<hidden>"
> >>>        credentials=<hidden>
> >>>
> >>>For the provider, gdb bt says:
> >>>#0  0x0057f7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
> >>>#1  0x005bf955 in raise () from /lib/tls/libc.so.6
> >>>#2  0x005c1319 in abort () from /lib/tls/libc.so.6
> >>>#3  0x005b8f41 in __assert_fail () from /lib/tls/libc.so.6
> >>>#4  0x08066ea4 in connection2anonymous ()
> >>>#5  0x08067913 in connection_read ()
> >>>#6  0x08064e67 in slapd_daemon_destroy ()
> >>>#7  0x007df3ae in start_thread () from /lib/tls/libpthread.so.0
> >>>#8  0x0065eb6e in clone () from /lib/tls/libc.so.6
> >>>
> >>>And for the consumer:
> >>>#0  0x0057f7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
> >>>#1  0x005bf955 in raise () from /lib/tls/libc.so.6
> >>>#2  0x005c1319 in abort () from /lib/tls/libc.so.6
> >>>#3  0x005b8f41 in __assert_fail () from /lib/tls/libc.so.6
> >>>#4  0x080db4e2 in ldap_next_message ()
> >>>#5  0x0809e8a4 in do_syncrepl ()
> >>>#6  0x080d79ef in ldap_int_thread_pool_shutdown ()
> >>>#7  0x007df3ae in start_thread () from /lib/tls/libpthread.so.0
> >>>#8  0x0065eb6e in clone () from /lib/tls/libc.so.6
> >>>
> >>>This might be related to #3534.
> >>>
> >>>Take care,
> >>>Martin.
> >>>
> 
-- 
-- Dr MDT Evans, Computing Services, Queen Mary, University of London