
Re: (ITS#3546) Sync rep provider and server crash on SIGTERM



Since the consumer problem is fixed and the provider problem appears to be the 
same as ITS#3534, I'd like to dedicate this ITS to just the consumer 
side. We can continue the provider discussion on #3534. That way we can 
close out the consumer issue.

Martin Evans wrote:

>On Thu, 2005-02-17 at 07:30 -0800, Howard Chu wrote:
>  
>
>>I've committed fixes for the consumer-side problem to both CVS HEAD and 
>>REL_ENG_2_2. If you get a chance, please try them and follow up to the ITS 
>>with your results.
>>    
>>
>
>I've applied & tested the changes you made in:
>config.c 1.349
>slap.h 1.644
>syncrepl.c 1.183
>
>Indeed, the consumer assert failure seems to be cured. Thanks!
>
>  
>
>>I'm less optimistic about patching this provider-side problem. It does 
>>not occur in HEAD/2.3. The 2.2 provider code is dead, the HEAD/2.3 code 
>>is a complete rewrite. Having looked it over, it would require a bit 
>>more restructuring to fix the 2.2 code and I'm hard pressed to get the 
>>motivation to fix a dead version of the code. Perhaps your workaround is 
>>best after all, despite the resource leak. A slow leak is better than a 
>>crash, until you get the opportunity to migrate to 2.3...
>>    
>>
>
>Ah, this is the main problem, as I use a provider as a failover ldap
>server. So I think I will attempt your fixes and my workaround together
>(if that sounds sensible).
>
>The 2.3 code is advertised as "alpha", but is it generally expected to
>be reasonably well behaved?
>
>Also, will the workaround be put back into the 2.2 releases so that
>others don't experience this problem (as the 2.2 releases are the ones
>advertised as stable)?
>
>Cheers,
>Martin.
>
>  
>
>>Martin Evans wrote:
>>
>>    
>>
>>>On Thu, 2005-02-17 at 04:56 -0800, Howard Chu wrote:
>>> 
>>>
>>>      
>>>
>>>>The backtrace you provided was a bit inaccurate; you need to compile 
>>>>with "-g" (debugging info present) and without optimization in order to 
>>>>get a consistent trace.
>>>>   
>>>>
>>>>        
>>>>
>>>Yes, they confused me a bit too... here are some new ones built with
>>>CFLAGS="-g -O0":
>>>
>>>provider:
>>>#0  0x0057f7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
>>>#1  0x005bf955 in raise () from /lib/tls/libc.so.6
>>>#2  0x005c1319 in abort () from /lib/tls/libc.so.6
>>>#3  0x005b8f41 in __assert_fail () from /lib/tls/libc.so.6
>>>#4  0x08068c65 in connection2anonymous ()
>>>#5  0x080692ec in connection_closing ()
>>>#6  0x0806a4b0 in connection_read ()
>>>#7  0x0806753f in slapd_daemon_destroy ()
>>>#8  0x007df3ae in start_thread () from /lib/tls/libpthread.so.0
>>>#9  0x0065eb6e in clone () from /lib/tls/libc.so.6
>>>
>>>
>>>consumer:
>>>#0  0x0057f7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
>>>#1  0x005bf955 in raise () from /lib/tls/libc.so.6
>>>#2  0x005c1319 in abort () from /lib/tls/libc.so.6
>>>#3  0x005b8f41 in __assert_fail () from /lib/tls/libc.so.6
>>>#4  0x080fb765 in ldap_next_message ()
>>>#5  0x080adaa9 in init_syncrepl ()
>>>#6  0x080adeb9 in do_syncrepl ()
>>>#7  0x080f6c9a in ldap_pvt_thread_pool_destroy ()
>>>#8  0x007df3ae in start_thread () from /lib/tls/libpthread.so.0
>>>#9  0x0065eb6e in clone () from /lib/tls/libc.so.6
>>>
>>>
>>> 
>>>
>>>      
>>>
>>>>I've reproduced part of the problem; the provider is not segfaulting,
>>>>   
>>>>
>>>>        
>>>>
>>>Yes, now that you point it out, neither is mine. I had "ulimit -c unlimited" set
>>>on my machine, which seems to generate core dumps in this situation. I
>>>also get "Program terminated with signal 6, Aborted." in my gdb output
>>>for both core files.
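
The distinction drawn above (an abort from a failed assert() rather than a
segfault) can be checked in isolation. What follows is a minimal sketch, not
slapd code: the function name signal_from_failed_assert is hypothetical, and
it assumes a POSIX system. A child process trips an assert() and the parent
reports which signal ended it. The child also sets its core-file limit to
zero, the opposite of "ulimit -c unlimited", so the experiment leaves no core
file behind.

```c
#undef NDEBUG               /* make sure assert() is active in this sketch */
#include <assert.h>
#include <signal.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that fails an assert() and return the
 * number of the signal that terminated it. Returns 0 if the child was not
 * killed by a signal, or -1 if fork()/waitpid() failed.
 */
int signal_from_failed_assert(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: disable core files so nothing is left on disk. */
        struct rlimit no_core = { 0, 0 };
        setrlimit(RLIMIT_CORE, &no_core);

        int always_false = 0;
        assert(always_false);   /* prints a diagnostic, then calls abort() */
        _exit(0);               /* not reached while assert() is active */
    }

    /* Parent: collect the child's termination status. */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

The value returned is SIGABRT, which is signal 6 on Linux and matches the
"Program terminated with signal 6, Aborted." line from gdb. A real segfault
would report SIGSEGV instead.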
>>>
>>> 
>>>
>>>      
>>>
>>>>it is hitting an assert() at connection.c:687. Specifically, the connection 
>>>>is being torn down while someone is still waiting to write on it. This 
>>>>happens because there is a large search in progress, and data has piled 
>>>>up faster than the network can send it. When you terminate the syncrepl 
>>>>client, it sends an Unbind request and then closes its side of the 
>>>>connection. (In my test, the syncrepl consumer shut down gracefully, 
>>>>though; there was no crash.) The Unbind is received by the provider but 
>>>>actually gets Deferred, because it's still waiting for its writes to 
>>>>flush. Then the connection actually closes, and the problem occurs. This 
>>>>provider-side assert() situation is not unique to syncrepl; it can 
>>>>happen whenever any large search request is terminated in the middle. 
>>>>We'll definitely have to fix that up.
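
The sequence described above (close requested while a writer is still waiting)
can be modelled in a few lines. This is only an illustrative sketch under
stated assumptions: struct conn, the conn_* functions and the state names are
all hypothetical, they are not slapd's Connection structure or API, and this
is not the fix that was committed. It shows one common way to avoid the
situation: treat the close as a request, defer it while writers are pending,
and let the last writer complete it, so the teardown precondition always
holds.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified connection states (not slapd's). */
enum conn_state { CONN_ACTIVE, CONN_CLOSING, CONN_CLOSED };

struct conn {
    enum conn_state state;
    int writers_waiting;    /* threads still waiting to flush output */
};

/*
 * The peer went away: ask for the connection to be closed.
 * Returns true if it was closed immediately, false if the close had to be
 * deferred because a writer is still waiting on the connection.
 */
bool conn_request_close(struct conn *c)
{
    if (c->state == CONN_CLOSED)
        return true;

    c->state = CONN_CLOSING;
    if (c->writers_waiting > 0)
        return false;       /* deferred: a writer will finish the job */

    c->state = CONN_CLOSED;
    return true;
}

/* A waiting writer finished or gave up; the last one completes the close. */
void conn_writer_done(struct conn *c)
{
    if (c->writers_waiting > 0)
        c->writers_waiting--;

    if (c->state == CONN_CLOSING && c->writers_waiting == 0)
        c->state = CONN_CLOSED;
}

/* Final teardown: only legal once closed with no writer left waiting. */
void conn_destroy(struct conn *c)
{
    assert(c->state == CONN_CLOSED);
    assert(c->writers_waiting == 0);
    c->state = CONN_CLOSED; /* real code would release resources here */
}
```

With one writer pending, conn_request_close() returns false and leaves the
connection in CONN_CLOSING; the later conn_writer_done() call moves it to
CONN_CLOSED, after which conn_destroy() passes its assertions. Calling
conn_destroy() directly while a writer is still waiting would trip the assert,
which is the shape of the failure reported here.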
>>>>   
>>>>
>>>>        
>>>>
>>>Thanks. My logs (level=256) if you need them...
>>>
>>>Feb 17 15:00:21 mdte slapd[19649]: @(#) $OpenLDAP: slapd 2.2.23 (Feb 17 2005 14:58:42) $        martin@mdte:/home/martin/tasks/openldap/src/openldap-2.2.23/servers/slapd
>>>Feb 17 15:00:21 mdte slapd[19649]: bdb_back_initialize: Sleepycat Software: Berkeley DB 4.2.52: (December  3, 2003)
>>>Feb 17 15:00:21 mdte slapd[19649]: bdb_db_init: Initializing BDB database
>>>Feb 17 15:00:21 mdte slapd[19650]: slapd starting
>>>Feb 17 15:00:23 mdte slapd[19659]: @(#) $OpenLDAP: slapd 2.2.23 (Feb 17 2005 14:58:42) $        martin@mdte:/home/martin/tasks/openldap/src/openldap-2.2.23/servers/slapd
>>>Feb 17 15:00:23 mdte slapd[19659]: bdb_back_initialize: Sleepycat Software: Berkeley DB 4.2.52: (December  3, 2003)
>>>Feb 17 15:00:23 mdte slapd[19659]: bdb_db_init: Initializing BDB database
>>>Feb 17 15:00:24 mdte slapd[19660]: slapd starting
>>>Feb 17 15:00:24 mdte slapd[19650]: conn=0 fd=11 ACCEPT from IP=127.0.0.1:33091 (IP=127.0.0.1:11389)
>>>Feb 17 15:00:24 mdte slapd[19650]: conn=0 op=0 BIND dn="uid=syncrepl,dc=qmul,dc=ac,dc=uk" method=128
>>>Feb 17 15:00:24 mdte slapd[19650]: conn=0 op=0 BIND dn="uid=syncrepl,dc=qmul,dc=ac,dc=uk" mech=SIMPLE ssf=0
>>>Feb 17 15:00:24 mdte slapd[19650]: conn=0 op=0 RESULT tag=97 err=0 text=
>>>Feb 17 15:00:24 mdte slapd[19650]: conn=0 op=1 SRCH base="dc=qmul,dc=ac,dc=uk" scope=2 deref=0 filter="(objectClass=*)"
>>>Feb 17 15:00:24 mdte slapd[19650]: conn=0 op=1 SRCH attr=* +
>>>Feb 17 15:00:31 mdte slapd[19660]: slapd shutdown: waiting for 2 threads to terminate
>>>Feb 17 15:00:31 mdte slapd[19650]: connection_input: conn=0 deferring operation: awaiting write
>>>
>>> 
>>>
>>>      
>>>
>>>>I'll play with this a bit more to see if I can reproduce the 
>>>>consumer-side crash.
>>>>   
>>>>
>>>>        
>>>>
>>>Thanks.
>>>Martin.
>>>
>>> 
>>>
>>>      
>>>
>>>>m.d.t.evans@qmul.ac.uk wrote:
>>>>
>>>>   
>>>>
>>>>        
>>>>
>>>>>Full_Name: Martin Evans
>>>>>Version: 2.2.23
>>>>>OS: Linux mdte 2.6.10-1.766_FC3.mdte30 #1 Tue Feb 15 13:50:26 GMT 2005 i686 i686 i386 GNU/Linux
>>>>>URL: ftp://ftp.openldap.org/incoming/
>>>>>Submission from: (NULL) (217.42.8.111)
>>>>>
>>>>>
>>>>>While a syncrepl consumer is being populated, if it is sent a TERM signal, both it and
>>>>>the provider segfault. This did not happen in 2.2.17 (I haven't checked
>>>>>intermediate versions). This can be reproduced by removing the consumer's bdb
>>>>>backend files, starting both the provider and consumer, then sending TERM while
>>>>>the consumer is replicating.
>>>>>
>>>>>My provider has a bdb backend.
>>>>>
>>>>>My consumer is refreshAndPersist:
>>>>>syncrepl rid=140
>>>>>       provider=ldap://localhost:11389/
>>>>>       type=refreshAndPersist
>>>>>       searchbase="<hidden>"
>>>>>       filter="(objectClass=*)"
>>>>>       scope=sub
>>>>>       schemachecking=off
>>>>>       updatedn="<hidden>"
>>>>>       bindmethod=simple
>>>>>       binddn="<hidden>"
>>>>>       credentials=<hidden>
>>>>>
>>>>>For the provider, gdb bt says:
>>>>>#0  0x0057f7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
>>>>>#1  0x005bf955 in raise () from /lib/tls/libc.so.6
>>>>>#2  0x005c1319 in abort () from /lib/tls/libc.so.6
>>>>>#3  0x005b8f41 in __assert_fail () from /lib/tls/libc.so.6
>>>>>#4  0x08066ea4 in connection2anonymous ()
>>>>>#5  0x08067913 in connection_read ()
>>>>>#6  0x08064e67 in slapd_daemon_destroy ()
>>>>>#7  0x007df3ae in start_thread () from /lib/tls/libpthread.so.0
>>>>>#8  0x0065eb6e in clone () from /lib/tls/libc.so.6
>>>>>
>>>>>And for the consumer:
>>>>>#0  0x0057f7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
>>>>>#1  0x005bf955 in raise () from /lib/tls/libc.so.6
>>>>>#2  0x005c1319 in abort () from /lib/tls/libc.so.6
>>>>>#3  0x005b8f41 in __assert_fail () from /lib/tls/libc.so.6
>>>>>#4  0x080db4e2 in ldap_next_message ()
>>>>>#5  0x0809e8a4 in do_syncrepl ()
>>>>>#6  0x080d79ef in ldap_int_thread_pool_shutdown ()
>>>>>#7  0x007df3ae in start_thread () from /lib/tls/libpthread.so.0
>>>>>#8  0x0065eb6e in clone () from /lib/tls/libc.so.6
>>>>>
>>>>>This might be related to #3534.
>>>>>
>>>>>Take care,
>>>>>Martin.
>>>>>


-- 
  -- Howard Chu
  Chief Architect, Symas Corp.       Director, Highland Sun
  http://www.symas.com               http://highlandsun.com/hyc
  Symas: Premier OpenSource Development and Support