Full_Name: Mark Bannister Version: 2.4.30 OS: Solaris 11.3 URL: ftp://ftp.openldap.org/incoming/ Submission from: (NULL) (205.228.82.172) Summary of this thread: http://www.openldap.org/lists/openldap-technical/201605/msg00079.html slapadd process on Solaris 11 was hanging for many days (slapd 2.4.30) with the following stack trace: # pstack /var/tmp/cores/core.slapd.0.16189.1463492837 core '/var/tmp/cores/core.slapd.0.16189.1463492837' of 16189: /usr/sbin/slapadd -q -f /etc/openldap/slapd.conf ------------ lwp# 1 / thread# 1 --------------- 07ff7af9 lwp_park (0, 0, 0) 07ff1193 cond_wait_queue (82ed048, 82ed020, 0, 7ff16b0) + 63 07ff1728 __cond_wait (82ed048, 82ed020, f5a685c8, 7ff1771) + 89 07ff177f cond_wait (82ed048, 82ed020, f5a685f8, 7ff17b7) + 27 07ff17cc pthread_cond_wait (82ed048, 82ed020, f5a68618, 7e56137) + 24 07e5614c ldap_pvt_thread_cond_wait (82ed048, 82ed020, f5a68648, 81b7e29) + 24 081b7e7a bdb_tool_entry_close (8e9d398, ffffffff, f5a68698, 82ecc38) + 62 081a38ec slapadd (4, f5a68bbc, f5a68bbc, 80dccb1) + cb8 080dcd2c main (4, f5a68bbc, f5a68bd0, ef60f968) + ac 080dc86d _start (4, f5a68c8e, f5a68ca0, f5a68ca3, f5a68ca6, 0) + 7d ------------ lwp# 2 / thread# 2 --------------- 07ff7af9 lwp_park (0, 0, 0) 07ff1193 cond_wait_queue (8e6b0e4, 8e6b0cc, 0, 7ff16b0) + 63 07ff1728 __cond_wait (8e6b0e4, 8e6b0cc, d7dffdb8, 7ff1771) + 89 07ff177f cond_wait (8e6b0e4, 8e6b0cc, 8e6b0c8, 7ff17b7) + 27 07ff17cc pthread_cond_wait (8e6b0e4, 8e6b0cc, 0, 7e56137) + 24 07e5614c ldap_pvt_thread_cond_wait (8e6b0e4, 8e6b0cc, d7dfffc8, 7e54bfc) + 24 07e54d66 ldap_int_thread_pool_wrapper (8e6b0c8, 803a000, d7dfffe8, 7ff7769) + 242 07ff77bc _thrp_setup (ee1e0240) + 9d 07ff7aa0 _lwp_start (ee1e0240, 0, 0, 0, 0, 0) Had to examine disassembly output as Oracle stripped the debug symbols: > ::walk thread | ::findstack -v stack pointer for thread 1: f5a68538 [ f5a68538 libc_hwcap1.so.1`__lwp_park+0x19() ] f5a68568 libc_hwcap1.so.1`cond_wait_queue+0x63(82ed048, 82ed020, 0, 7ff16b0) f5a685a8 libc_hwcap1.so.1`__cond_wait+0x89(82ed048, 82ed020, f5a685c8, 7ff1771) f5a685c8 libc_hwcap1.so.1`cond_wait+0x27(82ed048, 82ed020, f5a685f8, 7ff17b7) f5a685e8 libc_hwcap1.so.1`pthread_cond_wait+0x24(82ed048, 82ed020, f5a68618, 7e56137) f5a68608 libldap_r-2.4.so.2.8.3`ldap_pvt_thread_cond_wait+0x24(82ed048, 82ed020, f5a68648, 81b7e29) f5a686 b bdb_tool_entry_close+0x62(8e9d398, ffffffff, f5a68698, 82ecc38) f5a68ae8 slapadd+0xcb8(4, f5a68bbc, f5a68bbc, 80dccb1) f5a68b88 main+0xac(4, f5a68bbc, f5a68bd0, ef60f968) f5a68bb0 _start+0x7d(4, f5a68c8e, f5a68ca0, f5a68ca3, f5a68ca6, 0) The offending ldap_pvt_thread_cond_wait(), which will never be unblocked because no other thread will do it, is in bdb_tool_entry_close+0x62. Checking the disassembly: > bdb_tool_entry_close::dis bdb_tool_entry_close: pushl %ebp bdb_tool_entry_cle%e+1: movl %esp,%ebp bdb_tool_entry_close+3: pushl %ebx bdb_tool_entry_close+4: pushl %esi bdb_tool_entry_close+5: pushl %edi bdb_tool_entry_close+6: subl $0x1c,%p%p bdb_tool_entry_close+9: andl $0xfffffff0,%esp bdb_tool_entry_close+0xc: call +0x0 <bdb_tool_entry_close+0x11> bdb_tool_entry_close+0x11: popl %ebx bdb_tool_entry_close+0x12: addl $0x1171d7,%ebx bdb_tool_entry_close+0x18: cmpl $0x0,0x1dfa4(%ebx) bdb_tool_entry_close+0x1f: je +0x17e <bdb_tool_entry_close+0x1a3> ... we were at 0x62, so we didn't jump to 0x17e, checking source code: int bdb_tool_entry_close( BackendDB *be ) { if ( bdb_tool_info ) { ... so bdb_tool_info was not a NULL pointer. bdb_tool_entry_close+0x25: movl 0x2dc(%ebx),%eax bdb_tool_entry_close+0x2b: movl $0x1,(%eax) bdb_tool_entry_close+0x31: subl $0xc,%esp bdb_tool_entry_close+0x34: leal 0x1e020(%ebx),%ea0D0D bdb_tool_entry_close+0x3a: pushl %eax bdb_tool_entry_close+0x3b: call -0xdcac0 <PLT=libldap_r-2.4.so.2.8.3`ldap_pvt_thread_mutex_lock> bdb_tool_entry_close+0x40: addl $0x10,%esp bdb_tool_entry_close+0x43: cmpl $0x0,0x1dfb8(%ebx) bdb_tool_entry_close+0x4a: jne +0x22 <bdb_tool_entry_close+0x6e> bdb_tool_entry_close+0x4c: leal 0x1e048(%ebx),%esi bdb_tool_entry_close+0x52: leal 0x1e020(%ebx),%edi bdb_tool_entry_close+0x58: subl $0x8,%esp bdb_tool_entry_close+0x5b: pushl %edi bdb_tool_entry_close+0x5c: pushl %esi bdb_tool_entry_close+0x5d: call -0xdc882 <PLT=libldap_r-2.4.so.2.8.3`ldap_pvt_thread_cond_wait> bdb_tool_entry_close+0x62: addl $0x10,%esp This section of disassembly seems to match up quite nicely with the next section of code: slapd_shutdown = 1; #ifdef USE_TRICKLE ldap_pvt_thread_mutex_lock( &bdb_tool_trickle_mutex ); /* trickle thread may not have started yet */ while ( !bdb_tool_trickle_active ) ldap_pvt_thread_cond_wait( &bdb_tool_trickle_cond_end, &bdb_tool_trickle_mutex ); So bdb_tool_trickle_actives 0%0, and we are blocking on the condition variable bdb_tool_trickle_cond_end. The corresponding call to ldap_pvt_thread_cond_signal occurs in bdb_tool_trickle_task(). The trickle task function looks like this: ldap_pvt_thread_mutex_lock( &bdb_tool_trickle_mutex ); bdb_tool_trickle_active = 1; ldap_pvt_thread_cond_signal( &bdb_tool_trickle_cond_end ); while ( 1 ) { ldap_pvt_thread_cond_wait( &bdb_tool_trickle_cond, &bdb_tool_trickle_mutex ); if ( slapd_shutdown ) break; env->memp_trickle( env, 30, &wrote ); } bdb_tool_trickle_active = 0; ldap_pvt_thread_cond_signal( &bdb_tool_trickle_cond_end %%3; ldap_pvt_thread_mutex_unlock( &bdb_tool_trickle_mutex ); The cond_signal calls are all contained within one big mutex_lock. So could our thread have reached its mutex_lock *after* the very last invocation of the trickle task function, then blocked on its mutex_lock until the trickle task function returned, allowing our thread to reach a cond_wait which will never get a signal? There seems to be a race condition here. The trickle task is a thread in the pool, submitted by bdb_tool_entry_open():
The full details of the bug seem to have been truncated. See: http://www.openldap.org/lists/openldap-technical/201605/msg00103.html and: http://www.openldap.org/lists/openldap-technical/201605/msg00104.html ... for the conclusion. Best regards, Mark. ________________________________ NOTICE: Morgan Stanley is not acting as a municipal advisor and the opinions or views contained herein are not intended to be, and do not constitute, advice within the meaning of Section 975 of the Dodd-Frank Wall Street Reform and Consumer Protection Act. If you have received this communication in error, please destroy all electronic and paper copies; do not disclose, use or act upon the information; and notify the sender immediately. Mistransmission is not intended to waive confidentiality or privilege. Morgan Stanley reserves the right, to the extent permitted under applicable law, to monitor electronic communications. This message is subject to terms available at the following link: http://www.morganstanley.com/disclaimers If you cannot access these links, please notify us by reply message and we will send the contents to you. By messaging with Morgan Stanley you consent to the foregoing.
Dear Mark, On 05/25/16 2:38 PM, Mark Bannister wrote: > Can anyone shed light on why my slapadd process on Solaris 11 was > hanging (slapd 2.4.30)? I was piping ‘gunzip -c’ through to slapadd > (from an overnight cron job). This usually works fine. I’m guessing > perhaps the gunzip process encountered an error, maybe an EPIPE was > sent, but the stack trace many days later from slapadd doesn’t look > right. Only 2 threads, and both of them in > ldap_pvt_thread_cond_wait(), but no other threads so what are they > waiting for exactly? I've just come across your problem report which you posted to the "openldap-its" mailing list in this year's May: http://www.openldap.org/lists/openldap-technical/201605/threads.html#00079 Today, I encountered a similar hang of our OpenLDAP 2.4.30 server on a Solaris 11.3(.11.6.0) host. It is quite interesting that the function "bdb_tool_entry_close" doesn't seem to play a role in my case, but "ldap_pvt_thread_cond_wait" and "ldap_int_thread_pool_wrapper" do seem to. My questions to your address and to the mailing list are: Has the bug/problem been fully analyzed and has it been fixed in the meantime? Mark, did you have another hang of this kind or any similar problems with your OpenLDAP installation? Thank you very much in advance for any helpful hints! Kind regards, Steffen -- ------------------------------------------------------------------------ Dipl.-Inf. Steffen Moser School of Advanced Professional Studies Room: 45.3.110 Ulm University Tel: +49.731.50-32407 Albert-Einstein-Allee 45 Fax: +49.731.50-32409 89081 Ulm, Germany http://saps.uni-ulm.de/
lists@steffen-moser.de wrote: > Today, I encountered a similar hang of our OpenLDAP 2.4.30 server on a > Solaris 11.3(.11.6.0) host. Note that 2.4.30 release is more than 4.5 years old. There have been *many* fixes since then. [1] Using free software it's best practice first trying to reproduce the errornous behaviour with a recent release. Otherwise you're wasting developers' spare time. Ciao, Michael. (not a OpenLDAP developer) [1] CHANGES from RE24 branch: http://www.openldap.org/devel/gitweb.cgi?p=openldap.git;a=blob;f=CHANGES;h=4914db12b0ab2fe2e4f2a91ea46df224a4532cdc;hb=refs/heads/OPENLDAP_REL_ENG_2_4
Dear Steffen, Apologies for the delay in replying. No I didn't encounter the problem again, at least, I haven't spotted any hanging slapadd processes. But I'm not looking for them either. I think it is a rare race condition. From your description you may well have found a new way to trigger a similar deadlock. I provided a patch, and another suggested way of fixing the issue I encountered, and as you can see from the openldap-technical archive, I had one short reply from Howard on the subject and nothing thereafter. Likewise ITS#8436 has sat in Incoming since it was raised in June without comment. My feeling is that the OpenLDAP code base is now too big for the number of active developers, pushing rare issues like this right to the bottom of their queue. I doubt it'll get fixed. Best regards, Mark. -----Original Message----- From: Steffen Moser [mailto:lists@steffen-moser.de] Sent: 23 October 2016 02:42 To: Bannister, Mark (Enterprise Infrastructure) Cc: openldap-its@OpenLDAP.org Subject: Re: (ITS#8436) slapadd hang in bdb_tool_entry_close / ldap_pvt_thread_cond_wait Dear Mark, On 05/25/16 2:38 PM, Mark Bannister wrote: > Can anyone shed light on why my slapadd process on Solaris 11 was > hanging (slapd 2.4.30)? I was piping ‘gunzip -c’ through to slapadd > (from an overnight cron job). This usually works fine. I’m guessing > perhaps the gunzip process encountered an error, maybe an EPIPE was > sent, but the stack trace many days later from slapadd doesn’t look > right. Only 2 threads, and both of them in > ldap_pvt_thread_cond_wait(), but no other threads so what are they > waiting for exactly? I've just come across your problem report which you posted to the "openldap-its" mailing list in this year's May: http://www.openldap.org/lists/openldap-technical/201605/threads.html#00079 Today, I encountered a similar hang of our OpenLDAP 2.4.30 server on a Solaris 11.3(.11.6.0) host. It is quite interesting that the function "bdb_tool_entry_close" doesn't seem to play a role in my case, but "ldap_pvt_thread_cond_wait" and "ldap_int_thread_pool_wrapper" do seem to. My questions to your address and to the mailing list are: Has the bug/problem been fully analyzed and has it been fixed in the meantime? Mark, did you have another hang of this kind or any similar problems with your OpenLDAP installation? Thank you very much in advance for any helpful hints! Kind regards, Steffen -- ------------------------------------------------------------------------ Dipl.-Inf. Steffen Moser School of Advanced Professional Studies Room: 45.3.110 Ulm University Tel: +49.731.50-32407 Albert-Einstein-Allee 45 Fax: +49.731.50-32409 89081 Ulm, Germany http://saps.uni-ulm.de/ -------------------------------------------------------------------------------- NOTICE: Morgan Stanley is not acting as a municipal advisor and the opinions or views contained herein are not intended to be, and do not constitute, advice within the meaning of Section 975 of the Dodd-Frank Wall Street Reform and Consumer Protection Act. If you have received this communication in error, please destroy all electronic and paper copies and notify the sender immediately. Mistransmission is not intended to waive confidentiality or privilege. Morgan Stanley reserves the right, to the extent permitted under applicable law, to monitor electronic communications. This message is subject to terms available at the following link: http://www.morganstanley.com/disclaimers If you cannot access these links, please notify us by reply message and we will send the contents to you. By communicating with Morgan Stanley you consent to the foregoing and to the voice recording of conversations with personnel of Morgan Stanley.
--On Monday, November 07, 2016 10:47 AM +0000 Mark.Bannister@morganstanley.com wrote: Decoded version of Mark's mail: 'Dear Steffen, Apologies for the delay in replying. No I didn't encounter the problem again, at least, I haven't spotted any hanging slapadd processes. But I'm not looking for them either. I think it is a rare race condition. From your de scription you may well have found a new way to trigger a similar deadlock. I provided a patch, and another suggested way of fixing the issue I encountered, and as you can see from the openldap-technical archive, I had one short reply from Howard on the subject and nothing thereafter. Likewise ITS#8436 has sat in Incoming since it was raised in June without comment. My feeling is that the OpenLDAP code base is now too big for the number of active developers, pushing rare issues like this right to the bottom of their queue. I doubt it'll get fixed. Best regards, Mark.' (irrelevant parts removed) --Quanah -- Quanah Gibson-Mount Product Architect Symas Corporation Packaged, certified, and supported LDAP solutions powered by OpenLDAP: <http://www.symas.com>
changed notes moved from Incoming to Software Bugs
changed notes
Hi Mark, I would note that no patch for this issue has ever been submitted to the tracker, nor is there any IPR, etc. I did track down your suggested fix via the mailing list thread. However, that is not sufficient for it to be included in the OpenLDAP project. Please see <http://www.openldap.org/devel/contributing.html> for the correct procedure for submitting your fix if you would like it to be included. Thanks, Quanah -- Quanah Gibson-Mount Product Architect Symas Corporation Packaged, certified, and supported LDAP solutions powered by OpenLDAP: <http://www.symas.com>
> Hi Mark, > > I would note that no patch for this issue has ever been submitted to the tracker, nor is there any IPR, etc. I did track > down your suggested fix via the mailing list thread. However, that is not sufficient for it to be included in the > OpenLDAP project. > > Please see <http://www.openldap.org/devel/contributing.html> for the correct procedure for submitting your fix if you would like it to be included. > > Thanks, > Quanah Hi Quanah, I was reporting a bug, not a patch. There were multiple ways of fixing it. I provided one suggested fix here: http://www.openldap.org/lists/openldap-technical/201605/msg00103.html but quickly followed that up with another idea: http://www.openldap.org/lists/openldap-technical/201605/msg00104.html Of course moving to LMDB would be the best approach, but we're still not there yet, it takes a long time to switch directory format when you're dealing with hundreds of OpenLDAP instances on Solaris 11 (esp. where Oracle have only recently provided a way of upgrading from 2.4.30 to 2.4.44). Best regards, Mark. -------------------------------------------------------------------------------- NOTICE: Morgan Stanley is not acting as a municipal advisor and the opinions or views contained herein are not intended to be, and do not constitute, advice within the meaning of Section 975 of the Dodd-Frank Wall Street Reform and Consumer Protection Act. If you have received this communication in error, please destroy all electronic and paper copies and notify the sender immediately. Mistransmission is not intended to waive confidentiality or privilege. Morgan Stanley reserves the right, to the extent permitted under applicable law, to monitor electronic communications. This message is subject to terms available at the following link: http://www.morganstanley.com/disclaimers If you cannot access these links, please notify us by reply message and we will send the contents to you. By communicating with Morgan Stanley you consent to the foregoing and to the voice recording of conversations with personnel of Morgan Stanley.
changed notes moved from Software Bugs to Incoming
back-bdb/hdb are deprecated for 2.4 and removed for 2.5.