8436 – slapadd hang in bdb_tool_entry_close / ldap_pvt_thread_cond_wait

Issue 8436 - slapadd hang in bdb_tool_entry_close / ldap_pvt_thread_cond_wait

Summary: slapadd hang in bdb_tool_entry_close / ldap_pvt_thread_cond_wait

Status:	VERIFIED WONTFIX

Alias:	None

Product:	OpenLDAP
Classification:	Unclassified
Component:	slapd (show other issues)
Version:	2.4.30
Hardware:	All All

Importance:	--- normal
Target Milestone:	---
Assignee:	OpenLDAP project

URL:
Keywords:

Depends on:
Blocks:

Reported:	2016-06-06 14:41 UTC by mark.bannister@morganstanley.com
Modified:	2020-03-22 23:02 UTC (History)
CC List:	0 users

See Also:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this issue.

Description mark.bannister@morganstanley.com 2016-06-06 14:41:37 UTC

Full_Name: Mark Bannister
Version: 2.4.30
OS: Solaris 11.3
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (205.228.82.172)


Summary of this thread:
http://www.openldap.org/lists/openldap-technical/201605/msg00079.html

slapadd process on Solaris 11 was hanging for many days (slapd 2.4.30) with the
following stack trace:

# pstack /var/tmp/cores/core.slapd.0.16189.1463492837
core '/var/tmp/cores/core.slapd.0.16189.1463492837' of 16189:  
/usr/sbin/slapadd -q -f /etc/openldap/slapd.conf
------------  lwp# 1 / thread# 1  ---------------
07ff7af9 lwp_park (0, 0, 0)
07ff1193 cond_wait_queue (82ed048, 82ed020, 0, 7ff16b0) + 63
07ff1728 __cond_wait (82ed048, 82ed020, f5a685c8, 7ff1771) + 89
07ff177f cond_wait (82ed048, 82ed020, f5a685f8, 7ff17b7) + 27
07ff17cc pthread_cond_wait (82ed048, 82ed020, f5a68618, 7e56137) + 24
07e5614c ldap_pvt_thread_cond_wait (82ed048, 82ed020, f5a68648, 81b7e29) + 24
081b7e7a bdb_tool_entry_close (8e9d398, ffffffff, f5a68698, 82ecc38) + 62
081a38ec slapadd  (4, f5a68bbc, f5a68bbc, 80dccb1) + cb8
080dcd2c main     (4, f5a68bbc, f5a68bd0, ef60f968) + ac
080dc86d _start   (4, f5a68c8e, f5a68ca0, f5a68ca3, f5a68ca6, 0) + 7d
------------  lwp# 2 / thread# 2  ---------------
07ff7af9 lwp_park (0, 0, 0)
07ff1193 cond_wait_queue (8e6b0e4, 8e6b0cc, 0, 7ff16b0) + 63
07ff1728 __cond_wait (8e6b0e4, 8e6b0cc, d7dffdb8, 7ff1771) + 89
07ff177f cond_wait (8e6b0e4, 8e6b0cc, 8e6b0c8, 7ff17b7) + 27
07ff17cc pthread_cond_wait (8e6b0e4, 8e6b0cc, 0, 7e56137) + 24
07e5614c ldap_pvt_thread_cond_wait (8e6b0e4, 8e6b0cc, d7dfffc8, 7e54bfc) + 24
07e54d66 ldap_int_thread_pool_wrapper (8e6b0c8, 803a000, d7dfffe8, 7ff7769) +
242
07ff77bc _thrp_setup (ee1e0240) + 9d
07ff7aa0 _lwp_start (ee1e0240, 0, 0, 0, 0, 0)

Had to examine disassembly output as Oracle stripped the debug symbols:

> ::walk thread | ::findstack -v
stack pointer for thread 1: f5a68538
[ f5a68538 libc_hwcap1.so.1`__lwp_park+0x19() ]
  f5a68568 libc_hwcap1.so.1`cond_wait_queue+0x63(82ed048, 82ed020, 0, 7ff16b0)
  f5a685a8 libc_hwcap1.so.1`__cond_wait+0x89(82ed048, 82ed020, f5a685c8,
7ff1771)
  f5a685c8 libc_hwcap1.so.1`cond_wait+0x27(82ed048, 82ed020, f5a685f8, 7ff17b7)
  f5a685e8 libc_hwcap1.so.1`pthread_cond_wait+0x24(82ed048, 82ed020, f5a68618,
7e56137)
  f5a68608 libldap_r-2.4.so.2.8.3`ldap_pvt_thread_cond_wait+0x24(82ed048,
82ed020, f5a68648, 81b7e29)
  f5a686 b bdb_tool_entry_close+0x62(8e9d398, ffffffff, f5a68698, 82ecc38)
  f5a68ae8 slapadd+0xcb8(4, f5a68bbc, f5a68bbc, 80dccb1)
  f5a68b88 main+0xac(4, f5a68bbc, f5a68bd0, ef60f968)
  f5a68bb0 _start+0x7d(4, f5a68c8e, f5a68ca0, f5a68ca3, f5a68ca6, 0)

The offending ldap_pvt_thread_cond_wait(), which will never be unblocked because
no other thread will do it, is in bdb_tool_entry_close+0x62.  Checking the
disassembly:

> bdb_tool_entry_close::dis
bdb_tool_entry_close:           pushl  %ebp
bdb_tool_entry_cle%e+1:         movl   %esp,%ebp
bdb_tool_entry_close+3:         pushl  %ebx
bdb_tool_entry_close+4:         pushl  %esi
bdb_tool_entry_close+5:         pushl  %edi
bdb_tool_entry_close+6:         subl   $0x1c,%p%p
bdb_tool_entry_close+9:         andl   $0xfffffff0,%esp
bdb_tool_entry_close+0xc:       call   +0x0     <bdb_tool_entry_close+0x11>
bdb_tool_entry_close+0x11:      popl   %ebx
bdb_tool_entry_close+0x12:      addl   $0x1171d7,%ebx
bdb_tool_entry_close+0x18:      cmpl   $0x0,0x1dfa4(%ebx)
bdb_tool_entry_close+0x1f:      je     +0x17e   <bdb_tool_entry_close+0x1a3>

... we were at 0x62, so we didn't jump to 0x17e, checking source code:

int bdb_tool_entry_close(
        BackendDB *be )
{
        if ( bdb_tool_info ) {

... so bdb_tool_info was not a NULL pointer.

bdb_tool_entry_close+0x25:      movl   0x2dc(%ebx),%eax
bdb_tool_entry_close+0x2b:      movl   $0x1,(%eax)
bdb_tool_entry_close+0x31:      subl   $0xc,%esp
bdb_tool_entry_close+0x34:      leal   0x1e020(%ebx),%ea0D0D
bdb_tool_entry_close+0x3a:      pushl  %eax
bdb_tool_entry_close+0x3b:      call   -0xdcac0
<PLT=libldap_r-2.4.so.2.8.3`ldap_pvt_thread_mutex_lock>
bdb_tool_entry_close+0x40:      addl   $0x10,%esp
bdb_tool_entry_close+0x43:      cmpl   $0x0,0x1dfb8(%ebx)
bdb_tool_entry_close+0x4a:      jne    +0x22    <bdb_tool_entry_close+0x6e>
bdb_tool_entry_close+0x4c:      leal   0x1e048(%ebx),%esi
bdb_tool_entry_close+0x52:      leal   0x1e020(%ebx),%edi
bdb_tool_entry_close+0x58:      subl   $0x8,%esp
bdb_tool_entry_close+0x5b:      pushl  %edi
bdb_tool_entry_close+0x5c:      pushl  %esi
bdb_tool_entry_close+0x5d:      call   -0xdc882
<PLT=libldap_r-2.4.so.2.8.3`ldap_pvt_thread_cond_wait>
bdb_tool_entry_close+0x62:      addl   $0x10,%esp

This section of disassembly seems to match up quite nicely with the next section
of code:

                slapd_shutdown = 1;
#ifdef USE_TRICKLE
                ldap_pvt_thread_mutex_lock( &bdb_tool_trickle_mutex );

                /* trickle thread may not have started yet */
                while ( !bdb_tool_trickle_active )
                        ldap_pvt_thread_cond_wait( &bdb_tool_trickle_cond_end,
                                        &bdb_tool_trickle_mutex );

So bdb_tool_trickle_actives 0%0, and we are blocking on the condition variable
bdb_tool_trickle_cond_end.  The corresponding call to
ldap_pvt_thread_cond_signal occurs in bdb_tool_trickle_task().  The trickle task
function looks like this:

        ldap_pvt_thread_mutex_lock( &bdb_tool_trickle_mutex );
        bdb_tool_trickle_active = 1;
        ldap_pvt_thread_cond_signal( &bdb_tool_trickle_cond_end );
        while ( 1 ) {
                ldap_pvt_thread_cond_wait( &bdb_tool_trickle_cond,
                        &bdb_tool_trickle_mutex );
                if ( slapd_shutdown )
                        break;
                env->memp_trickle( env, 30, &wrote );
        }
        bdb_tool_trickle_active = 0;
        ldap_pvt_thread_cond_signal( &bdb_tool_trickle_cond_end %%3;
        ldap_pvt_thread_mutex_unlock( &bdb_tool_trickle_mutex );

The cond_signal calls are all contained within one big mutex_lock.  So could our
thread have reached its mutex_lock *after* the very last invocation of the
trickle task function, then blocked on its mutex_lock until the trickle task
function returned, allowing our thread to reach a cond_wait which will never get
a signal?  There seems to be a race condition here.

The trickle task is a thread in the pool, submitted by bdb_tool_entry_open():

Comment 1 mark.bannister@morganstanley.com 2016-06-06 14:45:14 UTC

The full details of the bug seem to have been truncated.

See:
http://www.openldap.org/lists/openldap-technical/201605/msg00103.html

and:
http://www.openldap.org/lists/openldap-technical/201605/msg00104.html

... for the conclusion.

Best regards,
Mark.


________________________________

NOTICE: Morgan Stanley is not acting as a municipal advisor and the opinions or views contained herein are not intended to be, and do not constitute, advice within the meaning of Section 975 of the Dodd-Frank Wall Street Reform and Consumer Protection Act. If you have received this communication in error, please destroy all electronic and paper copies; do not disclose, use or act upon the information; and notify the sender immediately. Mistransmission is not intended to waive confidentiality or privilege. Morgan Stanley reserves the right, to the extent permitted under applicable law, to monitor electronic communications. This message is subject to terms available at the following link: http://www.morganstanley.com/disclaimers If you cannot access these links, please notify us by reply message and we will send the contents to you. By messaging with Morgan Stanley you consent to the foregoing.

Comment 2 Steffen Moser 2016-10-23 01:42:10 UTC

Dear Mark,

On 05/25/16 2:38 PM, Mark Bannister wrote:
> Can anyone shed light on why my slapadd process on Solaris 11 was
> hanging (slapd 2.4.30)?  I was piping ‘gunzip -c’ through to slapadd
> (from an overnight cron job).  This usually works fine.  I’m guessing
> perhaps the gunzip process encountered an error, maybe an EPIPE was
> sent, but the stack trace many days later from slapadd doesn’t look
> right.  Only 2 threads, and both of them in
> ldap_pvt_thread_cond_wait(), but no other threads so what are they
> waiting for exactly?

I've just come across your problem report which you posted to the
"openldap-its" mailing list in this year's May:

  http://www.openldap.org/lists/openldap-technical/201605/threads.html#00079

Today, I encountered a similar hang of our OpenLDAP 2.4.30 server on a
Solaris 11.3(.11.6.0) host.

It is quite interesting that the function "bdb_tool_entry_close" doesn't
seem to play a role in my case, but "ldap_pvt_thread_cond_wait" and
"ldap_int_thread_pool_wrapper" do seem to.

My questions to your address and to the mailing list are: Has the
bug/problem been fully analyzed and has it been fixed in the meantime?
Mark, did you have another hang of this kind or any similar problems
with your OpenLDAP installation?

Thank you very much in advance for any helpful hints!

Kind regards,
Steffen

-- 
------------------------------------------------------------------------
Dipl.-Inf. Steffen Moser
School of Advanced Professional Studies            Room: 45.3.110
Ulm University                                     Tel: +49.731.50-32407
Albert-Einstein-Allee 45                           Fax: +49.731.50-32409
89081 Ulm, Germany                               http://saps.uni-ulm.de/

Comment 3 Michael Ströder 2016-10-23 08:22:11 UTC

lists@steffen-moser.de wrote:
> Today, I encountered a similar hang of our OpenLDAP 2.4.30 server on a
> Solaris 11.3(.11.6.0) host.

Note that 2.4.30 release is more than 4.5 years old. There have been *many*
fixes since then. [1]

Using free software it's best practice first trying to reproduce the errornous
behaviour with a recent release. Otherwise you're wasting developers' spare time.

Ciao, Michael. (not a OpenLDAP developer)

[1] CHANGES from RE24 branch:
http://www.openldap.org/devel/gitweb.cgi?p=openldap.git;a=blob;f=CHANGES;h=4914db12b0ab2fe2e4f2a91ea46df224a4532cdc;hb=refs/heads/OPENLDAP_REL_ENG_2_4

Comment 4 mark.bannister@morganstanley.com 2016-11-07 10:47:11 UTC

Dear Steffen,

Apologies for the delay in replying.

No I didn't encounter the problem again, at least, I haven't spotted any hanging slapadd processes.  But I'm not looking for them either.  I think it is a rare race condition.  From your description you may well have found a new way to trigger a similar deadlock.

I provided a patch, and another suggested way of fixing the issue I encountered, and as you can see from the openldap-technical archive, I had one short reply from Howard on the subject and nothing thereafter.  Likewise ITS#8436 has sat in Incoming since it was raised in June without comment.

My feeling is that the OpenLDAP code base is now too big for the number of active developers, pushing rare issues like this right to the bottom of their queue.  I doubt it'll get fixed.

Best regards,
Mark.

-----Original Message-----
From: Steffen Moser [mailto:lists@steffen-moser.de] 
Sent: 23 October 2016 02:42
To: Bannister, Mark (Enterprise Infrastructure)
Cc: openldap-its@OpenLDAP.org
Subject: Re: (ITS#8436) slapadd hang in bdb_tool_entry_close / ldap_pvt_thread_cond_wait

Dear Mark,

On 05/25/16 2:38 PM, Mark Bannister wrote:
> Can anyone shed light on why my slapadd process on Solaris 11 was 
> hanging (slapd 2.4.30)?  I was piping ‘gunzip -c’ through to slapadd 
> (from an overnight cron job).  This usually works fine.  I’m guessing 
> perhaps the gunzip process encountered an error, maybe an EPIPE was 
> sent, but the stack trace many days later from slapadd doesn’t look 
> right.  Only 2 threads, and both of them in 
> ldap_pvt_thread_cond_wait(), but no other threads so what are they 
> waiting for exactly?

I've just come across your problem report which you posted to the "openldap-its" mailing list in this year's May:

  http://www.openldap.org/lists/openldap-technical/201605/threads.html#00079

Today, I encountered a similar hang of our OpenLDAP 2.4.30 server on a Solaris 11.3(.11.6.0) host.

It is quite interesting that the function "bdb_tool_entry_close" doesn't seem to play a role in my case, but "ldap_pvt_thread_cond_wait" and "ldap_int_thread_pool_wrapper" do seem to.

My questions to your address and to the mailing list are: Has the bug/problem been fully analyzed and has it been fixed in the meantime?
Mark, did you have another hang of this kind or any similar problems with your OpenLDAP installation?

Thank you very much in advance for any helpful hints!

Kind regards,
Steffen

--
------------------------------------------------------------------------
Dipl.-Inf. Steffen Moser
School of Advanced Professional Studies            Room: 45.3.110
Ulm University                                     Tel: +49.731.50-32407
Albert-Einstein-Allee 45                           Fax: +49.731.50-32409
89081 Ulm, Germany                               http://saps.uni-ulm.de/

--------------------------------------------------------------------------------

NOTICE: Morgan Stanley is not acting as a municipal advisor and the opinions or views contained herein are not intended to be, and do not constitute, advice within the meaning of Section 975 of the Dodd-Frank Wall Street Reform and Consumer Protection Act. If you have received this communication in error, please destroy all electronic and paper copies and notify the sender immediately. Mistransmission is not intended to waive confidentiality or privilege. Morgan Stanley reserves the right, to the extent permitted under applicable law, to monitor electronic communications. This message is subject to terms available at the following link: http://www.morganstanley.com/disclaimers  If you cannot access these links, please notify us by reply message and we will send the contents to you. By communicating with Morgan Stanley you consent to the foregoing and to the voice recording of conversations with personnel of Morgan Stanley.

Comment 5 Quanah Gibson-Mount 2016-11-07 21:32:37 UTC

--On Monday, November 07, 2016 10:47 AM +0000 
Mark.Bannister@morganstanley.com wrote:

Decoded version of Mark's mail:
'Dear Steffen,

Apologies for the delay in replying.

No I didn't encounter the problem again, at least, I haven't spotted any 
hanging slapadd processes.  But I'm not looking for them either.  I think 
it is a rare race condition.  From your de
scription you may well have found a new way to trigger a similar deadlock.

I provided a patch, and another suggested way of fixing the issue I 
encountered, and as you can see from the openldap-technical archive, I had 
one short reply from Howard on the subject and
 nothing thereafter.  Likewise ITS#8436 has sat in Incoming since it was 
raised in June without comment.

My feeling is that the OpenLDAP code base is now too big for the number of 
active developers, pushing rare issues like this right to the bottom of 
their queue.  I doubt it'll get fixed.

Best regards,
Mark.'

(irrelevant parts removed)

--Quanah

--

Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
<http://www.symas.com>

Comment 6 Quanah Gibson-Mount 2017-04-14 19:56:25 UTC

changed notes
moved from Incoming to Software Bugs

Comment 7 Quanah Gibson-Mount 2017-09-06 23:41:31 UTC

changed notes

Comment 8 Quanah Gibson-Mount 2017-09-07 16:04:22 UTC

Hi Mark,

I would note that no patch for this issue has ever been submitted to the 
tracker, nor is there any IPR, etc.  I did track down your suggested fix 
via the mailing list thread.  However, that is not sufficient for it to be 
included in the OpenLDAP project.

Please see <http://www.openldap.org/devel/contributing.html> for the 
correct procedure for submitting your fix if you would like it to be 
included.

Thanks,
Quanah

--

Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
<http://www.symas.com>

Comment 9 Quanah Gibson-Mount 2017-09-07 16:23:41 UTC

changed notes

Comment 10 mark.bannister@morganstanley.com 2017-09-08 07:26:37 UTC

> Hi Mark,
> 
> I would note that no patch for this issue has ever been submitted to the tracker, nor is there any IPR, etc.  I did track
> down your suggested fix via the mailing list thread.  However, that is not sufficient for it to be included in the
> OpenLDAP project.
>
> Please see <http://www.openldap.org/devel/contributing.html> for the correct procedure for submitting your fix if you would like it to be included.
> 
> Thanks,
> Quanah

Hi Quanah,

I was reporting a bug, not a patch.  There were multiple ways of fixing it.  I provided one suggested fix here:

http://www.openldap.org/lists/openldap-technical/201605/msg00103.html

but quickly followed that up with another idea:

http://www.openldap.org/lists/openldap-technical/201605/msg00104.html

Of course moving to LMDB would be the best approach, but we're still not there yet, it takes a long time to switch directory format when you're dealing with hundreds of OpenLDAP instances on Solaris 11 (esp. where Oracle have only recently provided a way of upgrading from 2.4.30 to 2.4.44).

Best regards,
Mark.


--------------------------------------------------------------------------------

NOTICE: Morgan Stanley is not acting as a municipal advisor and the opinions or views contained herein are not intended to be, and do not constitute, advice within the meaning of Section 975 of the Dodd-Frank Wall Street Reform and Consumer Protection Act. If you have received this communication in error, please destroy all electronic and paper copies and notify the sender immediately. Mistransmission is not intended to waive confidentiality or privilege. Morgan Stanley reserves the right, to the extent permitted under applicable law, to monitor electronic communications. This message is subject to terms available at the following link: http://www.morganstanley.com/disclaimers  If you cannot access these links, please notify us by reply message and we will send the contents to you. By communicating with Morgan Stanley you consent to the foregoing and to the voice recording of conversations with personnel of Morgan Stanley.

Comment 11 Quanah Gibson-Mount 2019-06-17 17:40:17 UTC

changed notes
moved from Software Bugs to Incoming

Comment 12 Quanah Gibson-Mount 2020-03-22 23:02:32 UTC

back-bdb/hdb are deprecated for 2.4 and removed for 2.5.