[Date Prev][Date Next] [Chronological] [Thread] [Top]

(ITS#8436) slapadd hang in bdb_tool_entry_close / ldap_pvt_thread_cond_wait



Full_Name: Mark Bannister
Version: 2.4.30
OS: Solaris 11.3
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (205.228.82.172)


Summary of this thread:
http://www.openldap.org/lists/openldap-technical/201605/msg00079.html

slapadd process on Solaris 11 was hanging for many days (slapd 2.4.30) with the
following stack trace:

# pstack /var/tmp/cores/core.slapd.0.16189.1463492837
core '/var/tmp/cores/core.slapd.0.16189.1463492837' of 16189:  
/usr/sbin/slapadd -q -f /etc/openldap/slapd.conf
------------  lwp# 1 / thread# 1  ---------------
07ff7af9 lwp_park (0, 0, 0)
07ff1193 cond_wait_queue (82ed048, 82ed020, 0, 7ff16b0) + 63
07ff1728 __cond_wait (82ed048, 82ed020, f5a685c8, 7ff1771) + 89
07ff177f cond_wait (82ed048, 82ed020, f5a685f8, 7ff17b7) + 27
07ff17cc pthread_cond_wait (82ed048, 82ed020, f5a68618, 7e56137) + 24
07e5614c ldap_pvt_thread_cond_wait (82ed048, 82ed020, f5a68648, 81b7e29) + 24
081b7e7a bdb_tool_entry_close (8e9d398, ffffffff, f5a68698, 82ecc38) + 62
081a38ec slapadd  (4, f5a68bbc, f5a68bbc, 80dccb1) + cb8
080dcd2c main     (4, f5a68bbc, f5a68bd0, ef60f968) + ac
080dc86d _start   (4, f5a68c8e, f5a68ca0, f5a68ca3, f5a68ca6, 0) + 7d
------------  lwp# 2 / thread# 2  ---------------
07ff7af9 lwp_park (0, 0, 0)
07ff1193 cond_wait_queue (8e6b0e4, 8e6b0cc, 0, 7ff16b0) + 63
07ff1728 __cond_wait (8e6b0e4, 8e6b0cc, d7dffdb8, 7ff1771) + 89
07ff177f cond_wait (8e6b0e4, 8e6b0cc, 8e6b0c8, 7ff17b7) + 27
07ff17cc pthread_cond_wait (8e6b0e4, 8e6b0cc, 0, 7e56137) + 24
07e5614c ldap_pvt_thread_cond_wait (8e6b0e4, 8e6b0cc, d7dfffc8, 7e54bfc) + 24
07e54d66 ldap_int_thread_pool_wrapper (8e6b0c8, 803a000, d7dfffe8, 7ff7769) +
242
07ff77bc _thrp_setup (ee1e0240) + 9d
07ff7aa0 _lwp_start (ee1e0240, 0, 0, 0, 0, 0)

Had to examine disassembly output as Oracle stripped the debug symbols:

> ::walk thread | ::findstack -v
stack pointer for thread 1: f5a68538
[ f5a68538 libc_hwcap1.so.1`__lwp_park+0x19() ]
  f5a68568 libc_hwcap1.so.1`cond_wait_queue+0x63(82ed048, 82ed020, 0, 7ff16b0)
  f5a685a8 libc_hwcap1.so.1`__cond_wait+0x89(82ed048, 82ed020, f5a685c8,
7ff1771)
  f5a685c8 libc_hwcap1.so.1`cond_wait+0x27(82ed048, 82ed020, f5a685f8, 7ff17b7)
  f5a685e8 libc_hwcap1.so.1`pthread_cond_wait+0x24(82ed048, 82ed020, f5a68618,
7e56137)
  f5a68608 libldap_r-2.4.so.2.8.3`ldap_pvt_thread_cond_wait+0x24(82ed048,
82ed020, f5a68648, 81b7e29)
  f5a686 b bdb_tool_entry_close+0x62(8e9d398, ffffffff, f5a68698, 82ecc38)
  f5a68ae8 slapadd+0xcb8(4, f5a68bbc, f5a68bbc, 80dccb1)
  f5a68b88 main+0xac(4, f5a68bbc, f5a68bd0, ef60f968)
  f5a68bb0 _start+0x7d(4, f5a68c8e, f5a68ca0, f5a68ca3, f5a68ca6, 0)

The offending ldap_pvt_thread_cond_wait(), which will never be unblocked because
no other thread will do it, is in bdb_tool_entry_close+0x62.  Checking the
disassembly:

> bdb_tool_entry_close::dis
bdb_tool_entry_close:           pushl  %ebp
bdb_tool_entry_cle%e+1:         movl   %esp,%ebp
bdb_tool_entry_close+3:         pushl  %ebx
bdb_tool_entry_close+4:         pushl  %esi
bdb_tool_entry_close+5:         pushl  %edi
bdb_tool_entry_close+6:         subl   $0x1c,%p%p
bdb_tool_entry_close+9:         andl   $0xfffffff0,%esp
bdb_tool_entry_close+0xc:       call   +0x0     <bdb_tool_entry_close+0x11>
bdb_tool_entry_close+0x11:      popl   %ebx
bdb_tool_entry_close+0x12:      addl   $0x1171d7,%ebx
bdb_tool_entry_close+0x18:      cmpl   $0x0,0x1dfa4(%ebx)
bdb_tool_entry_close+0x1f:      je     +0x17e   <bdb_tool_entry_close+0x1a3>

... we were at 0x62, so we didn't jump to 0x17e, checking source code:

int bdb_tool_entry_close(
        BackendDB *be )
{
        if ( bdb_tool_info ) {

... so bdb_tool_info was not a NULL pointer.

bdb_tool_entry_close+0x25:      movl   0x2dc(%ebx),%eax
bdb_tool_entry_close+0x2b:      movl   $0x1,(%eax)
bdb_tool_entry_close+0x31:      subl   $0xc,%esp
bdb_tool_entry_close+0x34:      leal   0x1e020(%ebx),%ea0D0D
bdb_tool_entry_close+0x3a:      pushl  %eax
bdb_tool_entry_close+0x3b:      call   -0xdcac0
<PLT=libldap_r-2.4.so.2.8.3`ldap_pvt_thread_mutex_lock>
bdb_tool_entry_close+0x40:      addl   $0x10,%esp
bdb_tool_entry_close+0x43:      cmpl   $0x0,0x1dfb8(%ebx)
bdb_tool_entry_close+0x4a:      jne    +0x22    <bdb_tool_entry_close+0x6e>
bdb_tool_entry_close+0x4c:      leal   0x1e048(%ebx),%esi
bdb_tool_entry_close+0x52:      leal   0x1e020(%ebx),%edi
bdb_tool_entry_close+0x58:      subl   $0x8,%esp
bdb_tool_entry_close+0x5b:      pushl  %edi
bdb_tool_entry_close+0x5c:      pushl  %esi
bdb_tool_entry_close+0x5d:      call   -0xdc882
<PLT=libldap_r-2.4.so.2.8.3`ldap_pvt_thread_cond_wait>
bdb_tool_entry_close+0x62:      addl   $0x10,%esp

This section of disassembly seems to match up quite nicely with the next section
of code:

                slapd_shutdown = 1;
#ifdef USE_TRICKLE
                ldap_pvt_thread_mutex_lock( &bdb_tool_trickle_mutex );

                /* trickle thread may not have started yet */
                while ( !bdb_tool_trickle_active )
                        ldap_pvt_thread_cond_wait( &bdb_tool_trickle_cond_end,
                                        &bdb_tool_trickle_mutex );

So bdb_tool_trickle_actives 0%0, and we are blocking on the condition variable
bdb_tool_trickle_cond_end.  The corresponding call to
ldap_pvt_thread_cond_signal occurs in bdb_tool_trickle_task().  The trickle task
function looks like this:

        ldap_pvt_thread_mutex_lock( &bdb_tool_trickle_mutex );
        bdb_tool_trickle_active = 1;
        ldap_pvt_thread_cond_signal( &bdb_tool_trickle_cond_end );
        while ( 1 ) {
                ldap_pvt_thread_cond_wait( &bdb_tool_trickle_cond,
                        &bdb_tool_trickle_mutex );
                if ( slapd_shutdown )
                        break;
                env->memp_trickle( env, 30, &wrote );
        }
        bdb_tool_trickle_active = 0;
        ldap_pvt_thread_cond_signal( &bdb_tool_trickle_cond_end %%3;
        ldap_pvt_thread_mutex_unlock( &bdb_tool_trickle_mutex );

The cond_signal calls are all contained within one big mutex_lock.  So could our
thread have reached its mutex_lock *after* the very last invocation of the
trickle task function, then blocked on its mutex_lock until the trickle task
function returned, allowing our thread to reach a cond_wait which will never get
a signal?  There seems to be a race condition here.

The trickle task is a thread in the pool, submitted by bdb_tool_entry_open():