
Re: syncrepl consumer locks up (ITS#3263)



On Monday 02 August 2004 19:41, Jong-Hyuk wrote:
> A stack trace when it is locked up and/or when a search to syncrepl<rid>
> entry is performed will help to identify the case. It'll also be of great
> help if you send me the syncrepl part of the slapd.conf file.
> - Jong-Hyuk

I could finally reproduce the problem. These are the steps I took:

- stop provider
- after some time (1 or 2 minutes) restart provider

I then waited for some time (about 5 minutes or more), but the consumer didn't
reestablish the connection.

- stop consumer
- wait a couple of seconds
- restart consumer

The consumer apparently did something and then vanished. I couldn't find a
core file or anything else, and the log isn't very helpful either:

Aug  3 13:52:27 panther slapd[6500]: [ID 542995 local4.debug] slapd shutdown: waiting for 0 threads to terminate
Aug  3 13:52:28 panther slapd[6500]: [ID 486161 local4.debug] slapd stopped.
Aug  3 13:52:39 panther slapd[6897]: [ID 702911 local4.debug] @(#) $OpenLDAP: slapd 2.2.15 (Aug  2 2004 17:32:49) $
Aug  3 13:52:39 panther         kuenne@gazelle:/usr/local/src/ldap/openldap-2.2.15/servers/slapd
Aug  3 13:52:39 panther slapd[6897]: [ID 527854 local4.debug] bdb_initialize: Sleepycat Software: Berkeley DB 4.2.52: (December  3, 2003)
Aug  3 13:52:39 panther last message repeated 1 time
Aug  3 13:52:39 panther slapd[6897]: [ID 294927 local4.debug] bdb_db_init: Initializing bdb database
Aug  3 13:52:40 panther slapd[6898]: [ID 100111 local4.debug] slapd starting


That's all I could find.

- start consumer again


Now it runs, but if I try to access the syncrepl123 entry it hangs. The
stack traces for all the threads follow:

(dbx) where -v
current thread: t@1
=>[1] __lwp_wait(0x4, 0xffbff994, 0x39a94, 0xfefb2cb0, 0x5, 0xffbff92c), at 
0xfef1e748
  [2] lwp_wait(0x4, 0xffbff994, 0x44be4, 0x0, 0x52000, 0xa400), at 0xfefbdd7c
  [3] _thrp_join(0x4, 0x0, 0x0, 0x1, 0x273bcc, 0xffbff994), at 0xfefb9900
  [4] slapd_daemon(0x0, 0x1e1e44, 0x0, 0x0, 0x0, 0x0), at 0x46adc
  [5] main(0x5, 0x2735a8, 0xffbffae8, 0x2cd2f8, 0x1e1c00, 0x0), at 0x3ac94
(dbx) threads
 >    t@1  a  l@1   ?()   running          in  __lwp_wait()
      t@2  a  l@2   reg_thread()   sleep on 0x2740d8  in  __lwp_park()
      t@4  a  l@4   ?()   running          in  _libc_poll()
      t@5  a  l@5   ?()   running          in  ___lwp_cond_wait()
      t@6  a  l@6   ?()   running          in  ___lwp_cond_wait()
      t@7  a  l@7   ?()   sleep on 0x283950  in  __lwp_park()
      t@8  a  l@8   ?()   sleep on 0x283950  in  __lwp_park()
      t@9  a  l@9   ?()   sleep on 0x283950  in  __lwp_park()
     t@10  a l@10   ?()   sleep on 0x283950  in  __lwp_park()
     t@11  a l@11   ?()   sleep on 0x283950  in  __lwp_park()
(dbx) where -v t@2
current thread: t@2
=>[1] __lwp_park(0x0, 0xfedfbe60, 0x0, 0x1, 0x0, 0x0), at 0xfefc5f88
  [2] cond_wait_queue(0x2740d8, 0xfefd8b88, 0x0, 0x0, 0xfef60200, 0xfefd8000), 
at 0xfefc3230
  [3] cond_wait_common(0x0, 0x275d68, 0xfedfbe60, 0x0, 0x0, 0x410fd148), at 
0xfefc37a8
  [4] _ti_cond_timedwait(0x2740d8, 0x275d68, 0xfedfbf98, 0x0, 0x0, 0x0), at 
0xfefc3c38
  [5] _cond_timedwait_cancel(0x2740d8, 0x275d68, 0xfedfbf98, 0xe10, 0x0, 0x0), 
at 0xfefc3c6c
  [6] slp_dequeue_timed(0x275d88, 0xfedfbf98, 0xfedfbf94, 0x0, 0x0, 0x0), at 
0xff387ec8
  [7] reg_thread(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xff3860b0
(dbx) where -v t@3
current thread: t@3
dbx: read of registers from (0xfef60c00) failed -- debugger service failed
(dbx) where -v t@4
current thread: t@4
=>[1] _libc_poll(0xfa3ff2b0, 0x7, 0x1d4c, 0x0, 0x1b58, 0x0), at 0xfef1ca1c
  [2] _libc_select(0x1d4c, 0xfef3f2b0, 0x0, 0xfa3ffee0, 0xfa3ffe60, 0x0), at 
0xfeece980
  [3] select(0x1b, 0xfa3ffe60, 0xfa3ffee0, 0x0, 0xfa3fff60, 0x248f44), at 
0xfefbed74
  [4] 0x456a4(0x23a3fc, 0xfa3fff60, 0x1e73f8, 0x1, 0x8, 0x1b), at 0x456a3
(dbx) where -v t@5
current thread: t@5
=>[1] ___lwp_cond_wait(0x1fe6d49e0, 0x1fe6d49c8, 0x0, 0xffffffffffffffff, 0x0, 
0x0), at 0xfef1e830
  [2] _lwp_cond_wait(0xfe6d49e0, 0xfe6d49c8, 0xffffefd8, 0xfffbd2bc, 
0xfffbd448, 0x0), at 0xfef158ac
  [3] __db_pthread_mutex_lock(0x2e8ea0, 0xfe6d49c8, 0xfe6d49e0, 0x2e8ea0, 0x1, 
0x0), at 0xff24efb8
  [4] __lock_get_internal(0x2e9218, 0x80005abd, 0x0, 0x0, 0x2, 0xfe6d49c8), at 
0xff2f2da4
  [5] __lock_vec(0xf9bff490, 0xfe6f3f40, 0xffffffff, 0xff2f0f04, 0x1, 0x0), at 
0xff2f1058
  [6] bdb_cache_entry_db_relock(0x2e8ea0, 0x80005abd, 0x4cd700, 0x1, 0x0, 
0xf9bff650), at 0xb8a14
  [7] bdb_cache_find_id(0xf9bffb1c, 0x0, 0x2286, 0xf9bff5ec, 0x0, 0xcd), at 
0xb94d4
  [8] bdb_dn2entry(0xf9bffb1c, 0x0, 0x4cd700, 0xf9bff660, 0x0, 0xcd), at 
0xbd4c4
  [9] bdb_entry_get(0xf9bffb1c, 0xf9bffb44, 0x0, 0xf9bff650, 0xf9bff660, 
0xf9bff6c8), at 0xc0a8c
  [10] backend_attribute(0xf9bffb1c, 0x0, 0xf9bffb44, 0x27ddc8, 0xf9bffa84, 
0x0), at 0x55984
  [11] 0x8e95c(0xf9bffb1c, 0x28c590, 0x7b, 0x23bc00, 0x0, 0x3cb1ec), at 
0x8e95b
  [12] do_syncrepl(0x1a, 0x2ebcd0, 0x0, 0x1d9800, 0x273b38, 0x0), at 0x8f958
  [13] 0xec3b0(0x26ebc0, 0xf9bffe20, 0x5, 0x283930, 0x28, 0x283938), at 
0xec3af
(dbx) where -v t@6
current thread: t@6
=>[1] ___lwp_cond_wait(0x1fe6d4eb0, 0x1fe6d4e98, 0x0, 0xffffffffffffffff, 0x0, 
0x0), at 0xfef1e830
  [2] _lwp_cond_wait(0xfe6d4eb0, 0xfe6d4e98, 0x0, 0x0, 0x0, 0x0), at 
0xfef158ac
  [3] __db_pthread_mutex_lock(0x2e8ea0, 0xfe6d4e98, 0xfe6d4eb0, 0x2e8ea0, 0x1, 
0x0), at 0xff24efb8
  [4] __lock_get_internal(0x2e9218, 0xc7, 0x0, 0x0, 0x1, 0xfe6d4e98), at 
0xff2f2da4
  [5] __lock_get_pp(0x2e8ea0, 0xc7, 0x0, 0xf933f6b8, 0x1, 0xf93ff9a4), at 
0xff2f1de8
  [6] 0xb8b28(0x2e8ea0, 0xc7, 0x4cd700, 0x0, 0x0, 0xf93ff9a4), at 0xb8b27
  [7] bdb_cache_find_id(0x4cf068, 0x0, 0x2286, 0xf933f804, 0x0, 0xc7), at 
0xb93f8
  [8] bdb_dn2entry(0x4cf068, 0x0, 0x4cd700, 0xf93ff9b4, 0x1, 0xc7), at 0xbd4c4
  [9] bdb_do_search(0x4cf068, 0xf93ffd84, 0x4cf068, 0x1f, 0x0, 0x0), at 
0x9dc74
  [10] do_search(0x4cf068, 0xf93ffd84, 0x9d744, 0x4cf098, 0x4cf098, 0x0), at 
0x4b2b8
  [11] 0x48da8(0xf93ffe20, 0x4cf068, 0x1d84a8, 0xf93ffd88, 0x37a8f0, 0x63), at 
0x48da7
  [12] 0xec3b0(0x26ebc0, 0xf93ffe20, 0x6, 0x283930, 0x378d48, 0x283938), at 
0xec3af
(dbx) where -v t@7
current thread: t@7
=>[1] __lwp_park(0x0, 0x0, 0x0, 0x1, 0xfefd8000, 0x0), at 0xfefc5f88
  [2] cond_wait_queue(0x283950, 0xfefd8b88, 0x0, 0x0, 0xfef60400, 0xfefd8000), 
at 0xfefc3230
  [3] _cond_wait_cancel(0x283950, 0x283938, 0x0, 0x0, 0x0, 0x0), at 0xfefc39ec
  [4] pthread_cond_wait(0x283950, 0x283938, 0x10, 0x0, 0x0, 0x42), at 
0xfefc3a28
  [5] 0xec494(0x26ebc0, 0xf8bffe20, 0x7, 0x283930, 0x4cda60, 0x283938), at 
0xec493
(dbx) where -v t@8
current thread: t@8
=>[1] __lwp_park(0x0, 0x0, 0x0, 0x1, 0xfefd8000, 0x0), at 0xfefc5f88
  [2] cond_wait_queue(0x283950, 0xfefd8b88, 0x0, 0x0, 0xfef60e00, 0xfefd8000), 
at 0xfefc3230
  [3] _cond_wait_cancel(0x283950, 0x283938, 0x0, 0x0, 0x0, 0x0), at 0xfefc39ec
  [4] pthread_cond_wait(0x283950, 0x283938, 0x10, 0x0, 0x0, 0x60), at 
0xfefc3a28
  [5] 0xec494(0x26ebc0, 0xf83ffe20, 0x8, 0x283930, 0xdd5728, 0x283938), at 
0xec493
(dbx) where -v t@9
current thread: t@9
=>[1] __lwp_park(0x0, 0x0, 0x0, 0x1, 0xfefd8000, 0x0), at 0xfefc5f88
  [2] cond_wait_queue(0x283950, 0xfefd8b88, 0x0, 0x0, 0xfef61000, 0xfefd8000), 
at 0xfefc3230
  [3] _cond_wait_cancel(0x283950, 0x283938, 0x0, 0x0, 0x0, 0x0), at 0xfefc39ec
  [4] pthread_cond_wait(0x283950, 0x283938, 0x10, 0x0, 0x0, 0x60), at 
0xfefc3a28
  [5] 0xec494(0x26ebc0, 0xf7bffe20, 0x9, 0x283930, 0x4cda60, 0x283938), at 
0xec493
(dbx) where -v t@10
current thread: t@10
=>[1] __lwp_park(0x0, 0x0, 0x0, 0x1, 0xfefd8000, 0x0), at 0xfefc5f88
  [2] cond_wait_queue(0x283950, 0xfefd8b88, 0x0, 0x0, 0xfef61200, 0xfefd8000), 
at 0xfefc3230
  [3] _cond_wait_cancel(0x283950, 0x283938, 0x0, 0x0, 0x0, 0x0), at 0xfefc39ec
  [4] pthread_cond_wait(0x283950, 0x283938, 0x10, 0x0, 0x0, 0x60), at 
0xfefc3a28
  [5] 0xec494(0x26ebc0, 0xf73ffe20, 0xa, 0x283930, 0x4cda60, 0x283938), at 
0xec493
(dbx) where -v t@11
current thread: t@11
=>[1] __lwp_park(0x0, 0x0, 0x0, 0x1, 0xfefd8000, 0x0), at 0xfefc5f88
  [2] cond_wait_queue(0x283950, 0xfefd8b88, 0x0, 0x0, 0xfef61400, 0xfefd8000), 
at 0xfefc3230
  [3] _cond_wait_cancel(0x283950, 0x283938, 0x0, 0x0, 0x0, 0x0), at 0xfefc39ec
  [4] pthread_cond_wait(0x283950, 0x283938, 0x10, 0x0, 0x0, 0x60), at 
0xfefc3a28
  [5] 0xec494(0x26ebc0, 0xf6bffe20, 0xb, 0x283930, 0x4cda60, 0x283938), at 
0xec493
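
With this many threads it can help to group the "threads" listing from the
dbx session above by state and blocking function, so the idle pool threads
(all sleeping on the same address) are separated from the two stuck in
___lwp_cond_wait(). The following is only a minimal sketch: the regular
expression and the helper name group_threads are assumptions made for
illustration and are not part of dbx or OpenLDAP.

```python
import re
from collections import defaultdict

# Thread listing copied from the dbx session above.
THREADS = """\
 >    t@1  a  l@1   ?()   running          in  __lwp_wait()
      t@2  a  l@2   reg_thread()   sleep on 0x2740d8  in  __lwp_park()
      t@4  a  l@4   ?()   running          in  _libc_poll()
      t@5  a  l@5   ?()   running          in  ___lwp_cond_wait()
      t@6  a  l@6   ?()   running          in  ___lwp_cond_wait()
      t@7  a  l@7   ?()   sleep on 0x283950  in  __lwp_park()
      t@8  a  l@8   ?()   sleep on 0x283950  in  __lwp_park()
      t@9  a  l@9   ?()   sleep on 0x283950  in  __lwp_park()
     t@10  a l@10   ?()   sleep on 0x283950  in  __lwp_park()
     t@11  a l@11   ?()   sleep on 0x283950  in  __lwp_park()
"""

# Assumed layout: thread id, flag, lwp id, start function, state, "in", function.
ROW = re.compile(
    r"(t@\d+)\s+\S+\s+l@\d+\s+\S+\s+"
    r"(running|sleep on 0x[0-9a-f]+)\s+in\s+(\w+)\(\)"
)


def group_threads(text):
    """Map (state, function) to the list of thread ids in that state."""
    groups = defaultdict(list)
    for line in text.splitlines():
        m = ROW.search(line)
        if m:
            thread, state, func = m.groups()
            groups[(state, func)].append(thread)
    return dict(groups)


if __name__ == "__main__":
    for (state, func), threads in group_threads(THREADS).items():
        print(f"{state:22} {func:20} {' '.join(threads)}")
```

Grouped this way, t@7 through t@11 share one wait address and look like idle
worker threads, which leaves t@5 and t@6 as the interesting ones.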


Apparently t@5 is the syncrepl thread and t@6 is my search thread. The
database locks look like this:

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Locks grouped by object
Locker   Mode      Count Status  ----------------- Object ---------------
      c0 READ          1 HELD    0x33158 len:   5 data: 0x000x00"0x860x00

80005abd WRITE         1 WAIT    0x33158 len:   5 data: 0x000x00"0x860x00

      c7 READ          1 WAIT    0x33158 len:   5 data: 0x000x00"0x860x00


      ba READ          1 HELD    id2entry.bdb             handle        0
      c1 READ          1 HELD    id2entry.bdb             handle        0

      bc READ          1 HELD    dn2id.bdb                handle        0
      c3 READ          1 HELD    dn2id.bdb                handle        0

      c8 READ          1 HELD    objectClass.bdb          handle        0

      d4 READ          1 HELD    uid.bdb                  handle        0

      e0 READ          1 HELD    uidNumber.bdb            handle        0

      cc READ          1 HELD    gidNumber.bdb            handle        0

      da READ          1 HELD    memberUid.bdb            handle        0

      dd READ          1 HELD    automountKey.bdb         handle        0
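
To make the contention in the lock table above easier to see, the rows can
be grouped by object and reduced to those objects that have at least one
waiter. The following is only a minimal sketch under stated assumptions: the
row pattern is guessed from the columns shown above, the helper name
blocked_objects is made up for illustration, and the data column is omitted
from the sample. The locker IDs 80005abd and c7 also show up as arguments in
the bdb_cache_entry_db_relock frame of t@5 and the __lock_get_pp frame of
t@6, so they appear to belong to those two threads.

```python
import re
from collections import defaultdict

# Excerpt of the lock table above (data column left out).
LOCK_TABLE = """\
      c0 READ          1 HELD    0x33158 len:   5
80005abd WRITE         1 WAIT    0x33158 len:   5
      c7 READ          1 WAIT    0x33158 len:   5
      ba READ          1 HELD    id2entry.bdb             handle        0
      c1 READ          1 HELD    id2entry.bdb             handle        0
      bc READ          1 HELD    dn2id.bdb                handle        0
      c3 READ          1 HELD    dn2id.bdb                handle        0
"""

# Assumed columns: locker id, mode, count, status, object name.
ROW = re.compile(r"^\s*([0-9a-f]+)\s+(READ|WRITE)\s+(\d+)\s+(HELD|WAIT)\s+(\S+)")


def blocked_objects(text):
    """Return {object: (holders, waiters)} for objects with a waiter."""
    by_object = defaultdict(lambda: ([], []))
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # skip blank lines and headers
        locker, mode, _count, status, obj = m.groups()
        holders, waiters = by_object[obj]
        (holders if status == "HELD" else waiters).append((locker, mode))
    return {obj: hw for obj, hw in by_object.items() if hw[1]}


if __name__ == "__main__":
    for obj, (holders, waiters) in blocked_objects(LOCK_TABLE).items():
        print(obj, "held by", holders, "waited on by", waiters)
```

On the table above only object 0x33158 remains: locker c0 holds a READ lock
while 80005abd waits for WRITE and c7 waits for READ behind it.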


Other threads are still working; at least the server still responds to
searches. Only the syncrepl stuff hangs.


Karsten.
-- 
At the source of every error which is blamed on the computer you will
find at least two human errors, including the error of blaming it on
the computer.