Re: (ITS#5391) hdb deadlock



richton@nbcs.rutgers.edu wrote:
> Full_Name: Aaron Richton
> Version: 2.3.40
> OS: Solaris 9
> URL: ftp://ftp.openldap.org/incoming/
> Submission from: (NULL) (128.6.31.135)
>
>
> One hdb backend on one slave died ~21:58 yesterday...
>
> current thread: t@5
>    [1] _libc_poll(0xffffffff4f3ff430, 0x0, 0x3e8, 0x0, 0x0, 0x0), at
> 0xffffffff7f0a741c
>    [2] _select(0x3e8, 0xffffffff7f1bc728, 0xffffffff7f1bc728, 0x0,
> 0xffffffff7f1bc728, 0x0), at 0xffffffff7f05a74c
>    [3] select(0x0, 0x0, 0x0, 0x0, 0xffffffff4f3ff5b0, 0x0), at
> 0xffffffff7e0108e8
> =>[4] __os_sleep(dbenv = 0x1005b2610, secs = 1U, usecs = 0), line 84 in
> "os_sleep.c"
>    [5] __memp_sync_int(dbenv = 0x1005b2610, dbmfp = (nil), trickle_max = 0, op =
> DB_SYNC_CACHE, wrotep = (nil)), line 362 in "mp_sync.c"
>    [6] __memp_sync(dbenv = 0x1005b2610, lsnp = (nil)), line 99 in "mp_sync.c"
>    [7] __txn_checkpoint(dbenv = 0x1005b2610, kbytes = 100000U, minutes = 10U,
> flags = 0), line 1389 in "txn.c"
>    [8] __txn_checkpoint_pp(dbenv = 0x1005b2610, kbytes = 100000U, minutes = 10U,
> flags = 0), line 1288 in "txn.c"
>    [9] hdb_checkpoint(ctx = 0xffffffff4f3ffc30, arg = 0x1004b4c60), line 165 in
> "config.c"
>    [10] ldap_int_thread_pool_wrapper(xpool = 0x10041e500), line 478 in "tpool.c"
>
> (dbx) where
> current thread: t@16
>    [1] _libc_poll(0xffffffff46ffe3e0, 0x0, 0x3e8, 0x0, 0x0, 0x0), at
> 0xffffffff7f0a741c
>    [2] _select(0x3e8, 0xffffffff7f1bc728, 0xffffffff7f1bc728, 0x0,
> 0xffffffff7f1bc728, 0x0), at 0xffffffff7f05a74c
>    [3] select(0x0, 0x0, 0x0, 0x0, 0xffffffff46ffe560, 0x0), at
> 0xffffffff7e0108e8
> =>[4] __os_sleep(dbenv = 0x1005b2610, secs = 1U, usecs = 0), line 84 in
> "os_sleep.c"
>    [5] __memp_sync_int(dbenv = 0x1005b2610, dbmfp = (nil), trickle_max = 0, op =
> DB_SYNC_CACHE, wrotep = (nil)), line 439 in "mp_sync.c"
>    [6] __memp_sync(dbenv = 0x1005b2610, lsnp = (nil)), line 99 in "mp_sync.c"
>    [7] __txn_checkpoint(dbenv = 0x1005b2610, kbytes = 100000U, minutes = 10U,
> flags = 0), line 1389 in "txn.c"
>    [8] __txn_checkpoint_pp(dbenv = 0x1005b2610, kbytes = 100000U, minutes = 10U,
> flags = 0), line 1288 in "txn.c"
>    [9] hdb_delete(op = 0xffffffff46fff618, rs = 0xffffffff46fff088), line 537 in
> "delete.c"
>    [10] syncrepl_entry(si = 0x1004b4e50, op = 0xffffffff46fff618, entry = (nil),
> modlist = 0xffffffff46fff320, syncstate = 3, syncUUID = 0xffffffff46fff3c0,
>                  syncCookie_req = 0xffffffff46fff360, syncCSN =
> 0xffffffff46fff390), line 2006 in "syncrepl.c"
>    [11] do_syncrep2(op = 0xffffffff46fff618, si = 0x1004b4e50), line 731 in
> "syncrepl.c"
>    [12] do_syncrepl(ctx = 0xffffffff46fffc30, arg = 0x1004b5030), line 1095 in
> "syncrepl.c"
>    [13] ldap_int_thread_pool_wrapper(xpool = 0x10041e500), line 478 in "tpool.c"
>
>
> I can't get db_stat to join the environment. If there's anything else that can
> be gleaned from slapd itself, I'd be glad to poke around the core; otherwise,
> I'm off to rm/slapadd...
>
> "This makes sense and shouldn't happen in 2.3.41" would be fine too, but none of
> the changes (to my eye) looked locking related.

Unfortunately no, nothing familiar here. There's nothing in the BDB 
documentation that says two threads are not allowed to call txn_checkpoint 
concurrently, but I suppose it may be excessive to make multiple calls in 
rapid succession.

One thing that I've started doing recently in my configs is to leave the 
kbytes argument of the checkpoint directive at zero, so that only time-based 
checkpoints occur. Time-based checkpoints are done in a dedicated task, so 
only one thread at a time can trigger a checkpoint.
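
As a rough sketch of that change: the traces above show txn_checkpoint being 
called with kbytes = 100000 and minutes = 10, so I'm assuming the current 
setting is "checkpoint 100000 10" in the hdb database section of slapd.conf. 
That is an inference from the traces, not something I've seen in the actual 
config. The time-only variant would then look like this:

```
# Assumed current setting (inferred from the kbytes/minutes values in the traces):
# checkpoint 100000 10

# Time-only: kbytes left at zero, checkpoint every 10 minutes
checkpoint 0 10
```

The 10 minute interval is just carried over from the traces; pick whatever 
interval suits the write load.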
-- 
   -- Howard Chu
   Chief Architect, Symas Corp.  http://www.symas.com
   Director, Highland Sun        http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP     http://www.openldap.org/project/