
Re: (ITS#5171) hdb txn_checkpoint failures



richton@nbcs.rutgers.edu wrote:
>> If this is happening even with slapd cleanly shut down then it should also 
>> prevent slapd from restarting, since slapd first attempts to join an existing 
>> environment before trying to create a new one. And that really implies that 
>> the rest of the environment is shot.
> 
> Agreed, but that's a pretty awful condition to have in a long-running 
> slapd process. Without db_stat (easily) working, is there any hope at 
> finding clues as to how this might have happened, or is it just time to 
> rm/slapadd and hope it doesn't happen again?

It doesn't seem like we can get much more info out of this. One more thing to
try would be a full-debug build of libdb, so we can see exactly where it hangs
when trying to join the environment. Looking through the code, I see only one
mutex being acquired to join the environment, and according to your stack trace
the thread is already past that point, but the trace could be lying.

Also the mutex used to lock the environment is a regular mutex, not a 
persistent lock. So when all processes have closed the environment, there 
shouldn't be anything left to conflict with here. So most likely the 
environment data structures are hosed, and the thread is locking against 
itself. Again, we can't really tell without single-stepping thru the BDB 
library code. It may not be worth the effort, but that's your call.
-- 
   -- Howard Chu
   Chief Architect, Symas Corp.  http://www.symas.com
   Director, Highland Sun        http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP     http://www.openldap.org/project/