Re: (ITS#5171) hdb txn_checkpoint failures



Aaron Richton wrote:
>> itself. Again, we can't really tell without single-stepping thru the BDB 
>> library code. It may not be worth the effort, but that's your call.
> 
> The lock was
> 
> env_region.c:290         MUTEX_LOCK(dbenv, &renv->mutex);
> 
> but that wasn't making much sense....and after a couple minutes in dbx I 
> realized that I've been killing myself with the attempts at db_stat. 
> Yesterday's attempts were running db_* binaries with a wrong (but 
> compatible) ABI. It'd be nice if Sleepycat had some more/earlier checks 
> for that, but oh well...

Kinda figured that that's what happened.

> So anyway, I corrupted base2/slave4 by running the wrong db_stat, but that 
> left three other bases on slave4 and all three bases on slave6. I ran 
> db_stat -l on them, the output is:
> 
> https://www.nbcs.rutgers.edu/~richton/its5171_dbstatl

> BTW, this ABI screwup shouldn't be the root cause of the failures...I 
> haven't tried any db tools until the course of debugging this. These are 
> AUTOREMOVE, so db_archive is unlikely, for instance.

It's still rather suspicious that slave4 and slave6 both had identical log 
status for base1 (1/188113) but different requested locations (1/8730339 vs
1/8730401). If they're identically configured slaves then they ought to be in 
lock-step. Then again, obviously they're not identical since slave6 doesn't 
show base4 in your log.
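
To put a number on that discrepancy, here is a minimal sketch that treats each value as a "file/offset" pair and compares them. The helper names (LSN, parse_lsn, lsn_gap) are made up for illustration and are not part of any BDB tool, and I'm assuming the two requested locations are listed in slave4, slave6 order.

```python
from typing import NamedTuple, Optional


class LSN(NamedTuple):
    """A log position as a (file number, byte offset) pair."""
    file: int
    offset: int


def parse_lsn(text: str) -> LSN:
    """Parse a 'file/offset' string such as '1/188113'."""
    file_part, offset_part = text.strip().split("/")
    return LSN(int(file_part), int(offset_part))


def lsn_gap(a: LSN, b: LSN) -> Optional[int]:
    """Byte distance from a to b, or None if they are in different log files."""
    if a.file != b.file:
        return None
    return b.offset - a.offset


# Values quoted above; the slave4/slave6 assignment is an assumption.
log_status = parse_lsn("1/188113")
slave4_requested = parse_lsn("1/8730339")
slave6_requested = parse_lsn("1/8730401")

# Tuples compare by file first, then offset, so ordering works directly.
in_lock_step = slave4_requested == slave6_requested
gap_between_slaves = lsn_gap(slave4_requested, slave6_requested)
gap_from_log_status = lsn_gap(log_status, slave4_requested)
```

Under those assumptions the two requested locations sit in the same log file but are not equal, and both are well past the shared log status value.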

Do you have the db_stat output from an uncorrupted slave? What about the master?
-- 
   -- Howard Chu
   Chief Architect, Symas Corp.  http://www.symas.com
   Director, Highland Sun        http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP     http://www.openldap.org/project/