Re: (ITS#5171) hdb txn_checkpoint failures



Aaron Richton wrote:
>> It's still rather suspicious that slave4 and slave6 both had identical log 
>> status for base1 (1/188113) but different requested locations (1/8730339 vs
>> 1/8730401). If they're identically configured slaves then they ought to be in 
>> lock-step. Then again, obviously they're not identical since slave6 doesn't 
>> show base4 in your log.
> 
> Identical is relative. They've got the same OpenLDAP and supporting 
> binaries running on the same patches of Solaris 9 running identical 
> turn-up scripts with identical configuration files. But this is 
> production, so we've got data changes over time. For instance, the slaves 
> bootstrap with a slapadd -q, and the underlying slapcat could easily 
> differ between slave4 and slave6 (the most recent one is automatically 
> used). I'd imagine this would look different at the db layer, even once 
> syncrepl eventually converged the logical data?
> 
>> Do you have the db_stat output from an uncorrupted slave? What about the 
>> master?
> 
> Sure... https://www.nbcs.rutgers.edu/~richton/its5171_dbstatl2

Judging from the LSNs in use on these other servers, it sure looks like 
somebody went in and zeroed out your logs on slave4 and slave6. I don't think 
the environment spontaneously corrupted itself and reset the log offsets...

One more thing to check: use "ls -l" to see whether the actual size of the 
log files corresponds with the db_stat offsets. E.g. if slave6 base1's 
log.0000001 is really 8MB but the LSN is only 233KB, then we have to look for 
a weird in-memory corruption. If instead the file is no bigger than the LSN 
says, then somebody reset your logs.
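
The size-versus-offset check above could be scripted roughly as follows. This is only a sketch: the helper names (parse_lsn, classify_log) and the verdict strings are made up for illustration, the LSN is assumed to be copied by hand from the db_stat output in "file/offset" form, and the log file path is whatever "ls -l" shows in the environment directory.

```python
import os

def parse_lsn(text):
    """Split a 'file/offset' string such as '1/188113' into two ints."""
    file_no, offset = text.strip().split("/")
    return int(file_no), int(offset)

def classify_log(log_path, lsn_offset):
    """Compare a log file's on-disk size with the LSN offset from db_stat.

    Returns (size_in_bytes, verdict). The verdict strings are hypothetical
    labels for the two cases described in the message above.
    """
    size = os.path.getsize(log_path)
    if size > lsn_offset:
        # File holds more data than the LSN accounts for: the file looks
        # intact, so the offset itself is suspect (in-memory corruption).
        return size, "in-memory"
    # File is no bigger than the LSN says: the file itself was replaced
    # or truncated, i.e. the logs were reset.
    return size, "logs-reset"
```

For the example figures, a file of about 8MB checked against an offset of about 233KB would come back as "in-memory". A file that is only slightly larger than the offset would get the same verdict, so treat the result as a hint and look at the actual sizes before drawing conclusions.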

-- 
   -- Howard Chu
   Chief Architect, Symas Corp.  http://www.symas.com
   Director, Highland Sun        http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP     http://www.openldap.org/project/