
Re: (ITS#5171) hdb txn_checkpoint failures



> Have you got backups from just before these occurrences? Can you see what the 
> last valid transaction log files were before this? Or perhaps you can get 
> some db_stat's off any other slaves that are still running OK? The idea is to 
> see whether the current valid CSNs on an equivalent slave are anywhere near 
> the numbers being logged here, e.g. 1/188113 or 1/8730339.
>
> Have you actually run out of disk space on the partitions holding the logs? 
> It's rather suspicious that two machines would act up at the same time unless 
> some admin specifically disturbed the log files on those two systems at 
> around that time.

I don't have backups for the slave BDB logs. The master slapcat output is
considered sacred data; the slave BDB log files are considered derivable
from it and don't get backed up (we'd sooner just replace the entire slave
if it acts up). The odds of the partitions having filled are minimal;
Solaris logs that at kern.notice (which on our configuration is serious
enough to mean a write to NVRAM), and logs that extend back before
September 24 don't show any such messages.
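
(For reference, roughly the kind of check I mean; just a sketch, assuming
the stock /var/adm/messages location and the usual Solaris "file system
full" NOTICE wording, both of which may differ on other setups:)

#!/usr/bin/env python3
# Rough sketch: scan Solaris syslog files for filesystem-full notices.
# Assumes the default /var/adm/messages location and the usual
# "file system full" NOTICE wording; adjust for a local syslog.conf.
import glob
import re

PATTERN = re.compile(r"file system full", re.IGNORECASE)

def full_notices(paths):
    """Yield (file, line) pairs that mention a full filesystem."""
    for path in paths:
        with open(path, errors="replace") as fh:
            for line in fh:
                if PATTERN.search(line):
                    yield path, line.rstrip()

if __name__ == "__main__":
    # /var/adm/messages plus rotated copies (messages.0, messages.1, ...)
    for path, line in full_notices(sorted(glob.glob("/var/adm/messages*"))):
        print("%s: %s" % (path, line))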


With that said, regarding "some admin specifically disturbed the log files
at around that time": our logs show that I was the only person in a
position to do so (unless somebody broke in and covered their tracks;
we'll ignore that theoretical possibility). On September 24, I
reconfigured the slaves to reach the master via a different IP address
instead of the existing connection. The times are too coincidental to be
unrelated:

(slave4) reconfigured Sep 24 09:41 (first syslog complaint 09:43)
(slave6) reconfigured Sep 24 09:39 (first syslog complaint 09:44)


So...is there something that's cued off the (reverse?) name service
entries for the master? Does the master IP hash into a CSN somehow? And if
this is indeed the case/root cause...well, quite honestly, I think that
assuming the name service database will remain constant throughout the
lifetime of a slapd instance is a fallacy. Furthermore, if this is indeed
the cause, it should be absolutely trivial for me to reproduce (I can
perform a DR on slave4/6 and reconfigure their network again).

With that in mind, I'll likely attempt this reproduction early next week.
I can still get db_stat output from all slaves (working and not) at this
point if that's of interest; a rough sketch of how I'd collect it is
below. Comments?
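
(Sketch only: the hostnames and environment directory are placeholders
for our real configuration, and the exact field wording in db_stat -l
output varies between BDB versions:)

#!/usr/bin/env python3
# Rough sketch: pull db_stat -l output from each slave so the current
# log file/offset can be compared against the LSNs in the errors
# (e.g. 1/188113, 1/8730339).
import subprocess

SLAVES = ["slave4", "slave6"]        # placeholder hostnames
DB_HOME = "/var/openldap-data"       # placeholder BDB environment directory

def log_lines(host):
    """Run db_stat -l on a slave over ssh and return the log-file lines."""
    out = subprocess.run(
        ["ssh", host, "db_stat", "-l", "-h", DB_HOME],
        capture_output=True, text=True, check=True,
    ).stdout
    return [l.strip() for l in out.splitlines() if "log file" in l.lower()]

if __name__ == "__main__":
    for host in SLAVES:
        print(host)
        for line in log_lines(host):
            print("  " + line)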