
Re: (ITS#5171) hdb txn_checkpoint failures



richton@nbcs.rutgers.edu wrote:
> Full_Name: Aaron Richton
> Version: 2.3.38
> OS: Solaris 9
> URL: ftp://ftp.openldap.org/incoming/
> Submission from: (NULL) (68.196.250.105)
> 
> 
> Just noticed that my syslog files were growing faster than usual. Upon further
> inspection, two slaves have multiple hdb databases corrupt. Both slave{4,6} have
> been (and are) running slapd since September 4. All are running patched BDB
> 4.2.52 (same binaries I've been using throughout the whole 2.3 series). All
> DB_CONFIGs have DB_LOG_AUTOREMOVE set. Messages similar to below are spewing out
> every checkpoint interval, which is the root cause of my logs growing unusually.
> I'm inclined to just zap all the databases and start again (they're only
> slaves), but figured I'd post for tracking and to ask if there's anything that
> can be grabbed out of the running process before I do so. Curiously enough,
> base4 only corrupted on slave4, not slave6. Additionally, there are other
> databases hosted on each slave that appear unaffected.

Do you have backups from just before these errors started? Can you tell which 
transaction log files were the last valid ones before this? Or perhaps you can 
get db_stat output from any other slaves that are still running OK? The idea is 
to see whether the current valid LSNs on an equivalent slave are anywhere near 
the numbers being logged here, e.g. 1/188113 or 1/8730339.

Have you actually run out of disk space on the partitions holding the logs? 
It's rather suspicious that two machines would act up at the same time unless 
some admin specifically disturbed the log files on those two systems at around 
that time.
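
To check both points on a given slave, something like the following Python sketch may help. It reports the free space on the partition and lists the transaction log files with their sizes and modification times. It assumes the logs sit in the database directory itself under the usual Berkeley DB names (log. followed by ten digits); if your DB_CONFIG puts them elsewhere, point it at that directory instead. The name log_summary is made up for illustration.

```python
import os
import re
import shutil

# Berkeley DB transaction logs are named "log." plus a ten-digit file number.
LOG_NAME = re.compile(r"^log\.(\d{10})$")


def log_summary(env_dir):
    """Return (free_bytes, [(file_number, size_bytes, mtime), ...]) for env_dir.

    Assumption: the log files live directly in env_dir.
    """
    entries = []
    for name in os.listdir(env_dir):
        m = LOG_NAME.match(name)
        if m:
            st = os.stat(os.path.join(env_dir, name))
            entries.append((int(m.group(1)), st.st_size, st.st_mtime))
    entries.sort()  # ascending by log file number
    free_bytes = shutil.disk_usage(env_dir).free
    return free_bytes, entries
```

A gap in the file numbers, or modification times that cluster around the time the errors began, would point toward the log files having been disturbed rather than toward the library.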
> 
> 
> The first indication of trouble:
> 
> Sep 24 09:43:36 slave4.rutgers.edu slapd[295]: [ID 446079 local4.debug]
> bdb(base1): DB_ENV->log_flush: LSN of 1/8730339 past current end-of-log of
> 1/188113
> Sep 24 09:43:36 slave4.rutgers.edu slapd[295]: [ID 446079 local4.debug]
> bdb(base1): Database environment corrupt; the wrong log files may have been
> removed or incompatible database files imported from another environment
> Sep 24 09:43:36 slave4.rutgers.edu slapd[295]: [ID 446079 local4.debug]
> bdb(base1): entryCSN.bdb: unable to flush page: 0
> Sep 24 09:43:36 slave4.rutgers.edu slapd[295]: [ID 446079 local4.debug]
> bdb(base1): txn_checkpoint: failed to flush the buffer cache Invalid argument
> Sep 24 09:43:36 slave4.rutgers.edu slapd[295]: [ID 446079 local4.debug]
> bdb(base2): DB_ENV->log_flush: LSN of 54/1636114 past current end-of-log of
> 4/2981780
> Sep 24 09:43:36 slave4.rutgers.edu slapd[295]: [ID 446079 local4.debug]
> bdb(base2): Database environment corrupt; the wrong log files may have been
> removed or incompatible database files imported from another environment
> Sep 24 09:43:36 slave4.rutgers.edu slapd[295]: [ID 446079 local4.debug]
> bdb(base2): entryUUID.bdb: unable to flush page: 0
> Sep 24 09:43:36 slave4.rutgers.edu slapd[295]: [ID 446079 local4.debug]
> bdb(base2): txn_checkpoint: failed to flush the buffer cache Invalid argument
> Sep 24 09:43:36 slave4.rutgers.edu slapd[295]: [ID 446079 local4.debug]
> bdb(base3): DB_ENV->log_flush: LSN of 1/600564 past current end-of-log of 1/662
> Sep 24 09:43:36 slave4.rutgers.edu slapd[295]: [ID 446079 local4.debug]
> bdb(base3): Database environment corrupt; the wrong log files may have been
> removed or incompatible database files imported from another environment
> Sep 24 09:43:36 slave4.rutgers.edu slapd[295]: [ID 446079 local4.debug]
> bdb(base3): cn.bdb: unable to flush page: 0
> Sep 24 09:43:36 slave4.rutgers.edu slapd[295]: [ID 446079 local4.debug]
> bdb(base3): txn_checkpoint: failed to flush the buffer cache Invalid argument
> Sep 24 09:43:36 slave4.rutgers.edu slapd[295]: [ID 446079 local4.debug]
> bdb(base4): DB_ENV->log_flush: LSN of 3/2765493 past current end-of-log of
> 1/539
> Sep 24 09:43:36 slave4.rutgers.edu slapd[295]: [ID 446079 local4.debug]
> bdb(base4): Database environment corrupt; the wrong log files may have been
> removed or incompatible database files imported from another environment
> Sep 24 09:43:36 slave4.rutgers.edu slapd[295]: [ID 446079 local4.debug]
> bdb(base4): uid.bdb: unable to flush page: 0
> Sep 24 09:43:36 slave4.rutgers.edu slapd[295]: [ID 446079 local4.debug]
> bdb(base4): txn_checkpoint: failed to flush the buffer cache Invalid argument
> Sep 24 09:44:49 slave6.rutgers.edu slapd[301]: [ID 446079 local4.debug]
> bdb(base1): DB_ENV->log_flush: LSN of 1/8730401 past current end-of-log of
> 1/188113
> Sep 24 09:44:49 slave6.rutgers.edu slapd[301]: [ID 446079 local4.debug]
> bdb(base1): Database environment corrupt; the wrong log files may have been
> removed or incompatible database files imported from another environment
> Sep 24 09:44:49 slave6.rutgers.edu slapd[301]: [ID 446079 local4.debug]
> bdb(base1): entryCSN.bdb: unable to flush page: 0
> Sep 24 09:44:49 slave6.rutgers.edu slapd[301]: [ID 446079 local4.debug]
> bdb(base1): txn_checkpoint: failed to flush the buffer cache Invalid argument
> Sep 24 09:44:49 slave6.rutgers.edu slapd[301]: [ID 446079 local4.debug]
> bdb(base2): DB_ENV->log_flush: LSN of 54/1634334 past current end-of-log of
> 4/1649467
> Sep 24 09:44:49 slave6.rutgers.edu slapd[301]: [ID 446079 local4.debug]
> bdb(base2): Database environment corrupt; the wrong log files may have been
> removed or incompatible database files imported from another environment
> Sep 24 09:44:49 slave6.rutgers.edu slapd[301]: [ID 446079 local4.debug]
> bdb(base2): entryUUID.bdb: unable to flush page: 0
> Sep 24 09:44:49 slave6.rutgers.edu slapd[301]: [ID 446079 local4.debug]
> bdb(base2): txn_checkpoint: failed to flush the buffer cache Invalid argument
> Sep 24 09:44:49 slave6.rutgers.edu slapd[301]: [ID 446079 local4.debug]
> bdb(base3): DB_ENV->log_flush: LSN of 1/600564 past current end-of-log of 1/538
> Sep 24 09:44:49 slave6.rutgers.edu slapd[301]: [ID 446079 local4.debug]
> bdb(base3): Database environment corrupt; the wrong log files may have been
> removed or incompatible database files imported from another environment
> Sep 24 09:44:49 slave6.rutgers.edu slapd[301]: [ID 446079 local4.debug]
> bdb(base3): cn.bdb: unable to flush page: 0
> Sep 24 09:44:49 slave6.rutgers.edu slapd[301]: [ID 446079 local4.debug]
> bdb(base3): txn_checkpoint: failed to flush the buffer cache Invalid argument


-- 
   -- Howard Chu
   Chief Architect, Symas Corp.  http://www.symas.com
   Director, Highland Sun        http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP     http://www.openldap.org/project/