
Re: Upgrade to 2.3.40 -> failed index



On Mon, 4 Feb 2008, Howard Chu wrote:

> That documentation is clearly obsolete, which is why it was removed.

slurpd is obsolete, which is why the section on slurpd was removed from the
2.4 manual. Considering OpenLDAP-2.3.39 is still marked as the stable
release, I can't really see that the 2.3 documentation in its entirety is
obsolete.

> http://www.oracle.com/technology/documentation/berkeley-db/db/ref/transapp/archival.html

Ah, that is the section on backing up/restoring a database, which I suppose
could also be considered the same procedure to be used for copying a
database from one system to another. Given your original wording, I was
looking for something more specifically geared towards copying.

> At a guess, you failed to copy the transaction log files to the slaves.

If I had failed to copy the transaction log files, I don't really see that
it would have worked at all, let alone for almost a year.

Reviewing the backup/restore procedure, I don't really see anything I might
have missed. slapd was not running during the copy, so clearly any updates
were suspended. In fact, slapd had never been run -- the copy was made
immediately after the initial slapadd. There were actually no log files
present. As I mentioned, I have bdb configured to automatically remove
them. Presumably slapadd checkpointed, explicitly or implicitly, upon
completion and they were removed. Even if there was a log file that I
didn't see, the log files were stored in the same directory as the database
files, and I copied the entire directory.
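
One way to rule out an incomplete copy is to compare the source and
destination directories file by file. The following is only a minimal
sketch in Python, not something that was run at the time. The function
names and the paths in the usage note are hypothetical, and it assumes
slapd is stopped on both sides so the files are not changing underneath it.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_db_dirs(src: str, dst: str) -> dict:
    """Compare two directory trees file by file.

    Returns a dict of three sorted lists of relative paths:
      'missing'   - present under src but not under dst
      'extra'     - present under dst but not under src
      'different' - present in both, but with different contents
    """
    src_root, dst_root = Path(src), Path(dst)
    src_files = {p.relative_to(src_root)
                 for p in src_root.rglob("*") if p.is_file()}
    dst_files = {p.relative_to(dst_root)
                 for p in dst_root.rglob("*") if p.is_file()}

    missing = sorted(str(p) for p in src_files - dst_files)
    extra = sorted(str(p) for p in dst_files - src_files)
    different = sorted(
        str(p) for p in src_files & dst_files
        if file_digest(src_root / p) != file_digest(dst_root / p)
    )
    return {"missing": missing, "extra": extra, "different": different}
```

Calling compare_db_dirs() on the master's database directory and a slave's
copy of it (for example, hypothetical paths such as "/var/lib/ldap" and a
mounted "/mnt/slave1/var/lib/ldap") should return three empty lists if the
copy is complete. Any transaction log file left behind would show up under
'missing'.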

> > Also, even if for some reason the copies on the two slaves were invalid,
> > that would not explain why the master failed. The database on the master
> > was the original database built by slapadd when the server was first put
> > into commission. How could making a copy of it have caused it to fail
> > itself?
>
> Too difficult to guess, given the lack of information. We have only your
> assurance that nothing was done incorrectly, but the facts indicate that at
> least one step was done incorrectly.

The facts only indicate that I had a catastrophic failure. That the failure
was caused by incompetence is only a hypothesis.

I do greatly appreciate your response and willingness to help; I apologize
if I'm getting a bit defensive.

You do have only my assurance that I didn't screw something up. However,
assuming I'm not lying, the facts are:

* OpenLDAP 2.3.35 was initially installed on three servers
* on the master server, slapadd was run to load in an existing database
  in LDIF format
* the resultant bdb database was then copied to both slaves
* all three were put into production March 2007 and ran perfectly
  under a reasonably heavy load
* a week or so ago I upgraded them to 2.3.40 (stop old server, install
  new server, start new server -- never touching bdb or the existing
  database files)
* they ran fine for at least 3-4 days
* this weekend, they died horribly

Given these facts, if something was done incorrectly, it does not seem
likely that it was failure to copy a transaction log file in March 2007. If
the failure was my own doing, it seems more likely a byproduct of the
upgrade, although I can't think of anything that I could have done wrong
during that process.

At this point, I guess I'll just write it off and hope it doesn't happen
again.


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  henson@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768