
Re: Corruption of Index files running readonly slapd (ITS#2582)



This sounds familiar to me, too.  I experienced a similar scenario until
I cron'ed db_checkpoint.  For whatever reason, I never got the internal
checkpointing working properly.  I also added an archive directory and
periodically move the old log files out of the log directory defined in
DB_CONFIG.  This keeps performance consistent for me.  Smaller servers
would probably only need to run this once a day or once a week (I run it
hourly), but this workaround is what I use in production (Solaris):

-- cut --
#!/bin/bash

# This script monitors and fixes the health of the BDB files. -jjt 20030202

BDBDIR=/openldap/data/replica/bdb.replica-3890
LOGDIR=/openldap/journal/bdb.replica-3890.log
ARCDIR=/openldap/journal/bdb.replica-3890.log.archive
LOGFILE=/openldap/journal/log/dbrx-replica-3890.log
TMPDIR=/tmp
LOCKFILE=/tmp/dbrx-replica-3890.lock

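# Exit if a previous run still holds the lock.
if [ -f "$LOCKFILE" ]
then
        exit 9
fi

touch "$LOCKFILE"

if [ "$LOGNAME" = "root" ]
then
        echo "`date`: Root Enabled." >> "$LOGFILE"
        # No -h is passed below, so the db_* tools use the current directory.
        if ! cd "$BDBDIR"
        then
                echo "`date`: Cannot cd to $BDBDIR." >> "$LOGFILE"
                rm "$LOCKFILE"
                exit 9
        fi
        # Move log files that are no longer needed out of the log directory.
        BEFORE=`db_archive`
        for BB in $BEFORE
        do
                echo "`date`: Moving (before): $BB" >> "$LOGFILE"
                mv "$LOGDIR/$BB" "$ARCDIR"
        done
        # Force a single checkpoint unless -n was given.
        if [ "$1" != "-n" ]
        then
                db_checkpoint -1v >> "$LOGFILE"
        fi
        # The checkpoint may have released more log files; move those too.
        AFTER=`db_archive`
        for AA in $AFTER
        do
                echo "`date`: Moving (after): $AA" >> "$LOGFILE"
                mv "$LOGDIR/$AA" "$ARCDIR"
        done
else
        echo "You must be root to use this script. -jjt"
        rm "$LOCKFILE"
        exit 9
fi

rm "$LOCKFILE"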
-- cut --
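
The lock-file test in the script is not atomic: two runs started at the
same moment can both pass the -f check before either one creates the
file.  A minimal sketch of one alternative, assuming a lock directory is
acceptable in place of a lock file.  LOCKDIR, acquire_lock and
release_lock are hypothetical names, not part of the script above:

```shell
#!/bin/bash
# Sketch: atomic locking via mkdir. LOCKDIR and both helper names are
# hypothetical; they do not appear in the original script.

LOCKDIR=${LOCKDIR:-/tmp/dbrx-replica-3890.lock.d}

# mkdir creates the directory or fails in a single step, so two
# concurrent runs cannot both succeed.
acquire_lock() {
        mkdir "$LOCKDIR" 2>/dev/null
}

# Remove the lock directory; safe to call more than once.
release_lock() {
        rmdir "$LOCKDIR" 2>/dev/null
}

# Usage:
#   acquire_lock || exit 9
#   trap release_lock EXIT    # drop the lock even if a later step fails
#   ... db_archive / db_checkpoint work goes here ...
```

With the trap in place, a failed mv or db_checkpoint no longer leaves a
stale lock behind that silently blocks every later cron run.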

Joseph
----- Original Message ----- 
From: "Howard Chu" <hyc@highlandsun.com>
To: "'Christoph Neerfeld'" <Christoph.Neerfeld@fh-bonn-rhein-sieg.de>;
<openldap-bugs@OpenLDAP.org>
Sent: Thursday, June 26, 2003 5:59 AM
Subject: RE: Corruption of Index files running readonly slapd (ITS#2582)


> Given the information you've provided, this still sounds like either the
> BDB cache is inadequate or there are stale locks in the way. Since the lock
> information is recorded in the __db.00* environment files, deleting them
> all will also remove the locks. However, there's not enough information
> here to tell that for certain.
>
> The next time you see this slowdown occur, shut down slapd and record all
> of the information you can get out of db_stat:
> db_stat -c (lock info)
> db_stat -l (logging info)
> db_stat -m (memory usage)
> db_stat -t (transactions)
>
> In particular, with slapd cleanly shut down, in the output of "db_stat -c"
> you should see zero current locks, lockers, and lock objects. If any of
> those are non-zero, we may have a locking bug, or there is a locking bug in
> the BDB library. In the output of "db_stat -m" you should look at the
> number of clean and dirty pages forced from the cache. These numbers should
> be small, preferably zero. If they are non-zero then your cache is probably
> too small. In the output of "db_stat -l" look at the number of region locks
> granted after waiting, it should be zero or very small. In the output of
> "db_stat -t" the number of active transactions should be zero. If not,
> there is a bug somewhere. The number of aborted transactions should be zero
> or very small, assuming that your usage patterns are primarily
> read-oriented. The number of maximum active transactions should be much
> smaller than the maximum active transactions possible. If not, then you
> need to reconfigure the transaction region.
>
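
A minimal sketch for capturing the four db_stat reports listed above in
one pass, so they can be compared between a healthy and a slow state.
The DB_STAT variable, the collect_db_stat function and the output file
names are hypothetical; only the -h, -c, -l, -m and -t options come from
db_stat itself:

```shell
#!/bin/bash
# Sketch: save the db_stat reports to files for later comparison.
# DB_STAT, collect_db_stat and the file layout are hypothetical names.

DB_STAT=${DB_STAT:-db_stat}

# collect_db_stat ENV_DIR OUT_DIR
# Writes one file per report: locks (-c), logging (-l), cache (-m),
# transactions (-t).
collect_db_stat() {
        env_dir=$1
        out_dir=$2
        mkdir -p "$out_dir" || return 1
        for flag in c l m t
        do
                "$DB_STAT" -h "$env_dir" "-$flag" \
                        > "$out_dir/db_stat-$flag.txt" 2>&1
        done
}

# Usage (with slapd stopped; both paths are placeholders):
#   collect_db_stat /path/to/bdb/env /path/to/reports/slow-state
```

Keeping one directory per run makes it easy to diff the reports from a
fresh import against those taken once the slowdown has set in.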
> It's better to use the db_recover command than to manually delete the
> __db.00* files. Usually, if slapd has shut down cleanly, the effect will be
> the same, but if slapd was shut down uncleanly, the db_recover command will
> flush the cache and make sure that the last committed transactions actually
> make it into the database.
>
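
A minimal sketch of running db_recover on a stopped environment instead
of deleting the __db.00* files by hand.  The DB_RECOVER variable, the
recover_env function and the pid-file argument are hypothetical and
should be adapted to the local installation; -h and -v are standard
db_recover options:

```shell
#!/bin/bash
# Sketch: run db_recover only when slapd is not running.
# DB_RECOVER, recover_env and the pid file argument are hypothetical.

DB_RECOVER=${DB_RECOVER:-db_recover}

# recover_env ENV_DIR PID_FILE
# Refuses to run while the process named in PID_FILE is still alive.
# kill -0 only tests for existence; it sends no signal. Run this as a
# user allowed to signal slapd, or the check cannot see the process.
recover_env() {
        env_dir=$1
        pid_file=$2
        if [ -f "$pid_file" ] && kill -0 "$(cat "$pid_file")" 2>/dev/null
        then
                echo "slapd still running; stop it first" >&2
                return 1
        fi
        # -h selects the environment directory, -v prints progress.
        "$DB_RECOVER" -v -h "$env_dir"
}

# Usage (both paths are placeholders):
#   recover_env /path/to/bdb/env /path/to/slapd.pid
```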
> Unless you see non-zero values for currently active lockers or
> transactions, it's unlikely that this is an OpenLDAP bug. Also, a lock
> management bug in OpenLDAP would most likely cause slapd to hang and stop
> answering queries, not just make it run slowly. If there is no indication
> of this type of bug, then you have a badly configured database, and you
> need to read the SleepyCat documentation to resolve the problem. Finally,
> even if there's an errant locker hanging around out there, it may just be
> a leftover from an unclean system shutdown, and not actually a misplaced
> lock. We've been discussing approaches to prevent this problem on the
> -devel list; the issue was first mentioned in ITS#2502 and any action taken
> will be reported there.
>
>   -- Howard Chu
>   Chief Architect, Symas Corp.       Director, Highland Sun
>   http://www.symas.com               http://highlandsun.com/hyc
>   Symas: Premier OpenSource Development and Support
>
> > -----Original Message-----
> > From: owner-openldap-bugs@OpenLDAP.org
> > [mailto:owner-openldap-bugs@OpenLDAP.org]On Behalf Of Christoph Neerfeld
>
> > We have quite the same problem. In our setup we have only 500 entries
> > and at most 200 client machines. The database is mostly read only
> > besides the changes of user passwords.
> >
> > After the import of the data via ldif the server runs very fast and
> > after three weeks the performance degrades dramatically. slapd starts
> > eating up cpu cycles for each request. Restarting slapd does not
> > change anything.
> >
> > I read the FAQ and most parts of the bdb documentation. AFAIR most
> > tips for performance tuning are related to write access to the
> > database which is of no concern to us.
> > The only hint I found is to increase the bdb cache but
> > 'db_stat -m' already reports a cache hit rate of 98%.
> >
> > So I tried another thing. I stopped slapd, removed those __db.00? files
> > and all log.00* files which db_archive reported are no longer used
> > and started slapd again. I don't know if this can corrupt my database
> > but it fixes the problem. slapd runs again with the same speed as
> > after a fresh import of the data.
> >
> > If this is a configuration problem and no bug I would appreciate any
> > hints to what I have to change.
> >
> > Here are some details to our setup:
> >
> > - Linux SMP kernel 2.4.20 running on i386 with two processors
> > - debian woody
> > - ext2 filesystem
> > - openldap 2.1.21
> > - bdb 4.1.25 compiled with --disable-largefiles
> >
> > Regards
> >
> > Christoph Neerfeld
> >
> > > There are other sites with larger installations running under heavy
> > > load that have not experienced this problem. As such, this sounds like
> > > a cache configuration problem on your end. Have you read the FAQ?
> > > http://www.openldap.org/faq/data/cache/893.html
> >
> > >  -- Howard Chu
> > >  Chief Architect, Symas Corp.       Director, Highland Sun
> > >  http://www.symas.com               http://highlandsun.com/hyc
> > >  Symas: Premier OpenSource Development and Support
> >
> > > > -----Original Message-----
> > > > From: owner-openldap-bugs@OpenLDAP.org
> > > > [mailto:owner-openldap-bugs@OpenLDAP.org]On Behalf Of ldap@uic.edu
> >
> > > > Full_Name: Andrew J. Herbert
> > > > Version: 2.1.21
> > > > OS: Linux
> > > > URL: ftp://ftp.openldap.org/incoming/
> > > > Submission from: (NULL) (128.248.172.135)
> > > >
> > > >
> > > > System master and slave pair running openldap v2.1.21 and Berkeley DB
> > > > 4.1.25 on Linux 2.4.18 systems (RH7.3 with updates) filesystems are
> > > > ext3.
> > > >
> > > > We have an issue using the PADL software pam_ldap module on a Solaris
> > > > V880 with approx 40,000 users against OpenLDAP. pam_ldap is not
> > > > configured with the root DN and the ACL are setup to allow no
> > > > modification by anyone bar the root DN. As such the LDAP database can
> > > > be considered to be read-only.
> > > >
> > > > After running for a few hours, the server starts taking an
> > > > inordinately long time (>1 min) to do a simple lookup. If we stop the
> > > > server and compare the database files with a 'known good' one, we
> > > > find that the files have changed. Performing a slapcat on the
> > > > database takes in excess of 30 mins to run, but produces a correct
> > > > LDIF which can then be reloaded (around an hour for this) and the
> > > > server then continues to run normally for another few hours.
> > > >
> > > > We can reproduce this, we have tried the following
> > > >
> > > > Originally this system came online running 2.1.17 on a pair of IDE
> > > > based servers. We moved it to newer faster SCSI based servers (Sun
> > > > LX50's) and still had the same problems. We upgraded the system to
> > > > 2.1.21 and the problem was still present. If we leave the master and
> > > > slave running long enough, eventually they both enter this slow mode
> > > > of operation.
> >
> > --
> > Christoph Neerfeld
> >
> > FH Bonn-Rhein-Sieg       | e-mail: Christoph.Neerfeld@FH-BRS.DE
> > FB Angewandte Informatik |
> > Grantham Allee 20        | phone : +49 2241/865-241
> > 53757 Sankt Augustin     |
> > Germany - Deutschland    | fax   : +49 2241/865-8241
> >
> >
> >
>