RE: Corruption of Index files running readonly slapd (ITS#2582)
This is a very informative post. I learned a few things from it myself!
If someone were to ever put together a technical
troubleshooting/tuning/recovery FAQ (hint) of some sort, this is the
kind of material I'd like to see in it.
>[mailto:owner-openldap-bugs@OpenLDAP.org] On Behalf Of Howard Chu
>Sent: Thursday, June 26, 2003 6:00 AM
>To: 'Christoph Neerfeld'; openldap-bugs@OpenLDAP.org
>Subject: RE: Corruption of Index files running readonly slapd
>Given the information you've provided, this still sounds like
>either the BDB cache is inadequate or there are stale locks in
>the way. Since the lock information is recorded in the
>__db.00* environment files, deleting them all will also remove
>the locks. However, there's not enough information here to
>tell that for certain.
>The next time you see this slowdown occur, shut down slapd
>and record all of the information you can get out of db_stat:
> db_stat -c (lock info)
> db_stat -l (logging info)
> db_stat -m (memory usage)
> db_stat -t (transactions)
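For anyone who wants to capture these four reports in one step, here is a minimal Python sketch. It assumes db_stat is on the PATH and that the BDB environment directory is passed with db_stat's -h option; the function names, the output file names, and any directory you pass in are illustrative only and do not come from the thread.

```python
import subprocess
from pathlib import Path

# The four reports requested above, keyed by db_stat flag.
# The short names are only used to build output file names (hypothetical).
REPORTS = {"-c": "lock", "-l": "log", "-m": "memory", "-t": "txn"}


def db_stat_commands(db_home):
    """Return one db_stat argument list per report, in a fixed order."""
    return [["db_stat", flag, "-h", str(db_home)] for flag in REPORTS]


def collect(db_home, out_dir):
    """Run each report and save its output; call only with slapd stopped."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for cmd in db_stat_commands(db_home):
        # check=False: keep whatever db_stat printed even if it exits non-zero.
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        name = "db_stat_" + REPORTS[cmd[1]] + ".txt"
        (out_dir / name).write_text(result.stdout + result.stderr)
```

Calling collect() with the database directory and an empty output directory leaves one text file per report, which can then be attached to a bug report.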
>In particular, with slapd cleanly shut down, in the output of
>"db_stat -c" you should see zero current locks, lockers, and
>lock objects. If any of those are non-zero, either slapd has a
>locking bug or there is a locking bug in the BDB library. In
>the output of "db_stat -m" you should look at the number of
>clean and dirty pages forced from the cache. These numbers
>should be small, preferably zero. If they are non-zero then
>your cache is probably too small. In the output of "db_stat
>-l" look at the number of region locks granted after waiting;
>it should be zero or very small. In the output of "db_stat -t"
>the number of active transactions should be zero. If not,
>there is a bug somewhere. The number of aborted transactions
>should be zero or very small, assuming that your usage
>patterns are primarily read-oriented. The maximum number of
>transactions active at any one time should be much smaller than
>the maximum number of active transactions possible. If not, then you need to
>reconfigure the transaction region.
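The checks described above mostly come down to "this counter should be zero". A small Python sketch of that idea follows. It assumes each db_stat line has the form "number, whitespace, description"; the sample text is hypothetical and is not copied from real db_stat output, so the keywords would need adjusting to the wording your BDB version actually prints.

```python
import re

# Assumed line shape: "<number><whitespace><description>".
# Lines that do not fit (for example sizes such as "64KB ...") are skipped.
STAT_LINE = re.compile(r"^\s*(\d+)\s+(.+?)\s*$")


def parse_stats(text):
    """Map each description (lowercased) to its integer value."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line)
        if match:
            stats[match.group(2).lower()] = int(match.group(1))
    return stats


def nonzero(stats, keywords):
    """Return entries whose description contains a keyword and whose value is not zero."""
    return {
        desc: value
        for desc, value in stats.items()
        if value != 0 and any(k in desc for k in keywords)
    }


# Hypothetical sample, not real db_stat output.
SAMPLE = """\
0\tCurrent locks
2\tCurrent lockers
0\tCurrent lock objects
"""

suspects = nonzero(parse_stats(SAMPLE), ["current lock"])
print(suspects)
```

With the sample above, only the "current lockers" entry is reported, which is the situation the message describes as a possible stale locker.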
>It's better to use the db_recover command than to manually delete the
>__db.00* files. Usually, if slapd has shut down cleanly, the
>effect will be the same, but if slapd was shut down uncleanly,
>the db_recover command will flush the cache and make sure that
>the last committed transactions actually make it into the database.
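As a rough illustration of preferring db_recover over deleting the region files by hand, here is a Python sketch. It assumes db_recover is on the PATH and uses its -h (environment directory) and -v (verbose) options; the function names and any path you pass in are hypothetical. The helper that lists __db.00* files is there only so you can see what exists; it deletes nothing.

```python
import glob
import os
import subprocess


def recover_command(db_home):
    """Build the db_recover invocation for the environment in db_home."""
    return ["db_recover", "-v", "-h", db_home]


def region_files(db_home):
    """List the __db.00* region files, for inspection only; nothing is deleted."""
    return sorted(glob.glob(os.path.join(db_home, "__db.00*")))


def recover(db_home):
    """Run db_recover; the caller must have stopped slapd first."""
    # check=True raises CalledProcessError if db_recover reports a failure.
    return subprocess.run(recover_command(db_home), check=True)
```

Whether slapd is really stopped is deliberately left to the caller, since how to check that depends on the local init setup.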
>Unless you see non-zero values for currently active lockers or
>transactions, it's unlikely that this is an OpenLDAP bug.
>Also, a lock management bug in OpenLDAP would most likely
>cause slapd to hang and stop answering queries, not just make
>it run slowly. If there is no indication of this type of bug,
>then you have a badly configured database, and you need to
>read the Sleepycat documentation to resolve the problem.
>Finally, even if there's an errant locker hanging around out
>there, it may just be a leftover from an unclean system
>shutdown, and not actually a misplaced lock. We've been
>discussing approaches to prevent this problem on the -devel
>list; the issue was first mentioned in ITS#2502 and any action
>taken will be reported there.
> -- Howard Chu
> Chief Architect, Symas Corp. Director, Highland Sun
> http://www.symas.com http://highlandsun.com/hyc
> Symas: Premier OpenSource Development and Support
>> -----Original Message-----
>> From: owner-openldap-bugs@OpenLDAP.org
>> [mailto:owner-openldap-bugs@OpenLDAP.org]On Behalf Of Christoph
>> We have quite the same problem. In our setup we have only
>> and at most 200 client machines. The database is mostly read-only
>> apart from changes to user passwords.
>> After importing the data via LDIF the server runs very fast, but
>> after three weeks the performance degrades dramatically. slapd starts
>> eating up CPU cycles for each request. Restarting slapd does not
>> change anything.
>> I read the FAQ and most parts of the bdb documentation. AFAIR most
>> tips for performance tuning are related to write access to the
>> database which is of no concern to us. The only hint I found is to
>> increase the bdb cache but 'db_stat -m' already reports a cache hit
>> rate of 98%.
>> So I tried another thing. I stopped slapd, removed those
>> and all log.00* files which db_archive reported as no longer used
>> and started slapd again. I don't know if this can corrupt my
>> but it fixes the problem. slapd runs again with the same speed as
>> after a fresh import of the data.
>> If this is a configuration problem and not a bug I would appreciate any
>> hints as to what I have to change.
>> Here are some details of our setup:
>> - Linux SMP kernel 2.4.20 running on i386 with two processors
>> - debian woody
>> - ext2 filesystem
>> - openldap 2.1.21
>> - bdb 4.1.25 compiled with --disable-largefiles
>> Christoph Neerfeld
>> > There are other sites with larger installations running
>> > load that have not experienced this problem. As such, this sounds
>> > like a cache configuration problem on your end. Have you read the
>> > FAQ? http://www.openldap.org/faq/data/cache/893.html
>> > -- Howard Chu
>> > Chief Architect, Symas Corp. Director, Highland Sun
>> > http://www.symas.com http://highlandsun.com/hyc
>> > Symas: Premier OpenSource Development and Support
>> > > -----Original Message-----
>> > > From: owner-openldap-bugs@OpenLDAP.org
>> > > [mailto:owner-openldap-bugs@OpenLDAP.org]On Behalf Of
>> > > Full_Name: Andrew J. Herbert
>> > > Version: 2.1.21
>> > > OS: Linux
>> > > URL: ftp://ftp.openldap.org/incoming/
>> > > Submission from: (NULL) (126.96.36.199)
>> > >
>> > >
>> > > System master and slave pair running openldap v2.1.21
>> > > DB 4.1.25 on Linux 2.4.18 systems (RH7.3 with updates)
>> > > are ext3.
>> > >
>> > > We have an issue using the PADL software pam_ldap module on a
>> > > Solaris V880 with approx 40,000 users against OpenLDAP. pam_ldap
>> > > is not configured with the root
>> > > DN and the ACLs are set up to allow no modification by anyone
>> > > bar the root DN. As
>> > > such the LDAP database can be considered to be read-only.
>> > >
>> > > After running for a few hours, the server starts taking an
>> > > inordinately long time (>1
>> > > min) to do a simple lookup. If we stop the server and
>> > > database files with a 'known good' one, we find that the files
>> > > have changed. Performing a
>> > > slapcat on the database takes in excess of 30 mins to run,
>> > > but produces a
>> > > correct LDIF which can then be reloaded (around an hour for
>> > > this) and the server
>> > > then continues to run normally for another few hours.
>> > >
>> > > We can reproduce this; we have tried the following:
>> > >
>> > > Originally this system came online running 2.1.17 on a pair of IDE
>> > > based servers. We moved it to newer faster SCSI based
>> > > LX50's) and still
>> > > had the same problems. We upgraded the system to 2.1.21 and
>> > > the problem was
>> > > still present. If we leave the master and slave running long
>> > > enough, eventually
>> > > they both enter this slow mode of operation.
>> Christoph Neerfeld
>> FH Bonn-Rhein-Sieg | e-mail: Christoph.Neerfeld@FH-BRS.DE
>> FB Angewandte Informatik |
>> Grantham Allee 20 | phone : +49 2241/865-241
>> 53757 Sankt Augustin |
>> Germany - Deutschland | fax : +49 2241/865-8241