RE: Corruption of Index files running readonly slapd (ITS#2582)



Howard,

This is a very informative post.  I learned a few things from it myself!
If someone were to ever put together a technical
troubleshooting/tuning/recovery FAQ (hint) of some sort, this is the
kind of material I'd like to see in it.

Harold


>-----Original Message-----
>From: owner-openldap-bugs@OpenLDAP.org 
>[mailto:owner-openldap-bugs@OpenLDAP.org] On Behalf Of Howard Chu
>Sent: Thursday, June 26, 2003 6:00 AM
>To: 'Christoph Neerfeld'; openldap-bugs@OpenLDAP.org
>Subject: RE: Corruption of Index files running readonly slapd 
>(ITS#2582)
>
>
>Given the information you've provided, this still sounds like 
>either the BDB cache is inadequate or there are stale locks in 
>the way. Since the lock information is recorded in the 
>__db.00* environment files, deleting them all will also remove 
>the locks. However, there's not enough information here to 
>tell that for certain.
>
>The next time you see this slowdown occur, shut down slapd 
>and record all of the information you can get out of db_stat:
>	db_stat -c (lock info)
>	db_stat -l (logging info)
>	db_stat -m (memory usage)
>	db_stat -t (transactions)
>
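A minimal sketch of how the four reports above could be captured in one pass, assuming db_stat is on the PATH and the database environment lives in a directory passed as db_home (the /var/lib/ldap path in the usage note is only an example, not taken from this thread). The function names and output file names are hypothetical.

```python
import subprocess
from pathlib import Path

# The four reports named above: db_stat flag -> what it covers.
DB_STAT_REPORTS = {
    "-c": "lock info",
    "-l": "logging info",
    "-m": "memory usage",
    "-t": "transactions",
}


def db_stat_commands(db_home):
    """Build one db_stat command line per report for the environment at db_home."""
    return [["db_stat", "-h", str(db_home), flag] for flag in DB_STAT_REPORTS]


def capture_reports(db_home, out_dir):
    """Run each command and save its output. Call this only after slapd is stopped."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for cmd in db_stat_commands(db_home):
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        # One file per flag, e.g. db_stat-c.txt (hypothetical naming scheme).
        (out_dir / f"db_stat{cmd[-1]}.txt").write_text(result.stdout + result.stderr)
```

For example, capture_reports("/var/lib/ldap", "/tmp/slapd-stats") would leave four text files that can be attached to a bug report.
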
>In particular, with slapd cleanly shut down, in the output of 
>"db_stat -c" you should see zero current locks, lockers, and 
>lock objects. If any of those are non-zero, there may be a 
>locking bug, either in OpenLDAP or in the BDB library. In 
>the output of "db_stat -m" you should look at the number of 
>clean and dirty pages forced from the cache. These numbers 
>should be small, preferably zero. If they are non-zero then 
>your cache is probably too small. In the output of "db_stat 
>-l" look at the number of region locks granted after waiting; 
>it should be zero or very small. In the output of "db_stat -t" 
>the number of active transactions should be zero. If not, 
>there is a bug somewhere. The number of aborted transactions 
>should be zero or very small, assuming that your usage 
>patterns are primarily read-oriented. The maximum number of 
>active transactions should be much smaller than the maximum 
>number of active transactions possible. If not, then you need to 
>reconfigure the transaction region.
>
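The checks described above could be automated roughly as follows. This is a sketch under assumptions: it treats each report line as a leading integer followed by a description, and it matches counters by keyword rather than by exact label, since the exact wording may differ between BDB versions. The sample text is made up for illustration and is not real db_stat output.

```python
import re


def parse_db_stat(text):
    """Map each description (lowercased) to its leading integer.

    Lines that do not start with a plain integer followed by whitespace
    (for example sizes such as 256KB, or percentages) are skipped.
    """
    stats = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\d+)\s+(\S.*)$", line)
        if m:
            stats[m.group(2).strip().lower()] = int(m.group(1))
    return stats


def nonzero(stats, keywords):
    """Return counters whose description contains every keyword and whose value is non-zero."""
    return {d: v for d, v in stats.items() if v and all(k in d for k in keywords)}


# Hypothetical sample in the assumed "<number><TAB><description>" layout.
sample = "0\tCurrent locks\n2\tCurrent lockers\n0\tCurrent lock objects\n"

stats = parse_db_stat(sample)
print(nonzero(stats, ["current"]))  # {'current lockers': 2}
```

Any counter reported by nonzero() for the lock or transaction reports taken after a clean shutdown would be worth including in a bug report.
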
>It's better to use the db_recover command than to manually delete the
>__db.00* files. Usually, if slapd has shut down cleanly, the 
>effect will be the same, but if slapd was shut down uncleanly, 
>the db_recover command will flush the cache and make sure that 
>the last committed transactions actually make it into the database.
>
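A small sketch of running recovery instead of deleting the environment files by hand, assuming db_recover is on the PATH and slapd has already been stopped. The helper names are hypothetical; -h (environment directory) and -v (verbose) are the only db_recover options used.

```python
import subprocess


def db_recover_command(db_home, verbose=True):
    """Build the db_recover command line for the environment at db_home."""
    cmd = ["db_recover", "-h", str(db_home)]
    if verbose:
        cmd.append("-v")
    return cmd


def recover(db_home):
    """Run recovery. Raises subprocess.CalledProcessError if db_recover fails."""
    subprocess.run(db_recover_command(db_home), check=True)
```

With slapd stopped, recover("/var/lib/ldap") would run recovery on an environment in that (example) directory before the server is started again.
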
>Unless you see non-zero values for currently active lockers or 
>transactions, it's unlikely that this is an OpenLDAP bug. 
>Also, a lock management bug in OpenLDAP would most likely 
>cause slapd to hang and stop answering queries, not just make 
>it run slowly. If there is no indication of this type of bug, 
>then you have a badly configured database, and you need to 
>read the Sleepycat documentation to resolve the problem. 
>Finally, even if there's an errant locker hanging around out 
>there, it may just be a leftover from an unclean system 
>shutdown, and not actually a misplaced lock. We've been 
>discussing approaches to prevent this problem on the -devel 
>list; the issue was first mentioned in ITS#2502 and any action 
>taken will be reported there.
>
>  -- Howard Chu
>  Chief Architect, Symas Corp.       Director, Highland Sun
>  http://www.symas.com               http://highlandsun.com/hyc
>  Symas: Premier OpenSource Development and Support
>
>> -----Original Message-----
>> From: owner-openldap-bugs@OpenLDAP.org 
>> [mailto:owner-openldap-bugs@OpenLDAP.org] On Behalf Of Christoph Neerfeld
>
>> We have quite the same problem. In our setup we have only 500 entries 
>> and at most 200 client machines. The database is mostly read-only 
>> apart from changes to user passwords.
>>
>> After the import of the data via ldif the server runs very fast and 
>> after three weeks the performance degrades dramatically. slapd starts 
>> eating up cpu cycles for each request. Restarting slapd does not 
>> change anything.
>>
>> I read the FAQ and most parts of the bdb documentation. AFAIR most 
>> tips for performance tuning are related to write access to the 
>> database which is of no concern to us. The only hint I found is to 
>> increase the bdb cache but 'db_stat -m' already reports a cache hit 
>> rate of 98%.
>>
>> So I tried another thing. I stopped slapd, removed those __db.00? files 
>> and all log.00* files which db_archive reported are no longer used, 
>> and started slapd again. I don't know if this can corrupt my database, 
>> but it fixes the problem. slapd runs again with the same speed as 
>> after a fresh import of the data.
>>
>> If this is a configuration problem and not a bug, I would appreciate any 
>> hints as to what I have to change.
>>
>> Here are some details of our setup:
>>
>> - Linux SMP kernel 2.4.20 running on i386 with two processors
>> - debian woody
>> - ext2 filesystem
>> - openldap 2.1.21
>> - bdb 4.1.25 compiled with --disable-largefiles
>>
>> Regards
>>
>> Christoph Neerfeld
>>
>> > There are other sites with larger installations running under heavy 
>> > load that have not experienced this problem. As such, this sounds 
>> > like a cache configuration problem on your end. Have you read the 
>> > FAQ? http://www.openldap.org/faq/data/cache/893.html
>>
>> >  -- Howard Chu
>> >  Chief Architect, Symas Corp.       Director, Highland Sun
>> >  http://www.symas.com               http://highlandsun.com/hyc
>> >  Symas: Premier OpenSource Development and Support
>>
>> > > -----Original Message-----
>> > > From: owner-openldap-bugs@OpenLDAP.org 
>> > > [mailto:owner-openldap-bugs@OpenLDAP.org] On Behalf Of ldap@uic.edu
>>
>> > > Full_Name: Andrew J. Herbert
>> > > Version: 2.1.21
>> > > OS: Linux
>> > > URL: ftp://ftp.openldap.org/incoming/
>> > > Submission from: (NULL) (128.248.172.135)
>> > >
>> > >
>> > > System master and slave pair running openldap v2.1.21 and Berkeley 
>> > > DB 4.1.25 on Linux 2.4.18 systems (RH7.3 with updates). Filesystems 
>> > > are ext3.
>> > >
>> > > We have an issue using the PADL software pam_ldap module on a 
>> > > Solaris V880 with approx 40,000 users against OpenLDAP. pam_ldap 
>> > > is not configured with the root DN and the ACLs are set up to allow 
>> > > no modification by anyone bar the root DN. As such the LDAP database 
>> > > can be considered to be read-only.
>> > >
>> > > After running for a few hours, the server starts taking an 
>> > > inordinately long time (>1 min) to do a simple lookup. If we stop 
>> > > the server and compare the database files with a 'known good' one, 
>> > > we find that the files have changed. Performing a slapcat on the 
>> > > database takes in excess of 30 mins to run, but produces a correct 
>> > > LDIF which can then be reloaded (around an hour for this) and the 
>> > > server then continues to run normally for another few hours.
>> > >
>> > > We can reproduce this. We have tried the following:
>> > >
>> > > Originally this system came online running 2.1.17 on a pair of 
>> > > IDE-based servers. We moved it to newer, faster SCSI-based servers 
>> > > (Sun LX50s) and still had the same problems. We upgraded the system 
>> > > to 2.1.21 and the problem was still present. If we leave the master 
>> > > and slave running long enough, eventually they both enter this slow 
>> > > mode of operation.
>>
>> --
>> Christoph Neerfeld
>>
>> FH Bonn-Rhein-Sieg       | e-mail: Christoph.Neerfeld@FH-BRS.DE
>> FB Angewandte Informatik |
>> Grantham Allee 20        | phone : +49 2241/865-241
>> 53757 Sankt Augustin     |
>> Germany - Deutschland    | fax   : +49 2241/865-8241
>>
>>
>>
>