Re: Looking for proven version

On Tue, Oct 21, 2003 at 07:21:56PM +0700, Beast wrote:
> These combination are running fine on 3 sites (with average 200 users)
> but not in one site which has around 600+ users.  After few days, slapd
> become unresponsive (or totally hang, but not died, ie. ldapsearch takes
> forever to completes).  restarting slapd sometimes helps but sometimes
> not.  I have made few tuning on db as suggested in faq and this list, but
> did not help. I understands that many of you are having more users and
> higher load than mine so i guess my setup is not correct (but why it
> works on other site? ;)

I have been having the same issue on my RedHat 9 boxes.  I have found out
what is going on, but have yet to figure out why.  What is happening for me
is that my indexes are getting corrupted.  About every 4-5 days or so, the
whole system becomes very unresponsive and slapd is pegging the CPU at
99.9% usage (2.8 GHz P4 w/ 1 GB of RAM).  If I stop slapd and rebuild the
indexes, it takes care of it for another 4-5 days or so.  But it is really
getting tiring to have to do this.
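
The stop / rebuild / restart cycle described above can be sketched as a
small script.  This is only a sketch under assumptions: the init script
name "ldap" and a slapindex binary on the PATH are guesses about a typical
RedHat 9 install and are not taken from this thread, so adjust them for the
local setup.  It defaults to a dry run that only lists the commands.

```python
import subprocess

# Assumed commands: the init script name "ldap" and slapindex being on PATH
# are guesses about a RedHat 9 install, not something confirmed in the thread.
REINDEX_STEPS = [
    ["service", "ldap", "stop"],   # slapd should not be running during the rebuild
    ["slapindex"],                 # regenerate the indexes from the existing database
    ["service", "ldap", "start"],
]


def rebuild_indexes(steps=REINDEX_STEPS, dry_run=True):
    """Return the planned commands; execute them in order unless dry_run is set.

    Execution stops at the first failing step, so slapd is not restarted
    on top of a failed index rebuild.
    """
    planned = [" ".join(step) for step in steps]
    if dry_run:
        return planned
    for step in steps:
        subprocess.run(step, check=True)  # raises CalledProcessError on non-zero exit
    return planned


if __name__ == "__main__":
    for line in rebuild_indexes():
        print(line)
```

Calling rebuild_indexes(dry_run=False) as root would actually run the three
steps.  If slapd runs as an unprivileged user, check the ownership of the
database files afterwards.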

I've done a lot of googling over the last 2 months trying to figure out
why this might be happening, and I've had a couple different leads.  Let me
just list them here, but with the disclaimer that I don't really understand
what I'm talking about all that well.  :)

Lead #1:
The kernel's handling of the O_DIRECT flag.  Supposedly the NPTL stuff
in the RedHat kernel does some kind of weirdness to apps that use db4.
This has especially come up with people having problems with rpm (it uses
db4) hanging, corrupting the database, etc. on RedHat 9.

Lead #2:
Something in the db4 threading having issues with the kernel's NPTL stuff
(does this make sense?).  It makes sense in my head that this could
possibly cause corruption, but whether that is realistic or not I don't
know.  I've seen several things that say to build db4 without threading
support, but I haven't found that in the configure script as an option so
I've been hesitant to believe this one.

Where am I right now?  I've rebuilt openldap from the source rpm probably
a dozen times over the last few months toggling different options each time
to see if that will take care of the corruption problems.  The source rpm
contains its own copy of db4 (4.1.25) that it is statically linked against,
so I've rebuilt that library with different options too.  Each time, same
old, same old.  My current approach is to play with the kernel to see if
that's what could be causing it.  Last Friday (the 17th) I rebooted into a
2.4.22 stock kernel.org kernel, and have been keeping a close eye on it.
So far so good -- in 2 hours it will be 4 days with the new slapd and no
index corruption yet.  The CPU usage of slapd has seemed to go down as well
with the new kernel for some reason.  Over the past few months, after 4
days of running, slapd would have eaten up around 500 minutes of CPU time
(we have a modest network here too with about 600 users and 100 systems).
I'm currently looking at 87 minutes of CPU time for slapd.
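
To put those two CPU figures side by side, here is a rough calculation of
the average share of one CPU that slapd used in each case.  The wall-clock
hours are approximations read off the text above (about 4 days before, and
4 days minus 2 hours now), and cpu_share is just a helper name made up for
this illustration.

```python
def cpu_share(cpu_minutes, wall_hours):
    """Average fraction of one CPU used over the wall-clock interval."""
    return cpu_minutes / (wall_hours * 60)


old = cpu_share(500, 96)  # roughly 4 days on the RedHat kernel
new = cpu_share(87, 94)   # 4 days minus 2 hours on the stock 2.4.22 kernel

print(f"old kernel: {old:.1%}  new kernel: {new:.1%}  ratio: {old / new:.1f}x")
# prints: old kernel: 8.7%  new kernel: 1.5%  ratio: 5.6x
```

So slapd went from averaging a bit under 9% of the CPU to about 1.5%,
going by these rough numbers.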

Hopefully this is useful information.  Our problems sounded similar and so
I thought I'd let you know what I was looking at.  If you figure more out,
please let me know, and I'll do the same.