
Re: slapd hangs at 100% cpu in sched_yield (ITS#2030)

Sorry to be a pain, but I really don't like the idea of returning an error to the client under this condition.  I have been running with the first patch I sent you since just before I submitted it, and it has not caused any problems.  When a locker is unavailable, the program loops ~20-30 times (which takes practically no time) and then continues once a locker is freed.

The reason I am trying to find another way around this is that sending an LDAP error to the client because the server is "too busy" (which is not really true, as a locker becomes available almost immediately) will cause errors that are hard to debug: the LDAP server will start rejecting requests at random as lockers become scarce.  Increasing the number of lockers only delays the problem; as the LDAP server gets busier and uses even more lockers, we will hit the limit again.  I would rather see worse performance while a request waits for a locker than have an LDAP lookup fail, which will cause problems for us.

I have had a closer look at the db4 source, and it looks like you are right that ENOMEM is what the lock_id() routine returns in this case.
Looking at the possible return codes from the __lock_id function in lock/lock.c, I see:
ret=0 at the top (as the default)
ret = __lock_getlocker(lt, *idp, locker_ndx, 1, &lk);
return (ret);

inside the __lock_getlocker() function we have:
return (ENOMEM); (which is the part of the code I was getting an error from)
return (0); (the default)

So... as far as I can see, lock_id() will return EINVAL, ENOMEM or 0.

ENOMEM is returned when "Lock table is out of available locker entries".  
As far as I can tell (and please correct me if I am wrong), the reason that we run out of locks is because other threads are holding onto them.  
Increasing the number of locks may improve performance (as we would not need to wait for another thread to finish with its lock), but as long as we are getting an ENOMEM error, the database is out of locks (because another thread is holding the lock), and we should loop until the other thread frees the lock.  This certainly fixes the problem on our system: the first patch I submitted has been running for the past day or two without any problems.
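
To make the retry idea concrete, here is a minimal sketch of the loop described above. It is not the patch itself. The names lock_id_fn, lock_id_with_retry, fake_env and fake_lock_id are hypothetical, the max_tries bound is an assumption added for illustration, and POSIX sched_yield() stands in for ldap_pvt_thread_yield() so the example builds without slapd or DB4.

```c
#include <assert.h>
#include <errno.h>
#include <sched.h>

/* Hypothetical stand-in for LOCK_ID(): returns 0, ENOMEM or EINVAL. */
typedef int (*lock_id_fn)(void *env, unsigned int *locker);

/*
 * Call fn until it stops returning ENOMEM (no free locker entries),
 * yielding the CPU between attempts.  Any other result, success or a
 * hard error, is returned immediately.  After max_tries attempts the
 * last result (still ENOMEM) is returned so the caller can decide what
 * to report.  max_tries is an assumption; it is not in the patch.
 */
int lock_id_with_retry(lock_id_fn fn, void *env, unsigned int *locker,
                       int max_tries)
{
    int rc = EINVAL;   /* returned unchanged if max_tries <= 0 */
    int attempt;

    assert(fn != NULL && locker != NULL);

    for (attempt = 0; attempt < max_tries; attempt++) {
        rc = fn(env, locker);
        if (rc != ENOMEM)
            return rc;      /* 0 on success, or a non-retryable error */
        sched_yield();      /* let a thread holding a locker run */
    }
    return rc;
}

/* Demonstration stub: reports ENOMEM for the first busy_calls calls. */
struct fake_env {
    int busy_calls;   /* remaining calls that will fail with ENOMEM */
    int calls;        /* total calls made so far */
};

int fake_lock_id(void *env, unsigned int *locker)
{
    struct fake_env *fe = env;

    fe->calls++;
    if (fe->busy_calls > 0) {
        fe->busy_calls--;
        return ENOMEM;
    }
    *locker = 42u;    /* arbitrary id for the demonstration */
    return 0;
}
```

With a stub that is busy for three calls, lock_id_with_retry(fake_lock_id, &fe, &id, 10) succeeds on the fourth call.  A bounded loop like this still lets the server fall back to an error if lockers never free up, at the cost of picking a limit.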

What I am not sure about is how many locker entries each thread may be holding, and how many are currently configured in the slapd code.  The default should be 1000 (according to the db4 docs), which is a lot more than I thought slapd should use.



On Wed, 21 Aug 2002 00:30:55 Kurt D. Zeilenga wrote:
>At 01:33 AM 2002-08-20, steven.wilton@team.eftel.com wrote:
>>How about adding the following lines to the patch you have applied to
>Because, as far as I can tell from looking at DB4 sources,
>LOCK_ID() does not return DB_LOCK_NOTGRANTED.
>The kinds of errors LOCK_ID() does return, such as ENOMEM,
>are generally mapped to LDAP_OTHER by slapd(8).  LDAP_BUSY
>is a possibility here.
>I note that looping while waiting for resources to free generally
>makes resource starvation problems worse, not better.
>Resource starvation is best resolved by making more resources
>available to the process (or by coding changes to reduce the
>demand for resources).
>>If the lock is rejected for the given reason, there is nothing major
>>wrong with the database, but we should retry.  The client program does
>>not know that the ldap server is only having a temporary error getting
>>the data (as opposed to if the lock is rejected due to something like a
>>corrupt database, where we should send an error back to the client).
>>                rc = LOCK_ID ( bdb->bi_dbenv, &locker );
>>                switch(rc) {
>>                case 0:
>>                        break;
>>+               case DB_LOCK_NOTGRANTED:
>>+                       ldap_pvt_thread_yield();
>>+                       goto retry;
>>                default:
>>                        return LDAP_OTHER;
>>                }
>>We use ldap to authenticate users, and if one of the ldap client
>>programs detects an error, unusual things will happen on the system (some
>>requests will work, while a random number of connections will fail for no
>>good reason).
>>On Tue, 20 Aug 2002 09:48:11 Kurt D. Zeilenga wrote:
>>>I agree that the return result of LOCK_ID() should be checked.
>>>I've added code which causes an LDAP_OTHER error if LOCK_ID()
>>>fails, which in a quick check of DB4 code, is consistent with
>>>possible error conditions.