
Re: Lock is no longer valid / deferring operation

To: "Toby Blake" <toby@inf.ed.ac.uk>
Subject: Re: Lock is no longer valid / deferring operation
From: "Gavin Henry" <ghenry@suretecsystems.com>
Date: Thu, 5 Jul 2007 11:04:26 +0100 (BST)
Cc: openldap-software@openldap.org
Importance: Normal
In-reply-to: <Pine.LNX.4.64.0707051029560.1446@syd.inf.ed.ac.uk>
References: <Pine.LNX.4.64.0707041320590.23577@syd.inf.ed.ac.uk> <59275.212.159.59.85.1183578039.squirrel@webmail.suretecsystems.com> <Pine.LNX.4.64.0707051029560.1446@syd.inf.ed.ac.uk>
User-agent: SquirrelMail/1.4.10a-1.fc6

<quote who="Toby Blake">
> Hi there,
>
> Firstly, many thanks for the replies...

np.

>
>> Hi Toby.
>>
>>> For largely historical reasons we run slapd servers on most clients
>>> (this will probably change in the future - I'm just giving this
>>> information as background).
>>
>> Why?
>
> Why will this change or why did we do it in the first place?  I wasn't
> party to these decisions at the time, so I can't really comment on the
> reasons for them.  I could speculate wildly, but I'd prefer not to.

Understood.

>
>>>  We're seeing problems when some of these
>>> machines are busy, particularly, it seems, with memory intensive
>>> activity, although it's hard to substantiate as I generally only see
>>> the machines after they've broken.  It's annoying as I can't reproduce
>>> these problems.
>>
>> It's going to be hard to pinpoint then ;-) How much memory/CPU
>> etc. do these clients have and what other services do they provide?
>
> They're typically desktop or lab machines for academics, students,
> etc.  Hardware-wise they're Dell desktop boxes a few years old - a
> 2.4GHz processor with 512MB of memory is typical.  Something I should
> have mentioned is that they're running Fedora Core 5, with a few
> running FC6.

OK.

>
> As for what services they provide, general desktop services, but also
> could be running long-running or intensive jobs.  Many of the machines
> are also in a condor pool and this does seem to cause more problems.
>
> Do you know if slapd gets unhappy if other processes use up lots of
> memory?  This is my current line of investigation - I'll try to make
> it unhappy by using increasing amounts of memory.

Yes.

>
> I suppose what I'm trying to determine is - is it the client activity
> that's causing problems (i.e. a misbehaving client or similar) or is
> it slapd itself getting unhappy for other reasons (possibly due to
> resources being used by other programs)?  Or a combination of both?

Probably both. If a client keeps sending lots of bind/search requests at
once, slapd will queue/defer them.

>
>>> We see quite a few problems with slapd getting into a state where it's
>>> deferring operations, for whatever reason - I think I understand these
>>> - these are when slapd basically says sorry, I'm too busy doing X, so
>>> I'll defer Y until I have time.  Is this accurate?
>>
>> Yes. What kind of clients are searching/binding to them? Local?
>
> All local.  As for what kind of clients - typical linux desktop
> activity I suppose.  Hard to be specific about this really, as it will
> change from host to host.

OK.

Is this happening on all desktops then?

>
>>> The second case I'm also seeing is bdb complaining about locks being
>>> no longer valid, e.g.
>>>
>>> slapd[3780]: bdb(dc=inf,dc=ed,dc=ac,dc=uk): DB_LOCK->lock_put: Lock is no longer valid
>>>
>>> slapd seems to keep going for the time being until getting into a
>>> state where it defers all binding operations and goes into some kind
>>> of spin where it sits at 99% cpu and has to be killed with a -9.
>>
>> Is everything local? Nothing network-mounted, like NFS, for the directory
>> data?
>
> Machines will have both NFS and AFS for home directory data.

Not the data directory then, ok.

>
>>> I suppose I have a couple of questions about the "Lock is no longer
>>> valid" error....
>>>
>>> - What causes it?
>>> - Is it something I can prevent by configuration changes (for
>>>    instance, would increasing the numbers of locks, lockers and objects
>>>    help?)
>>
>> One for the dev team. I do know this is an error message from
>> Berkeley DB by grepping the source.
>
> Yes, I saw it in the source, but don't know it well enough to be sure
> of what's causing it.

Likewise.
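
For reference, the lock, locker and object limits you ask about are set
with the Berkeley DB directives below, either in DB_CONFIG or as dbconfig
lines in slapd.conf. This is only a sketch: the numbers are placeholders,
not recommendations, and nothing in this thread confirms that raising them
cures the "Lock is no longer valid" error.

```
# Illustrative values only; size them from your own lock statistics.
set_lk_max_locks    3000
set_lk_max_lockers  1500
set_lk_max_objects  1500
```

Running db_stat -c -h against the database directory should show current
and maximum lock usage, which would tell you whether the existing limits
are being hit at all.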

>
>>> We're running openldap 2.3.35 with ITS#4924 and ITS#4925 patches with
>>> a bdb backend running 4.2.52 with all 6 recommended patches.
>>
>> I hope you mean 5, as there are only 5 listed on the Oracle site.
>
> As Quanah said, there are 6.
>
>>> The only DBCONFIG settings we currently have are:
>>>
>>> dbconfig      set_cachesize 0 67108864 1
>>> dbconfig      set_lg_regionmax 262144
>>> dbconfig      set_lg_bsize 2097152
>>
>> I take it dbconfig is a keyword you've added for this example, as
>> it's not valid.
>
> Sorry, I should have been more specific - this is in slapd.conf - look
> in the man page for slapd-bdb - this is just a way of getting
> directives into DB_CONFIG.

Yeah, my mistake. I forgot about that way.
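
As a sanity check on those three settings, here is a small sketch that
decodes them into readable sizes. It assumes the standard DB_CONFIG
argument order (set_cachesize takes gbytes, bytes, ncache); the helper
names human and decode are made up for this example.

```python
# Decode the dbconfig lines quoted above into human-readable sizes.
# Assumption: set_cachesize arguments are <gbytes> <bytes> <ncache>;
# the two log settings take a single byte count.

SETTINGS = [
    "dbconfig      set_cachesize 0 67108864 1",
    "dbconfig      set_lg_regionmax 262144",
    "dbconfig      set_lg_bsize 2097152",
]


def human(n):
    """Format a byte count using binary units (B, KiB, MiB, GiB)."""
    for unit in ("B", "KiB", "MiB", "GiB"):
        if n < 1024 or unit == "GiB":
            return f"{n:g} {unit}"
        n /= 1024


def decode(line):
    """Return (directive, size_in_bytes, ncache or None) for one line."""
    parts = line.split()
    if parts[0] == "dbconfig":  # strip the slapd.conf keyword if present
        parts = parts[1:]
    name, args = parts[0], [int(a) for a in parts[1:]]
    if name == "set_cachesize":
        gbytes, nbytes, ncache = args
        return name, gbytes * 1024**3 + nbytes, ncache
    return name, args[0], None


for line in SETTINGS:
    name, size, ncache = decode(line)
    extra = f" in {ncache} region(s)" if ncache is not None else ""
    print(f"{name}: {human(size)}{extra}")
```

That works out to a 64 MiB cache in one region, a 256 KiB log region
maximum and a 2 MiB log buffer.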

>
> Cheers
> Toby
>
