
bug?



Hi all,

There is a problem in OpenLDAP (up to 2.1.23) when using back-bdb
(back-ldbm doesn't suffer from this). It shows up after a certain number
of requests have been sent to slapd: performance degrades dramatically,
and the response time goes from a few milliseconds to 40 seconds and
more.

I reported this problem some time ago and suggested implementing a new
cache replacement policy. The reason I suspected the replacement policy
is that increasing the cachesize somehow makes the problem go away.
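
For reference, the knob I'm talking about is the per-database entry cache
in slapd.conf. Below is just an example of how I raised it; the suffix and
directory are placeholders and 10000 is an arbitrary figure, not a
recommendation:

database        bdb
suffix          "dc=example,dc=com"
directory       /var/lib/ldap
# entry cache, in number of entries; the default is 1000
cachesize       10000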

I therefore implemented a replacement policy based on ARC (see my previous
posting). The results were quite disappointing: the problem still exists.
The next thing I tried was to simulate a small delay in the existing LRU
code in slapd (since ARC has a slightly higher constant calculation
overhead). This somehow postpones the problem; it now occurs after about
40k requests instead of the usual 15k.

btw: I suggested ARC for PostgreSQL because there were problems with
vacuum and sequential scans. ARC does help a lot there; it is now in their
HEAD branch.

db_stat also shows a lot of locks not granted because of DB_LOCK_NOWAIT.
I traced this in the source and found that OpenLDAP does a lot of retrying
without waiting. I therefore applied the transaction backoff patch from
the HEAD branch. Overall performance is worse, but after the first 15k
requests you consistently get an answer in about 20 seconds. This doesn't
solve the problem entirely either, although it does look a lot better: a
constant, acceptable delay is better than ever-increasing delays.
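
To make clear what I mean by backing off, here is a minimal, self-contained
sketch of the idea. It is not the actual HEAD patch: do_txn_op() is only a
stub standing in for the real back-bdb transaction retry, and the delay
constants are arbitrary assumptions.

/* Illustrative sketch: retry a contended operation with exponential
 * backoff instead of retrying immediately. */
#include <stdio.h>
#include <unistd.h>

#define OP_OK          0
#define OP_NOTGRANTED  1   /* stand-in for DB_LOCK_NOTGRANTED */

/* Stub: pretend the first few attempts fail because a lock is held. */
static int do_txn_op(int attempt)
{
        return (attempt < 3) ? OP_NOTGRANTED : OP_OK;
}

static int op_with_backoff(void)
{
        useconds_t delay = 100;                 /* start at 100 us (assumed) */

        for (int attempt = 0; ; attempt++) {
                int rc = do_txn_op(attempt);
                if (rc != OP_NOTGRANTED)
                        return rc;
                usleep(delay);                  /* back off before retrying */
                if (delay < 100000)
                        delay *= 2;             /* cap the delay at ~100 ms */
        }
}

int main(void)
{
        printf("operation returned %d\n", op_with_backoff());
        return 0;
}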

After analysing the data I think I can assume there is some problem with
back-bdb (and the default cache size).

After 50k requests there were more than 9000 million locks not granted
because of DB_LOCK_NOWAIT. This also means that a large portion of these
retries will 'bash' the AVL tree inside OpenLDAP (from what I could see,
in most places items are first inserted into the AVL tree and, in case of
a failure due to locks, are removed again). Is this bashing on the AVL
tree what causes the dramatic performance drop?
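
To illustrate the insert-then-undo pattern I mean (purely illustrative,
this is not the back-bdb code): the little program below uses POSIX
tsearch()/tdelete() as a stand-in for OpenLDAP's internal AVL routines,
with a stub lock check that simply fails every 4th time.

#include <search.h>
#include <stdio.h>

static void *cache_root = NULL;   /* stand-in for the entry cache tree */

static int cmp(const void *a, const void *b)
{
        return (*(const long *)a > *(const long *)b)
             - (*(const long *)a < *(const long *)b);
}

/* Stub: pretend the no-wait lock request fails every 4th time. */
static int try_lock_nowait(long id)
{
        return (id % 4) ? 0 : -1;
}

int main(void)
{
        static long ids[50000];
        long undone = 0;

        for (long i = 0; i < 50000; i++) {
                ids[i] = i;
                tsearch(&ids[i], &cache_root, cmp);       /* insert first ... */
                if (try_lock_nowait(i) != 0) {
                        tdelete(&ids[i], &cache_root, cmp); /* ... undo on lock failure */
                        undone++;
                }
        }
        printf("%ld of 50000 inserts had to be undone\n", undone);
        return 0;
}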

As stated before, a larger cache size solves the problem, but that doesn't
mean there is nothing wrong with the system itself. I think the larger
cache size only masks the real problem. In the literature LRU is known to
perform badly against large sequential access patterns, but it should not
'die'. So I don't think it is a cache problem: a large sequential scan
only blows away the cache, which should cause a fairly constant delay.
(I'm seeing a somewhat linear to exponential growth in delay, for which I
don't have a logical explanation.)

The default cache size is 1000 entries, so it seems strange that the
system only starts to fail after 14k requests because of a too-small
cache. A request triggers about 3-5 entry cache lookups, so if the problem
were simply a small cache it should already show up after roughly 1000/5
to 1000/3, i.e. about 200-330 requests. (I simulated a worst-case
scenario: a long sequential stream of searches for different items all the
time.)

I really want to understand what I'm seeing, and I hope you can provide me
with more insight so that I/you can actually fix this problem.

How to reproduce this: very simple. Fill up a database with some 50k
random entries and write a little program that requests about 20k of them
one after the other (a sketch is included below). After about 14k requests
you should see the response time increase significantly.
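
Something along these lines is what I mean, using the old
ldap_init()/ldap_search_s() client calls that are fine for the 2.1
libraries. It assumes slapd runs on localhost, the suffix is
dc=example,dc=com and the entries are named uid=user0 ... uid=userN, so
adjust it to however you generated your data.

/* Fire off many sequential searches against slapd.
 * Build with something like: cc -o stress stress.c -lldap -llber */
#include <stdio.h>
#include <ldap.h>

int main(void)
{
        LDAP *ld = ldap_init("localhost", LDAP_PORT);
        LDAPMessage *res;
        char filter[64];

        if (ld == NULL || ldap_simple_bind_s(ld, NULL, NULL) != LDAP_SUCCESS) {
                fprintf(stderr, "cannot connect/bind\n");
                return 1;
        }

        for (int i = 0; i < 20000; i++) {
                snprintf(filter, sizeof(filter), "(uid=user%d)", i);
                if (ldap_search_s(ld, "dc=example,dc=com", LDAP_SCOPE_SUBTREE,
                                  filter, NULL, 0, &res) == LDAP_SUCCESS)
                        ldap_msgfree(res);
                /* time each request externally (e.g. with gettimeofday())
                 * to see where the response time starts to climb */
        }

        ldap_unbind(ld);
        return 0;
}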

regards
Cuong

btw: I used OpenLDAP 2.1.23 and BDB 4.1.25