
fine-grain locking in LDBM (relates to dissection of search latency)



I've been looking for places where threads in the slapd server spend a
lot of time waiting for a lock. I added a debugging routine in
ldap_pvt_mutex_lock() to tell me when a thread had to wait for another
thread to release a lock, and whether that wait lasted longer than, say,
20 msec.
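
Roughly, the debug routine does something like this (a minimal sketch in
plain pthreads; the real code sits inside OpenLDAP's thread wrappers, and
debug_mutex_lock and its report format are just illustrative):

#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>

/* Take the mutex, but report any wait longer than 20 msec.  A trylock
 * that succeeds means there was no contention at all, so the timing
 * code is skipped on the fast path. */
static int debug_mutex_lock( pthread_mutex_t *m, const char *name )
{
    struct timeval t0, t1;
    long waited;

    if ( pthread_mutex_trylock( m ) == 0 )
        return 0;                   /* no one held it: no wait */

    gettimeofday( &t0, NULL );
    pthread_mutex_lock( m );        /* contended: block, and time it */
    gettimeofday( &t1, NULL );

    waited = ( t1.tv_sec - t0.tv_sec ) * 1000000L
           + ( t1.tv_usec - t0.tv_usec );
    if ( waited > 20 * 1000L )
        fprintf( stderr, "mutex %s: waited %ld usec\n", name, waited );
    return 0;
}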

I found that threads wait most often on the LDBM cache lock,
cache->c_mutex.  This single mutex protects both the LRU linked list and
the AVL trees of Entry's held in memory, which save us from re-reading
LDIF off disk and converting it to an Entry struct on every access.  (The
BDB backend improves this step remarkably by storing entries on disk in a
form rather more evolved than LDIF.)

So what I tried was increasing the granularity of this locking (a rough
sketch of the resulting structure follows the list):

1) Separate the c_mutex lock into 3 locks:
     c_dntree to protect the DN-to-Entry AVL tree
     c_idtree to protect the ID-to-Entry AVL tree
     c_lrulock to protect the LRU linked lists

 This means that when one thread has to write-lock the DN AVL tree to
 insert an Entry, the ID AVL tree can still be used concurrently, and vice
 versa.  It also means the LRU can be updated while other threads search
 the AVL trees.


2) Make the DN and ID tree locks reader/writer locks, just like the Entry
 locks (ldbm_entry_info->lei_rdwr).  This means that on a cache that gets
 a lot of hits, many threads in search operations can read the AVL trees
 concurrently to find Entry's.


3) Leave the LRU lock a mutex, since there are few read-only operations
 on the LRU.  Also, the LRU code for a cache hit is only a dozen or so
 machine instructions, so it should be fast.

4) Slightly reorder some operations to work with the smaller locks.

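To make the split concrete, the cache structure ends up shaped roughly
like this (plain pthreads types for illustration; the real code uses the
ldap_pvt_thread wrappers, and everything but the three locks is elided):

#include <pthread.h>

typedef struct cache {
    pthread_rwlock_t c_dntree;   /* guards the DN -> Entry AVL tree */
    pthread_rwlock_t c_idtree;   /* guards the ID -> Entry AVL tree */
    pthread_mutex_t  c_lrulock;  /* guards the LRU linked lists     */
    /* ... AVL roots, LRU head/tail pointers, counters, etc. ... */
} Cache;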

I think this code shows a speed increase of a few percentage points
overall in the testing that I've done.  My test is a 16-threaded client
running for about half an hour through a list of searches.  The LDBM entry
cache can hold the entire working data set of about 40,000 entries, so I
get a high cache hit rate.  Sites with a different ratio of cache size to
data set size will see different results.

Some numbers on performance.

This mainly helps in cache_find_entry_id(), which is where Entry's are
fetched from the cache; it is called by id2entry_rw().  A single search
operation spent about 20-35% of its time in id2entry_rw() (depending on
how you count the time a thread spends in yield() after finishing an
operation).
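
On a cache hit, the lookup ends up shaped roughly like this (building on
the Cache sketch above; ID, Entry, idtree_lookup() and lru_move_to_front()
are hypothetical stand-ins for the AVL and LRU code in cache.c):

typedef unsigned long ID;                       /* stand-in for slapd's ID */
typedef struct entry Entry;
Entry *idtree_lookup( Cache *c, ID id );        /* AVL search, read-only   */
void   lru_move_to_front( Cache *c, Entry *e ); /* LRU touch on a hit      */

Entry *cache_find_entry_id_sketch( Cache *cache, ID id )
{
    Entry *e;

    /* Many searching threads can hold this read lock at once. */
    pthread_rwlock_rdlock( &cache->c_idtree );
    e = idtree_lookup( cache, id );
    /* (The real code would also take the entry's own lei_rdwr lock
     * here, before dropping the tree lock, so the entry can't be
     * evicted out from under the caller.) */
    pthread_rwlock_unlock( &cache->c_idtree );

    if ( e != NULL ) {
        /* The only exclusive section on a hit is the short LRU
         * move-to-front. */
        pthread_mutex_lock( &cache->c_lrulock );
        lru_move_to_front( cache, e );
        pthread_mutex_unlock( &cache->c_lrulock );
    }
    return e;
}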

My measurement was how much of its time id2entry_rw() spends calling
cache_find_entry_id(), with the old code versus the new fine-grain
locking:

   46% of its time with the single mutex locking
   37% of its time with the fine-grain locking



I'm now asking whether other people have benchmarks they could run against
this fine-grain locking (Jonghyuk Choi?).  This kind of testing is very
sensitive to the data set used in the test and to the type of client used.

A new back-ldbm/cache.c is available at

	http://nil.andrew.cmu.edu/cache.c

At the top of it are a few gdiff -u lines showing the additional lines
that have to be added to init.c and back-ldbm.h to create and initialize
the new locks.

If it works out well for other people, I can look at getting it into the
source repository.


-Mark Adamson
 Carnegie Mellon