
B-tree code

Hi Martin,
Just thought you'd like to know about a project I've been working on for a couple of months. My current code started from your append-only B-tree source, and it's just about in usable shape now:

   https://gitorious.org/mdb

Also, I'll be presenting the details at LDAPCon in Heidelberg this October.


I started with your code and removed the page cache; instead, the entire DB is accessed through a read-only mmap region. As such, there is no longer any cache management at the DB level (it's all handled by the OS/VM). I also removed the prefix-compression logic, because it made rebalancing/merging unreliable. The mmap approach avoids a ton of malloc/memcpy overhead, and it also makes overflow pages quite cheap to manage.
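To give a rough idea (this is just an illustrative sketch, not the actual mdb code), the zero-copy read path looks something like this: map the file read-only and hand out pointers directly into the mapping, leaving all caching to the OS page cache.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Write a tiny stand-in "database" file, map it read-only, and verify
 * that a "get" is just a pointer into the map.  Returns 0 on success. */
static int mmap_read_demo(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return -1;
    const char payload[] = "hello, mmap";
    if (write(fd, payload, sizeof(payload)) != (ssize_t)sizeof(payload))
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) return -1;
    char *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) return -1;

    /* Zero-copy "lookup": no malloc, no memcpy, just the mapping. */
    int ok = (strcmp(map, "hello, mmap") == 0);

    munmap(map, st.st_size);
    close(fd);
    unlink(path);
    return ok ? 0 : -1;
}
```

Since the mapping is PROT_READ, stray pointer writes in the application can't corrupt the DB either.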

Instead of writing a new meta-page at the tail of the file, I ping-pong between two meta pages at the head of the file. (Double-buffering.) This provides most of the MVCC benefits of the append-only approach, but without the wasted space or the need to search for the most recent meta page.
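The meta-page selection amounts to a couple of trivial comparisons. A hedged sketch, with illustrative structure and field names rather than mdb's actual ones: readers pick whichever of the two slots carries the higher transaction id, and a committer overwrites the *older* slot, so a torn commit can never destroy the newest good snapshot.

```c
#include <stdint.h>

typedef struct meta {
    uint64_t txnid;   /* id of the transaction that wrote this meta */
    uint64_t root;    /* page number of the B-tree root as of txnid */
} meta_t;

/* Readers: the newest valid snapshot is the slot with the larger txnid. */
static const meta_t *pick_meta(const meta_t m[2])
{
    return (m[0].txnid > m[1].txnid) ? &m[0] : &m[1];
}

/* Writers: commit by overwriting the older slot (double-buffering),
 * so the most recent snapshot always survives a crash mid-commit. */
static void commit_meta(meta_t m[2], uint64_t txnid, uint64_t root)
{
    meta_t *older = (m[0].txnid < m[1].txnid) ? &m[0] : &m[1];
    older->txnid = txnid;
    older->root  = root;
}
```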

I also added tracking of outstanding read transactions, and tracking of free pages. Reader tracking is done without locks; readers are never blocked when accessing the DB (unless the OS itself is busy servicing page faults). With the reader table, the writer can quickly check when a copied page is no longer referenced and re-use it, so the DB no longer grows without bound. This completely removes the need for the compaction logic. Since active data is never overwritten, the DB can never be corrupted, so no write-ahead logging is needed, nor any recovery procedures.
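The writer's check boils down to scanning the reader table for the oldest snapshot still in use; any page copied-on-write before that point is unreferenced and can be recycled from the free list. An illustrative sketch (the slot layout and names are hypothetical, not mdb's): each reader publishes the txnid of its snapshot in its own slot, so no locking is needed.

```c
#include <stdint.h>

#define SLOT_FREE UINT64_MAX   /* slot not currently held by a reader */

/* Writer side: find the oldest snapshot any reader still references.
 * Every page superseded before this txnid is safe to reuse. */
static uint64_t oldest_reader(const uint64_t slots[], int n,
                              uint64_t write_txn)
{
    uint64_t oldest = write_txn;
    for (int i = 0; i < n; i++)
        if (slots[i] != SLOT_FREE && slots[i] < oldest)
            oldest = slots[i];
    return oldest;
}
```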

I also adopted several ideas from BerkeleyDB, so that I can drop it into OpenLDAP more easily. The DB is now a "DB environment" with support for multiple databases within an environment. This was necessary because I didn't want to manage multiple separate mmaps for the many little index databases and other misc. usages. The free list is itself a sub-DB in the environment. I also added support for sorted-duplicate data items for a given key, which OpenLDAP's back-hdb relies on.
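The sorted-duplicates behavior is the same idea back-hdb relies on from BerkeleyDB's DB_DUPSORT: within one key, the data items are themselves kept in sorted order, so locating a specific (key, data) pair is a binary search rather than a linear scan. A toy sketch (not mdb's actual structures):

```c
#include <string.h>

/* Binary search among one key's sorted duplicate data items. */
static int dup_exists(const char *const dups[], int n, const char *data)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        int c = strcmp(dups[mid], data);
        if (c == 0) return 1;
        if (c < 0) lo = mid + 1; else hi = mid - 1;
    }
    return 0;
}
```

For an equality index this would be, e.g., a key like "cn=foo" mapping to a sorted list of entry IDs.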

I'm just now getting started adapting our back-hdb code to this mdb library. It looks like the new backend will be vastly simplified, both in actual code and in configuration, so it will be much friendlier to sysadmins, while at the same time offering performance superior to BerkeleyDB's and excellent reliability. Of course the code is still pretty raw, and I haven't done any heavy load testing on it yet, so it remains to be seen how much of the promise is realized.

I was originally targeting a design where the mmap resides at a fixed memory address. That way slapd can store its entries as-is, instead of flattening them into a storable structure. There's a hook for a relocation function, which would be used to relocate an entry if it gets shifted around during adds/deletes/rebalances. I haven't implemented this yet because I'm not sure it will actually work well in real use. For slapd it might be OK if all entries wind up in overflow pages, since those pages aren't touched by tree balancing activity. But if average entry sizes are small, it would become a serious hassle.
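The relocation hook would work along these lines (all names here are hypothetical, since this part isn't implemented yet): when a rebalance shifts an entry, the callback patches the entry's internal self-pointers by the distance the record moved.

```c
#include <stddef.h>
#include <string.h>

/* A record whose dn pointer refers into its own tail, as it would if
 * slapd stored entries as-is in a fixed-address map. */
typedef struct entry {
    char  name[8];
    char *dn;          /* points into this record's own tail */
    char  tail[16];
} entry_t;

/* Hypothetical relocation callback, invoked after the record has been
 * copied from oldaddr to its new location e: shift internal pointers
 * by the same distance the record moved. */
static void relocate_entry(entry_t *e, const void *oldaddr)
{
    ptrdiff_t delta = (const char *)e - (const char *)oldaddr;
    e->dn += delta;
}
```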

I'd be interested to hear your comments on this.
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/