
B-tree code

Hi Martin,
Just thought you'd like to know about a project I've been working on for a couple of months. My current code started from your append-only B-tree source, and it's just about in usable shape now:

   https://gitorious.org/mdb

Also, I'll be presenting the details at LDAPCon in Heidelberg this October.


I started with your code and removed the page cache; instead, the entire DB is accessed through a read-only mmap region. As such, there is no longer any cache management at the DB level (it's all handled by the OS/VM). I also removed the prefix-compression logic, because it made rebalancing/merging unreliable. The mmap approach avoids a ton of malloc/memcpy overhead, and it also makes overflow pages quite cheap to manage.
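To give a rough idea (this is just an illustrative sketch, not the actual mdb code), the zero-copy read path looks something like this: map the file read-only and hand out pointers directly into the mapping, leaving all caching to the OS page cache.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Write a tiny stand-in "database" file, map it read-only, and verify
 * that a "get" is just a pointer into the map.  Returns 0 on success. */
static int mmap_read_demo(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return -1;
    const char payload[] = "hello, mmap";
    if (write(fd, payload, sizeof(payload)) != (ssize_t)sizeof(payload))
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) return -1;
    char *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) return -1;

    /* Zero-copy "lookup": no malloc, no memcpy, just the mapping. */
    int ok = (strcmp(map, "hello, mmap") == 0);

    munmap(map, st.st_size);
    close(fd);
    unlink(path);
    return ok ? 0 : -1;
}
```

Since the mapping is PROT_READ, stray pointer writes in the application can't corrupt the DB either.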

Instead of writing a new meta-page at the tail of the file, I ping-pong between two meta pages at the head of the file. (Double-buffering.) This provides most of the MVCC benefits of the append-only approach, but without the wasted space or the need to search for the most recent meta page.
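The meta-page selection amounts to a couple of trivial comparisons. A hedged sketch, with illustrative structure and field names rather than mdb's actual ones: readers pick whichever of the two slots carries the higher transaction id, and a committer overwrites the *older* slot, so a torn commit can never destroy the newest good snapshot.

```c
#include <stdint.h>

typedef struct meta {
    uint64_t txnid;   /* id of the transaction that wrote this meta */
    uint64_t root;    /* page number of the B-tree root as of txnid */
} meta_t;

/* Readers: the newest valid snapshot is the slot with the larger txnid. */
static const meta_t *pick_meta(const meta_t m[2])
{
    return (m[0].txnid > m[1].txnid) ? &m[0] : &m[1];
}

/* Writers: commit by overwriting the older slot (double-buffering),
 * so the most recent snapshot always survives a crash mid-commit. */
static void commit_meta(meta_t m[2], uint64_t txnid, uint64_t root)
{
    meta_t *older = (m[0].txnid < m[1].txnid) ? &m[0] : &m[1];
    older->txnid = txnid;
    older->root  = root;
}
```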

I also added tracking of outstanding read transactions, and tracking of free pages. Reader tracking is done without locks; readers are never blocked when accessing the DB (unless the OS itself is busy servicing page faults). With the reader table, the writer can quickly check when a copied page is no longer referenced and re-use it, so the DB no longer grows without bound. This completely removes the need for the compaction logic. Since active data is never overwritten, the DB can never be corrupted, so no write-ahead logging is needed, nor any recovery procedures.
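The writer's check boils down to scanning the reader table for the oldest snapshot still in use; any page copied-on-write before that point is unreferenced and can be recycled from the free list. An illustrative sketch (the slot layout and names are hypothetical, not mdb's): each reader publishes the txnid of its snapshot in its own slot, so no locking is needed.

```c
#include <stdint.h>

#define SLOT_FREE UINT64_MAX   /* slot not currently held by a reader */

/* Writer side: find the oldest snapshot any reader still references.
 * Every page superseded before this txnid is safe to reuse. */
static uint64_t oldest_reader(const uint64_t slots[], int n,
                              uint64_t write_txn)
{
    uint64_t oldest = write_txn;
    for (int i = 0; i < n; i++)
        if (slots[i] != SLOT_FREE && slots[i] < oldest)
            oldest = slots[i];
    return oldest;
}
```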

I also adopted several ideas from BerkeleyDB, so that I can drop it into OpenLDAP more easily. The DB is now a "DB environment" with support for multiple databases within an environment. This was necessary because I didn't want to manage multiple separate mmaps for the many little index databases and other misc. usages. The free list is itself a sub-DB in the environment. I also added support for sorted-duplicate data items for a given key, which OpenLDAP's back-hdb relies on.
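The sorted-duplicates behavior is the same idea back-hdb relies on from BerkeleyDB's DB_DUPSORT: within one key, the data items are themselves kept in sorted order, so locating a specific (key, data) pair is a binary search rather than a linear scan. A toy sketch (not mdb's actual structures):

```c
#include <string.h>

/* Binary search among one key's sorted duplicate data items. */
static int dup_exists(const char *const dups[], int n, const char *data)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        int c = strcmp(dups[mid], data);
        if (c == 0) return 1;
        if (c < 0) lo = mid + 1; else hi = mid - 1;
    }
    return 0;
}
```

For an equality index this would be, e.g., a key like "cn=foo" mapping to a sorted list of entry IDs.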

I'm just now getting started adapting our back-hdb code to this mdb library. It looks like the new backend will be vastly simplified, both in actual code and in configuration, so it will be much friendlier to sysadmins, while at the same time offering performance superior to BerkeleyDB's and excellent reliability. Of course the code is still pretty raw, and I haven't done any heavy load testing on it yet, so it remains to be seen how much of the promise is realized.

I was originally targeting a design where the mmap resides at a fixed memory address. That way slapd can store its entries as-is, instead of flattening them into a storable structure. There's a hook for a relocation function, which would be used to relocate an entry if it gets shifted around during adds/deletes/rebalances. I haven't implemented this yet because I'm not sure it will actually work well in real use. For slapd it might be OK if all entries wind up in overflow pages, since those pages aren't touched by tree balancing activity. But if average entry sizes are small, it would become a serious hassle.
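The relocation hook would work along these lines (all names here are hypothetical, since this part isn't implemented yet): when a rebalance shifts an entry, the callback patches the entry's internal self-pointers by the distance the record moved.

```c
#include <stddef.h>
#include <string.h>

/* A record whose dn pointer refers into its own tail, as it would if
 * slapd stored entries as-is in a fixed-address map. */
typedef struct entry {
    char  name[8];
    char *dn;          /* points into this record's own tail */
    char  tail[16];
} entry_t;

/* Hypothetical relocation callback, invoked after the record has been
 * copied from oldaddr to its new location e: shift internal pointers
 * by the same distance the record moved. */
static void relocate_entry(entry_t *e, const void *oldaddr)
{
    ptrdiff_t delta = (const char *)e - (const char *)oldaddr;
    e->dn += delta;
}
```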

I'd be interested to hear your comments on this.
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/