back-mdb - futures...

To: OpenLDAP Devel <openldap-devel@openldap.org>
Subject: back-mdb - futures...
From: Howard Chu <hyc@symas.com>
Date: Sat, 16 May 2009 21:27:55 -0700
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; rv:1.9.1b5pre) Gecko/20090514 SeaMonkey/2.0a1pre Firefox/3.0.3

Just some thoughts on what I'd like to see in a new memory-based backend...

One of the complaints about back-bdb/hdb is the complexity in the tuning; there are a number of different components that need to be balanced against each other and the proper balance point varies depending on data size and workload. One of the directions we were investigating a couple years back was mechanisms for self-tuning of the caches. (This was essentially the thrust of Jong-Hyuk Choi's work with zoned allocs for the back-bdb entry cache; it would allow large chunks of the entry cache to be discarded on demand when system memory pressure increased.) Unfortunately Jong hasn't been active on the project in a while and it doesn't appear that anyone else was tracking that work. Self-tuning is still a goal but it seems to me to be attacking the wrong problem.

One of the things that annoys me with the current BerkeleyDB based design is that we have 3 levels of cache operating at all times - filesystem, BDB, and slapd. This means at least 2 memory copy operations to get any piece of data from disk into working memory, and you have to play games with the OS to minimize the waste in the FS cache. (E.g. on Linux, tweak the swappiness setting.)

Back in the 80s I spent a lot of time working on the Apollo DOMAIN OS, which was based on the M68K platform. One of their (many) claims to fame was the notion of a single-level store: the processor architecture supported a full 32 bit address space but it was uncommon for systems to have more than 24 bits worth of that populated, and nobody had anywhere near 1GB of disk space on their entire network. As such, every byte of available disk space could be directly mapped to a virtual memory address, and all disk I/O was done thru mmaps and demand paging. As a result, memory management was completely unified and memory usage was extremely efficient.

These days you could still take that sort of approach, though on a 32 bit machine a DB limit of 1-2GB may not be so useful any more. However, with the ubiquity of 64 bit machines, the idea becomes quite attractive again.

The basic idea is to construct a database that is always mmap'd to a fixed virtual address, and which returns its mmap'd data pages directly to the caller (instead of copying them to a newly allocated buffer). Given a fixed address, it becomes feasible to make the on-disk record format identical to the in-memory format. Today we have to convert from a BER-like encoding into our in-memory format, and while that conversion is fast it still takes up a measurable amount of time. (Which is one reason our slapd entry cache is still so much faster than just using BDB's cache.) So instead of storing offsets into a flattened data record, we store actual pointers (since they all simply reside in the mmap'd space).
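
To make the "return the mapped pages directly" half of this concrete, here is a minimal sketch in Python. It is only an illustration, not a proposed implementation: the length-prefixed record layout and the names write_records, open_map and get_record are made up for the example, and since Python's mmap module cannot request a fixed virtual address, the records are located by offset rather than by stored pointer. What it does show is a lookup that hands back a slice of the mapping itself instead of a freshly allocated copy.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical record layout: 4-byte little-endian length, then the payload.
HEADER = struct.Struct("<I")


def write_records(path, payloads):
    """Write length-prefixed records and return the file offset of each one."""
    offsets = []
    with open(path, "wb") as f:
        for payload in payloads:
            offsets.append(f.tell())
            f.write(HEADER.pack(len(payload)))
            f.write(payload)
    return offsets


def open_map(path):
    """Map the whole file read-only; the OS page cache is the only cache."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def get_record(view, offset):
    """Return a slice of the mapping itself, not a copy of the bytes."""
    (length,) = HEADER.unpack_from(view, offset)
    start = offset + HEADER.size
    return view[start:start + length]


def demo():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        offsets = write_records(path, [b"cn=alice", b"cn=bob"])
        m = open_map(path)
        view = memoryview(m)
        rec = get_record(view, offsets[1])
        # rec.obj is the mmap object: the record still lives in the mapping.
        result = (bytes(rec), rec.obj is m)
        # Views must be released before the mapping can be closed.
        rec.release()
        view.release()
        m.close()
        return result
    finally:
        os.remove(path)
```

The caller gets a view whose lifetime is tied to the mapping, which is the price of skipping the copy; a real engine would need rules for how long such a reference stays valid.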

Using this directly mmap'd approach immediately eliminates the 3 layers of caching and brings it down to 1. As another benefit, the DB would require *zero* cache configuration/tuning - it would be entirely under the control of the OS memory manager, and its resident set size would grow or shrink dynamically without any outside intervention.

It's not clear to me that we can modify BDB to operate in this manner. It currently supports mmap access for read-only DBs, but it doesn't map to fixed addresses and still does alloc/copy before returning data to the caller.

Also, while BDB development continues, the new development is mainly occurring in areas that don't matter to us (e.g. BDB replication) and the areas we care about (B-tree performance) haven't really changed much in quite a while. I've mentioned B-link trees a few times before on this list; they have much lower lock contention than plain B-trees and thus can support even greater concurrency. I've also mentioned them to the BDB team a few times and as yet they have no plans to implement them. (Here's a good reference:

http://www.springerlink.com/content/eurxct8ewt0h3rxm/ )

As such, it seems likely that we would have to write our own DB engine to pursue this path. (Clearly such an engine must still provide full ACID transaction support, so this is a non-trivial undertaking.) Whether and when we embark on this is unclear; this is somewhat of an "ideal" design and as always, "good enough" is the enemy of "perfect" ...

This isn't a backend we can simply add to the current slapd source base, so it's probably an OpenLDAP 3.x target: In order to have a completely canonical record on disk, we also need pointers to AttributeDescriptions to be recorded in each entry and those AttributeDescription pointers must also be persistent. Which means that our current AttributeDescription cache must be modified to also allocate its records from a fixed mmap'd region. (And we'll have to include a schema-generation stamp, so that if schema elements are deleted we can force new AD pointers to be looked up when necessary.) (Of course, given the self-contained nature of the AD cache, we can probably modify its behavior in this way without impacting any other slapd code...)
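
As a rough illustration of how a schema-generation stamp might guard persistent AD references, here is a toy sketch in Python. Everything in it is an assumption made for the example: the ADCache and StoredRef names, the use of an integer slot in place of a real pointer, the choice to keep the attribute name next to the slot so it can be looked up again, and the compaction on delete that makes old slots go stale. None of it reflects actual slapd code.

```python
from dataclasses import dataclass


@dataclass
class StoredRef:
    """Hypothetical persistent reference to an AttributeDescription."""
    name: str        # kept so the reference can be re-resolved (assumption)
    slot: int        # stands in for the persistent pointer
    generation: int  # schema generation when the slot was recorded


class ADCache:
    """Toy AttributeDescription cache with a schema-generation stamp."""

    def __init__(self):
        self.generation = 0
        self.slots = {}  # name -> slot, in definition order

    def define(self, name):
        self.slots[name] = len(self.slots)

    def delete(self, name):
        # Compact the remaining slots, so previously stored slots may now
        # point at the wrong description; bump the stamp to flag that.
        remaining = [n for n in self.slots if n != name]
        self.slots = {n: i for i, n in enumerate(remaining)}
        self.generation += 1

    def ref(self, name):
        return StoredRef(name, self.slots[name], self.generation)

    def resolve(self, ref):
        if ref.generation == self.generation:
            return ref.slot           # fast path: stored slot is still valid
        return self.slots[ref.name]   # stale stamp: look the name up again
```

For example, a reference to "mail" recorded at slot 2 resolves to slot 1 after "sn" is deleted, because the stamp mismatch forces a fresh lookup instead of trusting the stored slot.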

There's also a potential risk to leaving all memory management up to the OS - the native memory manager on some OS's (e.g. Windows) is abysmal, and the CLOCK-based cache replacement code we now use in the entry cache is more efficient than the LRU schemes that some older OS versions use. So we may get into this and decide we still need to play games with mlock() etc. to control the cache management. That would be an unfortunate complication, but it would still allow us to do simpler tuning than we currently need. Still, establishing a 1:1 correspondence between virtual memory addresses and disk addresses is a big win for performance, scalability, and reduced complexity (== greater reliability)...

(And yes, by the way, we have planning for LDAPCon2009 this September in the works; I imagine the Call For Papers will go out in a week or two. So now's a good time to pull up whatever other ideas you've had in the back of your mind for a while...)

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/

Follow-Ups:
- Re: back-mdb - futures...
  - From: Emmanuel Lecharny <elecharny@apache.org>
- Re: back-mdb - futures...
  - From: Francis Swasey <Frank.Swasey@uvm.edu>
- Re: back-mdb - futures...
  - From: Hallvard B Furuseth <h.b.furuseth@usit.uio.no>
