Re: back-mdb - futures...
- To: OpenLDAP Devel <openldap-devel@openldap.org>
- Subject: Re: back-mdb - futures...
- From: Emmanuel Lecharny <elecharny@apache.org>
- Date: Mon, 18 May 2009 09:40:00 +0200
- In-reply-to: <4A0F924B.6050405@symas.com>
- References: <4A0F924B.6050405@symas.com>
- User-agent: Thunderbird 2.0.0.21 (X11/20090318)
Howard Chu wrote:
Just some thoughts on what I'd like to see in a new memory-based
backend...
One of the complaints about back-bdb/hdb is the complexity in the
tuning; there are a number of different components that need to be
balanced against each other and the proper balance point varies
depending on data size and workload. One of the directions we were
investigating a couple years back was mechanisms for self-tuning of
the caches. (This was essentially the thrust of Jong-Hyuk Choi's work
with zoned allocs for the back-bdb entry cache; it would allow large
chunks of the entry cache to be discarded on demand when system memory
pressure increased.) Unfortunately Jong hasn't been active on the
project in a while and it doesn't appear that anyone else was tracking
that work. Self-tuning is still a goal but it seems to me to be
attacking the wrong problem.
One of the things that annoys me with the current BerkeleyDB based
design is that we have 3 levels of cache operating at all times -
filesystem, BDB, and slapd. This means at least 2 memory copy
operations to get any piece of data from disk into working memory, and
you have to play games with the OS to minimize the waste in the FS
cache. (E.g. on Linux, tweak the swappiness setting.)
Back in the 80s I spent a lot of time working on the Apollo DOMAIN OS,
which was based on the M68K platform. One of their (many) claims to
fame was the notion of a single-level store: the processor
architecture supported a full 32 bit address space but it was uncommon
for systems to have more than 24 bits worth of that populated, and
nobody had anywhere near 1GB of disk space on their entire network. As
such, every byte of available disk space could be directly mapped to a
virtual memory address, and all disk I/O was done thru mmaps and
demand paging. As a result, memory management was completely unified
and memory usage was extremely efficient.
These days you could still take that sort of approach, though on a 32
bit machine a DB limit of 1-2GB may not be so useful any more.
However, with the ubiquity of 64 bit machines, the idea becomes quite
attractive again.
The basic idea is to construct a database that is always mmap'd to a
fixed virtual address, and which returns its mmap'd data pages
directly to the caller (instead of copying them to a newly allocated
buffer). Given a fixed address, it becomes feasible to make the
on-disk record format identical to the in-memory format. Today we have
to convert from a BER-like encoding into our in-memory format, and
while that conversion is fast it still takes up a measurable amount of
time. (Which is one reason our slapd entry cache is still so much
faster than just using BDB's cache.) So instead of storing offsets
into a flattened data record, we store actual pointers (since they all
simply reside in the mmap'd space).
Using this directly mmap'd approach immediately eliminates the 3
layers of caching and brings it down to 1. As another benefit, the DB
would require *zero* cache configuration/tuning - it would be entirely
under the control of the OS memory manager, and its resident set size
would grow or shrink dynamically without any outside intervention.
It's not clear to me that we can modify BDB to operate in this manner.
It currently supports mmap access for read-only DBs, but it doesn't
map to fixed addresses and still does alloc/copy before returning data
to the caller.
Also, while BDB development continues, the new development is mainly
occurring in areas that don't matter to us (e.g. BDB replication) and
the areas we care about (B-tree performance) haven't really changed
much in quite a while. I've mentioned B-link trees a few times before
on this list; they have much lower lock contention than plain B-trees
and thus can support even greater concurrency. I've also mentioned
them to the BDB team a few times and as yet they have no plans to
implement them. (Here's a good reference:
http://www.springerlink.com/content/eurxct8ewt0h3rxm/ )
As such, it seems likely that we would have to write our own DB engine
to pursue this path. (Clearly such an engine must still provide full
ACID transaction support, so this is a non-trivial undertaking.)
Whether and when we embark on this is unclear; this is somewhat of an
"ideal" design and as always, "good enough" is the enemy of "perfect" ...
This isn't a backend we can simply add to the current slapd source
base, so it's probably an OpenLDAP 3.x target: In order to have a
completely canonical record on disk, we also need pointers to
AttributeDescriptions to be recorded in each entry and those
AttributeDescription pointers must also be persistent. Which means
that our current AttributeDescription cache must be modified to also
allocate its records from a fixed mmap'd region. (And we'll have to
include a schema-generation stamp, so that if schema elements are
deleted we can force new AD pointers to be looked up when necessary.)
(Of course, given the self-contained nature of the AD cache, we can
probably modify its behavior in this way without impacting any other
slapd code...)
There's also a potential risk to leaving all memory management up to
the OS - the native memory manager on some OS's (e.g. Windows) is
abysmal, and the CLOCK-based cache replacement code we now use in the
entry cache is more efficient than the LRU schemes that some older OS
versions use. So we may get into this and decide we still need to play
games with mlock() etc. to control the cache management. That would be
an unfortunate complication, but it would still allow us to do simpler
tuning than we currently need. Still, establishing a 1:1
correspondence between virtual memory addresses and disk addresses is
a big win for performance, scalability, and reduced complexity (==
greater reliability)...
That sounds interesting. Now, you may consider this other idea totally insane, but instead of writing your own DB engine, what about relying on the FS? We discussed this idea recently in the Apache Directory community (we have pretty much the same concern: three levels of cache is just overkill). So if you take Window$ out of the picture (or even if you keep it in the full picture), many existing Linux/Unix filesystems are already implemented using a B-tree (EXT3/4, BTRFS, even NTFS!). What about using this underlying FS to store entries directly, instead of building a special file that acts as an intermediate layer? The main issue will be managing the indexes, but that should not be a real problem. So every entry would be stored as a single file (could be in LDIF format :)
So far, this is just a discussion we are having, but it might be worth a try at some point...
Does it sound insane?
--
cordialement, regards,
Emmanuel Lécharny
www.iktek.com
directory.apache.org