
Re: back-mdb - futures...

To: OpenLDAP Devel <openldap-devel@openldap.org>
Subject: Re: back-mdb - futures...
From: Francis Swasey <Frank.Swasey@uvm.edu>
Date: Mon, 18 May 2009 09:52:33 -0400
In-reply-to: <4A0F924B.6050405@symas.com>
References: <4A0F924B.6050405@symas.com>
User-agent: Thunderbird 2.0.0.21 (Macintosh/20090302)

What protections would be needed to deal with the (hopefully infrequent) system crash if we are depending on the OS filesystem cache to get things from memory to disk? What would be the recovery mechanism in the worst-case system crash?


On 5/17/09 12:27 AM, Howard Chu wrote:

Just some thoughts on what I'd like to see in a new memory-based backend...
One of the complaints about back-bdb/hdb is the complexity in the tuning; there are a number of different components that need to be balanced against each other and the proper balance point varies depending on data size and workload. One of the directions we were investigating a couple years back was mechanisms for self-tuning of the caches. (This was essentially the thrust of Jong-Hyuk Choi's work with zoned allocs for the back-bdb entry cache; it would allow large chunks of the entry cache to be discarded on demand when system memory pressure increased.) Unfortunately Jong hasn't been active on the project in a while and it doesn't appear that anyone else was tracking that work. Self-tuning is still a goal but it seems to me to be attacking the wrong problem.
One of the things that annoys me with the current BerkeleyDB based design is that we have 3 levels of cache operating at all times - filesystem, BDB, and slapd. This means at least 2 memory copy operations to get any piece of data from disk into working memory, and you have to play games with the OS to minimize the waste in the FS cache. (E.g. on Linux, tweak the swappiness setting.)
Back in the 80s I spent a lot of time working on the Apollo DOMAIN OS, which was based on the M68K platform. One of their (many) claims to fame was the notion of a single-level store: the processor architecture supported a full 32 bit address space but it was uncommon for systems to have more than 24 bits worth of that populated, and nobody had anywhere near 1GB of disk space on their entire network. As such, every byte of available disk space could be directly mapped to a virtual memory address, and all disk I/O was done thru mmaps and demand paging. As a result, memory management was completely unified and memory usage was extremely efficient.
These days you could still take that sort of approach, though on a 32 bit machine a DB limit of 1-2GB may not be so useful any more. However, with the ubiquity of 64 bit machines, the idea becomes quite attractive again.
The basic idea is to construct a database that is always mmap'd to a fixed virtual address, and which returns its mmap'd data pages directly to the caller (instead of copying them to a newly allocated buffer). Given a fixed address, it becomes feasible to make the on-disk record format identical to the in-memory format. Today we have to convert from a BER-like encoding into our in-memory format, and while that conversion is fast it still takes up a measurable amount of time. (Which is one reason our slapd entry cache is still so much faster than just using BDB's cache.) So instead of storing offsets into a flattened data record, we store actual pointers (since they all simply reside in the mmap'd space).
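
A minimal sketch of the fixed-address idea described above, not an actual back-mdb design. Everything named here is hypothetical: the address DB_ADDR, the size DB_SIZE, the Header and Record layouts, and the db_* helpers are chosen only for illustration, and a 64 bit POSIX system is assumed. The file is mapped at a requested address (passed as a hint rather than forced with MAP_FIXED, so an existing mapping is never clobbered), records are written with real pointers into the region, and a later mapping at the same address can follow those pointers with no copy and no decode step.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical values, chosen for illustration only. */
#define DB_ADDR ((void *)0x200000000000ULL)
#define DB_SIZE ((size_t)1 << 20)

typedef struct Record {
    char *name;   /* real pointer into the mapped region, not an offset */
    char *value;
} Record;

typedef struct Header {
    size_t used;  /* bytes handed out so far, including this header */
    Record *root; /* most recently stored record */
} Header;

/* Map the file at DB_ADDR; fail rather than accept any other address. */
static void *db_open(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) return NULL;
    if (ftruncate(fd, (off_t)DB_SIZE) != 0) { close(fd); return NULL; }
    void *p = mmap(DB_ADDR, DB_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED) return NULL;
    if (p != DB_ADDR) { munmap(p, DB_SIZE); return NULL; }
    return p;
}

/* Bump allocator inside the mapping (no free list; illustration only). */
static void *db_alloc(Header *h, size_t n)
{
    assert(h->used >= sizeof(Header));
    size_t start = (h->used + 7) & ~(size_t)7;  /* keep 8-byte alignment */
    if (start + n > DB_SIZE) return NULL;
    h->used = start + n;
    return (char *)h + start;
}

static char *db_strdup(Header *h, const char *s)
{
    size_t n = strlen(s) + 1;
    char *p = db_alloc(h, n);
    if (p) memcpy(p, s, n);
    return p;
}

/* Store one record; its pointers stay valid on any later mapping at DB_ADDR. */
static int db_put(void *db, const char *name, const char *value)
{
    Header *h = db;
    if (h->used == 0) h->used = sizeof(Header);  /* a fresh file is zero-filled */
    Record *r = db_alloc(h, sizeof *r);
    if (!r) return -1;
    r->name = db_strdup(h, name);
    r->value = db_strdup(h, value);
    if (!r->name || !r->value) return -1;
    h->root = r;
    return 0;
}

/* Return the stored record in place: no copy, no decode. */
static const Record *db_get(void *db) { return ((Header *)db)->root; }

static void db_close(void *db) { munmap(db, DB_SIZE); }
```

A caller would db_open() a path, db_put() a record, db_close(), and on a later db_open() of the same path read the record back through db_get(). The sketch deliberately ignores transactions, crash safety, and the case where the requested address is already occupied.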
Using this directly mmap'd approach immediately eliminates the 3 layers of caching and brings it down to 1. As another benefit, the DB would require *zero* cache configuration/tuning - it would be entirely under the control of the OS memory manager, and its resident set size would grow or shrink dynamically without any outside intervention.
It's not clear to me that we can modify BDB to operate in this manner. It currently supports mmap access for read-only DBs, but it doesn't map to fixed addresses and still does alloc/copy before returning data to the caller.
Also, while BDB development continues, the new development is mainly occurring in areas that don't matter to us (e.g. BDB replication) and the areas we care about (B-tree performance) haven't really changed much in quite a while. I've mentioned B-link trees a few times before on this list; they have much lower lock contention than plain B-trees and thus can support even greater concurrency. I've also mentioned them to the BDB team a few times and as yet they have no plans to implement them. (Here's a good reference:
http://www.springerlink.com/content/eurxct8ewt0h3rxm/ )
As such, it seems likely that we would have to write our own DB engine to pursue this path. (Clearly such an engine must still provide full ACID transaction support, so this is a non-trivial undertaking.) Whether and when we embark on this is unclear; this is somewhat of an "ideal" design and as always, "good enough" is the enemy of "perfect" ...
This isn't a backend we can simply add to the current slapd source base, so it's probably an OpenLDAP 3.x target: In order to have a completely canonical record on disk, we also need pointers to AttributeDescriptions to be recorded in each entry and those AttributeDescription pointers must also be persistent. Which means that our current AttributeDescription cache must be modified to also allocate its records from a fixed mmap'd region. (And we'll have to include a schema-generation stamp, so that if schema elements are deleted we can force new AD pointers to be looked up when necessary.) (Of course, given the self-contained nature of the AD cache, we can probably modify its behavior in this way without impacting any other slapd code...)
There's also a potential risk to leaving all memory management up to the OS - the native memory manager on some OS's (e.g. Windows) is abysmal, and the CLOCK-based cache replacement code we now use in the entry cache is more efficient than the LRU schemes that some older OS versions use. So we may get into this and decide we still need to play games with mlock() etc. to control the cache management. That would be an unfortunate complication, but it would still allow us to do simpler tuning than we currently need. Still, establishing a 1:1 correspondence between virtual memory addresses and disk addresses is a big win for performance, scalability, and reduced complexity (== greater reliability)...
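
As a rough illustration of the "games with mlock()" mentioned above, the sketch below pins a byte range of an existing mapping so the OS pager cannot evict it. The helper names page_floor, pin_hot_range, and unpin_hot_range are hypothetical, and the policy for deciding which ranges are "hot" is left out entirely; only the POSIX calls mlock(), munlock(), and sysconf() are real. Locking can fail when the process exceeds its locked-memory limit (RLIMIT_MEMLOCK), so a caller would have to treat pinning as best-effort.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round an address down to the start of its page.
 * page_size must be a power of two. */
static uintptr_t page_floor(uintptr_t addr, uintptr_t page_size)
{
    return addr & ~(page_size - 1);
}

/* Ask the OS to keep [addr, addr+len) resident in RAM.
 * Returns 0 on success, or -1 with errno set if the OS refuses
 * (for example when RLIMIT_MEMLOCK would be exceeded). */
static int pin_hot_range(const void *addr, size_t len)
{
    assert(addr != NULL);
    long ps = sysconf(_SC_PAGESIZE);
    if (ps <= 0) { errno = EINVAL; return -1; }
    /* Some systems require a page-aligned start, so align it ourselves
     * and widen the span to still cover the requested bytes. */
    uintptr_t start = page_floor((uintptr_t)addr, (uintptr_t)ps);
    size_t span = (size_t)((uintptr_t)addr + len - start);
    return mlock((const void *)start, span);
}

/* Release a range previously pinned with pin_hot_range(). */
static int unpin_hot_range(const void *addr, size_t len)
{
    assert(addr != NULL);
    long ps = sysconf(_SC_PAGESIZE);
    if (ps <= 0) { errno = EINVAL; return -1; }
    uintptr_t start = page_floor((uintptr_t)addr, (uintptr_t)ps);
    size_t span = (size_t)((uintptr_t)addr + len - start);
    return munlock((const void *)start, span);
}
```

Note that locks are per page, not per byte range: unpinning one range also unlocks any neighbouring data that shares its first or last page, so a real cache manager would need to track pinned pages rather than pinned records.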
(And yes, by the way, we have planning for LDAPCon2009 this September in the works; I imagine the Call For Papers will go out in a week or two. So now's a good time to pull up whatever other ideas you've had in the back of your mind for a while...)



--
Frank Swasey                    | http://www.uvm.edu/~fcs
Sr Systems Administrator        | Always remember: You are UNIQUE,
University of Vermont           |    just like everyone else.
  "I am not young enough to know everything." - Oscar Wilde (1854-1900)
