Re: back-mdb notes
--On Saturday, March 05, 2011 5:05 AM -0800 Howard Chu <email@example.com> wrote:
Thought this was an interesting read:
Too bad he talks about his approach being "2006 era" programming. In fact
the single-level store is 1964-era, from Multics.
I guess they'll have to tweak Henry Spencer's quote ("Those who do not
understand UNIX are condemned to reinvent it, poorly.") to say Multics instead.
I've been working on a new "in-memory" B-tree library that operates on an
mmap'd file. It is a copy-on-write design; it supports MVCC, is immune to
corruption, and requires no recovery procedure. It is not an append-only
design, since that requires explicit compaction and is not amenable to
mmap usage. The append-only approach also requires total serialization of
write operations, which would be quite poor for write concurrency.
The current approach simply reserves space for two root node pointers and
flip-flops between them. So, multiple writes may be outstanding at once,
but commits are of course serialized; each commit causes the currently
unused root node pointer to become the currently valid root node pointer.
Transaction aborts are pretty much free; there's nothing to rollback.
Read transactions begin by snapshotting the current root pointer and then
can run without any interference from any other operations.
Public commits have been waiting for our official transition to git, but
since that's been going nowhere I will probably start publishing on
github.com in the next couple of weeks. (With St. Patrick's Day right
around the corner it may have to wait a bit.)
Unfortunately I realized that not all application-level caching can be
eliminated - with the hierarchical DB approach, we don't store full entry
DNs in the DB so they still need to be generated in main memory, and they
probably should be cached. But that's a detail to be addressed later; it
may well be that the cost of always constructing them on the fly (no
caching) is acceptable.
This backend should perform much better in all aspects (memory, CPU, and
I/O usage) than the current BerkeleyDB code. It eliminates two levels of
caching; entries pulled from the DB require zero decoding, readers
require no locks, and writes incur no write-ahead-logging overhead. There
are only two configurable parameters (the pathname to the DB file, and
the size) so this will be far simpler for admins.
Let me know when you think it is in a state where I can start doing some
perf testing with it. ;)
Sr. Member of Technical Staff
A Division of VMware, Inc.
Zimbra :: the leader in open source messaging and collaboration