
back-mdb notes

To: OpenLDAP-devel@openldap.org
Subject: back-mdb notes
From: Howard Chu <hyc@symas.com>
Date: Sat, 05 Mar 2011 05:05:41 -0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:2.0b11pre) Gecko/20110130 Firefox/4.0b11pre SeaMonkey/2.1b2pre

Thought this was an interesting read:

http://www.varnish-cache.org/trac/wiki/ArchitectNotes

Too bad he talks about his approach being "2006 era" programming. In fact the single-level store is 1964-era, from Multics.


http://en.wikipedia.org/wiki/Single_level_store

I guess they'll have to tweak Henry Spencer's quote ("Those who do not understand UNIX are condemned to reinvent it, poorly.") to Multics instead...

I've been working on a new "in-memory" B-tree library that operates on an mmap'd file. It is a copy-on-write design; it supports MVCC and is immune to corruption and requires no recovery procedure. It is not an append-only design, since that requires explicit compaction, and also is not amenable to mmap usage. Also the append-only approach requires total serialization of write operations, which would be quite poor for throughput.

The current approach simply reserves space for two root node pointers and flip-flops between them. So, multiple writes may be outstanding at once, but commits are of course serialized; each commit causes the currently unused root node pointer to become the currently valid root node pointer. Transaction aborts are pretty much free; there's nothing to roll back. Read transactions begin by snapshotting the current root pointer and then can run without any interference from any other operations.
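
To make the flip-flop concrete, here is a minimal single-threaded sketch. It is not the library's code: the struct names, the txn_id field used to decide which slot is valid, and the function names are all assumptions made for illustration, and it leaves out the locking and memory ordering a real implementation would need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: two root slots. The slot written by the most
 * recent commit (highest txn_id) is treated as the valid one. */
typedef struct {
    uint64_t txn_id;  /* id of the commit that wrote this slot (assumed field) */
    size_t   root;    /* page number of that commit's B-tree root */
} root_slot;

typedef struct {
    root_slot slots[2];
} db_meta;

/* Index of the currently valid slot. */
static int current_slot(const db_meta *m)
{
    assert(m != NULL);
    return m->slots[1].txn_id > m->slots[0].txn_id ? 1 : 0;
}

/* A read transaction begins by copying the valid slot. Later commits
 * cannot change this copy, so the reader keeps a stable view. */
static root_slot read_begin(const db_meta *m)
{
    return m->slots[current_slot(m)];
}

/* Commit: write the new root into the unused slot, then bump its
 * txn_id so that it becomes the valid slot. The previously valid
 * slot is left untouched. */
static void commit(db_meta *m, size_t new_root)
{
    int cur = current_slot(m);
    int unused = 1 - cur;
    m->slots[unused].root = new_root;
    m->slots[unused].txn_id = m->slots[cur].txn_id + 1;
}
```

In this sketch an abort needs no code at all: a writer that never calls commit() leaves both slots as they were, which matches the "nothing to roll back" property described above.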

Public commits have been waiting for our official transition to git, but since that's been going nowhere I will probably start publishing on github.com in the next couple of weeks. (With St. Patrick's Day right around the corner it may have to wait a bit.)

Unfortunately I realized that not all application-level caching can be eliminated - with the hierarchical DB approach, we don't store full entry DNs in the DB so they still need to be generated in main memory, and they probably should be cached. But that's a detail to be addressed later; it may well be that the cost of always constructing them on the fly (no caching) is acceptable.

This backend should perform much better in all aspects (memory, CPU, and I/O usage) than the current BerkeleyDB code. It eliminates two levels of caching; entries pulled from the DB require zero decoding; readers require no locks; and writes require no write-ahead-logging overhead. There are only two configurable parameters (the pathname to the DB file, and the size) so this will be far simpler for admins.

Potential downside - on a 32 bit machine with only 2GB of addressable memory the maximum usable DB size is around 1.6GB. On a 64 bit machine, I doubt the limits will pose any problem. ("64 bits should be enough for anyone...")

re: configuring the size of the DB file - this is most likely not a value that can be changed on an existing DB. I.e., if you configure a DB and find that you need to grow it later, you will probably need to slapcat/slapadd it again. At DB creation time the file is mmap'd with address NULL so that the OS picks the address, and the address is recorded in the DB. On subsequent opens the file is mmap'd at the recorded address. If the size is changed, and the process' address space is already full of other mappings, it may not be possible to simply grow the mapping at its current address. Since the DB records contain actual memory pointers based on the region address, any change in the mapping address would render the DB unusable.
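
The create/reopen sequence described above might look roughly like the following sketch. The db_header layout and the db_map and db_roundtrip_demo names are hypothetical. The sketch passes the recorded address to mmap() as a hint and rejects any other result; whether the real code would use MAP_FIXED instead is not stated here, so treat that choice as an assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical header stored at the start of the DB file. */
typedef struct {
    void  *base;    /* address the file was mapped at when created */
    size_t size;    /* configured size of the mapping */
    char  *record;  /* an absolute pointer to data inside the mapping */
} db_header;

/* Map the file. want == NULL lets the OS pick the address (creation);
 * otherwise the mapping must land exactly at want (reopen). */
static void *db_map(int fd, size_t size, void *want)
{
    assert(size > 0);
    void *p = mmap(want, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return NULL;
    if (want != NULL && p != want) {  /* address taken: stored pointers would dangle */
        munmap(p, size);
        return NULL;
    }
    return p;
}

/* Create a DB in a temporary file, close the mapping, then reopen it
 * at the recorded address. Returns 0 on success, -1 on any failure. */
static int db_roundtrip_demo(void)
{
    const size_t size = 1 << 20;
    FILE *f = tmpfile();
    if (f == NULL)
        return -1;
    int fd = fileno(f);
    if (ftruncate(fd, (off_t)size) != 0) { fclose(f); return -1; }

    /* Creation: the OS picks the address and we record it in the file. */
    db_header *h = db_map(fd, size, NULL);
    if (h == NULL) { fclose(f); return -1; }
    h->base = h;
    h->size = size;
    h->record = (char *)h + 4096;     /* a raw pointer stored in the DB */
    strcpy(h->record, "cn=example");
    munmap(h, size);

    /* Reopen: read the recorded address and size, then map there. */
    db_header saved;
    if (pread(fd, &saved, sizeof saved, 0) != (ssize_t)sizeof saved) {
        fclose(f);
        return -1;
    }
    h = db_map(fd, saved.size, saved.base);
    if (h == NULL) { fclose(f); return -1; }
    int ok = strcmp(h->record, "cn=example") == 0;  /* stored pointer still valid */
    munmap(h, saved.size);
    fclose(f);
    return ok ? 0 : -1;
}
```

The failure branch in db_map() is the case the paragraph above worries about: if something else already occupies the recorded range, the stored pointers cannot be used and the open has to fail.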

If this restriction turns out to be too impractical, we may have to resort to just storing array offsets, but that will then imply a decoding phase and the re-introduction of entry caching, which I really really want to avoid.
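
For comparison, the offset-based fallback could be sketched as below. The db_off type and both helper names are made up for illustration. The point is only that every stored reference has to be translated against the current base address before it can be used, which is the decoding step mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical on-disk reference: a byte offset from the start of the
 * mapping instead of an absolute pointer. */
typedef uint64_t db_off;

/* Encode: turn a pointer inside the mapping into an offset. */
static db_off ptr_to_off(const void *base, const void *p)
{
    assert((const char *)p >= (const char *)base);
    return (db_off)((const char *)p - (const char *)base);
}

/* Decode: resolve an offset against wherever the file is mapped now. */
static void *off_to_ptr(void *base, db_off off)
{
    return (char *)base + off;
}
```

With offsets the file can be mapped at any address, at the price of one off_to_ptr() call per reference followed.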

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/

