
Re: back-mdb - futures...

To: Anton Bobrov <Anton.Bobrov@sun.com>
Subject: Re: back-mdb - futures...
From: Howard Chu <hyc@symas.com>
Date: Mon, 18 May 2009 19:25:28 -0700
Cc: Emmanuel Lecharny <elecharny@apache.org>, OpenLDAP Devel <openldap-devel@openldap.org>
In-reply-to: <4A113709.30302@sun.com>
References: <4A0F924B.6050405@symas.com> <4A1110D0.7080609@nextury.com> <4A113709.30302@sun.com>
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; rv:1.9.1b5pre) Gecko/20090517 SeaMonkey/2.0a1pre Firefox/3.0.3

Emmanuel Lecharny wrote:

That sounds interesting. Now, you may consider another idea to be
totally insane, but instead of writing your own DB engine
implementation, what about relying on the FS? We discussed this
idea recently in the Apache Directory community (we have pretty much the
same concern: 3 levels of cache is just overkill). So if you take
Window$ out of the picture (and even if you keep it in the full
picture), many existing linux/unix FS are already implemented using a
BTree (EXT3/4, BTRFS, even NTFS!). What about using this underlying FS
to store entries directly, instead of building a special file which will
be an intermediate layer? The main issue will be to manage indexes, but
that should not be a real problem. So every entry will be stored as a
single file (could be in LDIF format :)

So far, this is just a discussion we are having, but that might be worth a
try at some point...

Does it sound insane?

In fact we already have a back-ldif which does exactly this, but it's not intended for real use. It was only written to serve as a vehicle for back-config. (I.e., we wanted a simple, zero-config persistent store that could still behave like an LDAP database in very specifically defined use cases.) I'm pretty sure we've documented that it's not recommended for general purpose use, although some folks seem to want to misuse it that way regardless.

One of the main downsides - any such backend requires a couple system calls to access any given entry, and that generally means at least a few context switches. No matter how wonderfully efficient the FS itself is, anything that forces you to switch context between user mode and kernel mode for every entry is always a loss.
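To make the per-entry cost concrete, here is a minimal sketch, not taken from back-ldif or any OpenLDAP code. It compares a one-file-per-entry lookup (open, read, close on every access) with a lookup in a single file that has been mapped into memory once. The entry names, the `.ldif` file naming, the `packed.db` file, and the in-memory offset index are all assumptions made up for illustration.

```python
import mmap
import os
import tempfile

# Hypothetical sample entries, used only for this illustration.
ENTRIES = {
    "cn=alice": b"dn: cn=alice,dc=example,dc=com\ncn: alice\n",
    "cn=bob": b"dn: cn=bob,dc=example,dc=com\ncn: bob\n",
}


def build_stores(directory):
    """Write each entry as its own file and also pack all entries into one file."""
    index = {}
    packed_path = os.path.join(directory, "packed.db")
    with open(packed_path, "wb") as packed:
        for name, data in ENTRIES.items():
            with open(os.path.join(directory, name + ".ldif"), "wb") as f:
                f.write(data)
            index[name] = (packed.tell(), len(data))  # (offset, length)
            packed.write(data)
    return packed_path, index


def read_entry_from_file(directory, name):
    """One file per entry: every lookup crosses into the kernel three times."""
    fd = os.open(os.path.join(directory, name + ".ldif"), os.O_RDONLY)  # open
    try:
        return os.read(fd, 65536)                                       # read
    finally:
        os.close(fd)                                                    # close


def read_entry_from_map(mapped, offset, length):
    """Mapped store: the slice is a memory copy and issues no system call.
    A page fault can still happen if the page is not resident."""
    return mapped[offset:offset + length]


def demo():
    with tempfile.TemporaryDirectory() as d:
        packed_path, index = build_stores(d)
        with open(packed_path, "rb") as f:
            # The mapping is set up once, not once per entry.
            mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            results = {}
            for name in ENTRIES:
                offset, length = index[name]
                results[name] = (
                    read_entry_from_file(d, name),
                    read_entry_from_map(mapped, offset, length),
                )
            return results
        finally:
            mapped.close()


if __name__ == "__main__":
    for name, (via_file, via_map) in demo().items():
        print(name, via_file == via_map)
```

Both paths return the same bytes; the difference is that the per-file path pays the user/kernel transitions on every lookup, while the mapped path pays its setup cost once.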

And no matter how wonderful these FSs are, to my knowledge none of them are using B-link trees, which means they all still have higher lock contention than necessary for reads, inserts, and deletes. In fact the only open B-link implementation I'm aware of is written in Java (bonus for you guys!), and some of the thornier issues of B-link management have only been solved in the past couple years. When I first started looking into them a few years ago the issue of Delete rebalancing hadn't actually been solved yet. This is all pretty new stuff. (In the original paper, the authors described how to do searches and inserts without any lock-coupling, which is a huge concurrency win. They had no solution for deletes though, and just allowed deleted nodes to accumulate in the tree.)
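For readers unfamiliar with the idea, the following is a simplified, single-threaded sketch of why B-link searches can avoid lock-coupling. Each node carries a high key and a link to its right sibling, so a reader that lands on a node whose keys have just been split away can simply follow the link to the right instead of relying on a lock held on the parent. The `Node` layout, the function names, and the hand-built "half-finished split" are assumptions for illustration only; real implementations also need atomic node reads and the insert and delete protocols, none of which is shown.

```python
import bisect
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    keys: List[int]
    leaf: bool
    children: List["Node"] = field(default_factory=list)  # internal nodes only
    values: List[str] = field(default_factory=list)       # leaf nodes only
    high_key: Optional[int] = None   # largest key this node may hold; None = unbounded
    right: Optional["Node"] = None   # right sibling on the same level


def search(root, key):
    """Descend without holding any parent state; chase right links when needed."""
    node = root
    while True:
        # If a split moved our key range to a sibling, follow the right link.
        while node.high_key is not None and key > node.high_key:
            node = node.right
        if node.leaf:
            i = bisect.bisect_left(node.keys, key)
            if i < len(node.keys) and node.keys[i] == key:
                return node.values[i]
            return None
        # children[i] covers keys <= keys[i]; the last child covers the rest.
        node = node.children[bisect.bisect_left(node.keys, key)]


def build_half_split_tree():
    """Leaf A held 10..40 and was split into A and B, but the parent has not
    yet been told about B. Only A's right link makes B reachable."""
    c = Node(keys=[60, 70], leaf=True, values=["v60", "v70"])
    b = Node(keys=[30, 40], leaf=True, values=["v30", "v40"], high_key=50, right=c)
    a = Node(keys=[10, 20], leaf=True, values=["v10", "v20"], high_key=20, right=b)
    # The parent still routes every key <= 50 to A.
    return Node(keys=[50], leaf=False, children=[a, c])


if __name__ == "__main__":
    root = build_half_split_tree()
    for k in (10, 30, 60, 25):
        print(k, search(root, k))
```

In the hand-built tree, a lookup for 30 is routed to leaf A by the stale parent, sees that 30 exceeds A's high key, and moves right to B, where the key is found.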

For a C implementation I'd try to re-use as much as possible of the existing BerkeleyDB code since it's quite mature and provides a lot of features we already like/want/need...


Anton Bobrov wrote:

i did try a dummy prototype a while back and it doesn't perform very well.
you end up incurring too much overhead and it doesn't pay off even when
underlying FS data is 100% cached. plus you can never truly control
what happens with FS cache, you can size and influence it in some ways
but you cannot guarantee your operation will hit cached data which does
make it difficult to deliver predictable response times, in other words
you're gonna have to accept I/O hits and widen your response window to the
worst case scenario for at least some percentage of operations. this can be
optimized and made more predictable on a black box where you control
the entire machine but moot otherwise. the FS was ZFS and just for the
record the perf didn't suck per se but didn't quite match traditional db
backends perf [especially with entry caches] either. i don't have slamd
comparison data anymore to show you unfortunately.

Also true, which is one of the reasons I wasn't too thrilled with Jong's original line of research here; it would degrade slapd's performance for the benefit of anything else on the box when other processes' resource demands increased. But in the face of a heavily overcommitted machine, all bets are off and you might as well go down gracefully instead of getting killed by OOM or somesuch.


--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/

