
Re: back-mdb notes



Quanah Gibson-Mount wrote:
--On Saturday, March 05, 2011 5:05 AM -0800 Howard Chu<hyc@symas.com>
wrote:

I've been working on a new "in-memory" B-tree library that operates on an
mmap'd file. It is a copy-on-write design; it supports MVCC and is immune
to corruption and requires no recovery procedure. It is not an
append-only design, since that requires explicit compaction, and also is
not amenable to mmap usage. Also the append-only approach requires total
serialization of write operations, which would be quite poor for
throughput.

My experience with back-(bdb/hdb) and syncrepl was that the only reliable way to
ensure consistent replication was to use delta-syncrepl, which... serializes
write operations.  In fact, not forcing serialized writes for
back-(bdb/hdb) was slower than serializing things, because of all the
contention in the database.  I understand this may not hold true for
back-mdb, but thought I would note that currently our best performance is
already achieved by serialization, write-wise.

I'm well aware of all of this, no need to remind me. Non-serialized writes in bdb/hdb tended to run into deadlocks all the time, and the retries are slow. (In fact, we intentionally slow them down with an exponential backoff. This feature is probably detrimental on a heavily loaded machine since the thread can't do any useful work during the backoff.)
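
For readers unfamiliar with the retry pattern mentioned above, here is a minimal sketch of retrying a deadlocked operation with exponential backoff. It is not the slapd or Berkeley DB code; `OP_DEADLOCK`, `txn_op_fn`, `retry_with_backoff` and `fail_n_times` are hypothetical names invented for illustration, and the delays are arbitrary. Note that the thread simply sleeps between attempts, which is the cost described above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical return codes; not OpenLDAP or Berkeley DB constants. */
enum { OP_OK = 0, OP_DEADLOCK = 1 };

/* Hypothetical operation type: one attempt at a write transaction. */
typedef int (*txn_op_fn)(void *ctx);

/* Delay before retry number `attempt` (0-based): base * 2^attempt, capped. */
static uint64_t backoff_delay_us(unsigned attempt, uint64_t base_us,
                                 uint64_t cap_us)
{
    assert(base_us > 0 && cap_us >= base_us);
    uint64_t delay = base_us;
    for (unsigned i = 0; i < attempt; i++) {
        if (delay > cap_us / 2)
            return cap_us;      /* doubling again would exceed the cap */
        delay *= 2;
    }
    return delay;
}

/* Run op; on a deadlock result, sleep and try again, up to max_retries
 * retries. Returns the result of the last attempt. */
static int retry_with_backoff(txn_op_fn op, void *ctx, unsigned max_retries,
                              uint64_t base_us, uint64_t cap_us)
{
    for (unsigned attempt = 0; ; attempt++) {
        int rc = op(ctx);
        if (rc != OP_DEADLOCK || attempt >= max_retries)
            return rc;
        uint64_t us = backoff_delay_us(attempt, base_us, cap_us);
        struct timespec ts = {
            .tv_sec = (time_t)(us / 1000000),
            .tv_nsec = (long)(us % 1000000) * 1000
        };
        nanosleep(&ts, NULL);   /* the thread does no useful work here */
    }
}

/* Demo operation: reports a deadlock until *remaining reaches zero. */
static int fail_n_times(void *ctx)
{
    unsigned *remaining = ctx;
    if (*remaining > 0) {
        (*remaining)--;
        return OP_DEADLOCK;
    }
    return OP_OK;
}
```

With a base of 100 microseconds and a cap of 1000, the successive delays would be 100, 200, 400, 800, and then 1000 for every later retry.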

I expect the occurrence of deadlocks using MVCC to be drastically reduced. Readers will never be the cause of deadlocks in mdb, so that's half the problem gone already. Writers will hold locks and be able to block each other, so that possibility remains.

re: configuring the size of the DB file - this is most likely not a value
that can be changed on an existing DB. I.e., if you configure a DB and
find that you need to grow it later, you will probably need to
slapcat/slapadd it again. At DB creation time the file is mmap'd with
address NULL so that the OS picks the address, and the address is
recorded in the DB. On subsequent opens the file is mmap'd at the
recorded address. If the size is changed, and the process' address space
is already full of other mappings, it may not be possible to simply grow
the mapping at its current address. Since the DB records contain actual
memory pointers based on the region address, any change in the mapping
address would render the DB unusable.
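
The address-recording scheme described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's actual code: `struct map_header`, `map_create` and `map_reopen` are hypothetical names, and the reopen step passes the recorded address as a hint and verifies the result, rather than forcing it with MAP_FIXED, so that an occupied address range is reported as a failure instead of overwriting other mappings.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical header stored at the start of the file. */
struct map_header {
    void *base;   /* address the region was mapped at when created */
    void *root;   /* example of a raw memory pointer kept in the file */
};

/* Create the file and let the OS choose the mapping address. */
static void *map_create(const char *path, size_t size)
{
    assert(size >= sizeof(struct map_header));
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED)
        return NULL;
    ((struct map_header *)p)->base = p;          /* record the address */
    msync(p, sizeof(struct map_header), MS_SYNC);
    return p;
}

/* Reopen the file at the recorded address; fail if that is not possible. */
static void *map_reopen(const char *path, size_t size)
{
    struct map_header hdr;
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;
    if (pread(fd, &hdr, sizeof hdr, 0) != (ssize_t)sizeof hdr) {
        close(fd);
        return NULL;
    }
    void *p = mmap(hdr.base, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED)
        return NULL;
    if (p != hdr.base) {
        /* Stored pointers would be wrong at any other address. */
        munmap(p, size);
        return NULL;
    }
    return p;
}
```

If `map_reopen` were called with a larger size and the range following the recorded address were already occupied, the kernel would be free to place the mapping elsewhere and the function would return NULL. That is the growth problem described above.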

How exactly does the DB file size for back-mdb relate to the existing size
of the database?  Do they have to match?

Not at all. This configures the maximum size that the DB may consume on disk. The DB can be any size below that limit, and can grow until it reaches it.

 I.e., is this more like the
DB_CONFIG cachesize, which can be more or less than the database size, or
are they supposed to be an exact match? We have plenty of customers who
have databases that are certainly not static in size.  Particularly if you
are using accesslog databases for delta-syncrepl or other operations.

Obviously it would be stupid to require them to match.
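
To make the distinction between the configured maximum and the actual size concrete, here is a toy page allocator. It is only a sketch: `struct db_space`, `db_alloc_page` and `db_size_bytes` are hypothetical names, and real page allocation in a copy-on-write B-tree is considerably more involved. The only point shown is that the limit is a ceiling, while the space in use starts small and grows.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for a size-limited database file. */
struct db_space {
    size_t page_size;   /* bytes per page */
    size_t max_bytes;   /* configured maximum size */
    size_t used_pages;  /* pages currently in use */
};

/* Hand out the next page number, or -1 once the limit is reached. */
static long db_alloc_page(struct db_space *db)
{
    assert(db->page_size > 0);
    size_t max_pages = db->max_bytes / db->page_size;
    if (db->used_pages >= max_pages)
        return -1;
    return (long)db->used_pages++;
}

/* Bytes actually consumed so far; always <= max_bytes. */
static size_t db_size_bytes(const struct db_space *db)
{
    return db->used_pages * db->page_size;
}
```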

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/