[Date Prev][Date Next] [Chronological] [Thread] [Top]

Re: LMDB: comparison contexts, lifetimes, composite keys

Lauri Alanko wrote:

I have some non-OpenLDAP-related LMDB questions and suggestions. I hope
this is the correct list for them. If not, please point to a better

This is explicitly documented as the correct list.
It seems that MDB_cmp_func is a plain binary function without a context
parameter. This makes it impossible to tune a comparison operation at
runtime (without libffi magic, anyway). In particular, it makes it
difficult to create bindings for the set_compare operation in other

It would be foolish to implement the comparator in a non-native compiled language.

Funnily enough MDB_rel_func _does_ have a context pointer, even though
it is not used yet. So would it be feasible to add a similarly
contextful version of the comparison function? (Arguably these two could
share the context pointer, it could just be mdb_dbi_set_userctx.)

Anything is feasible but this will never be done. The comparator is called a huge number of times in any DB operation. Adding an additional parameter to the calling sequence causes a significant, measurable slowdown.

Adding an optional API that other languages can use will bloat the code and cause the C code to slow down. That would be unacceptable.

The comparator is perhaps the single most performance-sensitive code in the entire LMDB library. Anything that slows it down will be rejected outright. (As your sugggestion has been, a few times already.)

Although the documentation in lmdb.h is generally good, it's a bit
wishy-washy when it comes to lifetimes of data. The mdb_get
documentation only says: "The memory pointed to by the returned values
is owned by the database". My understanding is that the pointers to the
key/data pair retrieved from the database are valid for the duration of
the current transaction. After the transaction is over, the page in
which the key/data pair was located may be reclaimed for other uses. If
this is correct, could it be reflected in the documentation a bit more

That is basically correct for read-only transactions. It is not necessarily true for read-write transactions, since you can contrive a sequence of ops in a single transaction that causes a memory page to be allocated and subequently deleted (using a particular sequences of puts/deletes).

Areas of the documentation that are under-specified are intentionally so. API-level documentation does not and should not expose such details of the underlying implementation. API users should not rely on such details. Overall the doc tells you to avoid long-lived transactions. That also means use the data as soon as you retrieve it and then move on.

The current behavior is a bit inconsistent, though, in that depending on
the operation, mdb_cursor_get may or may not update the `key` argument
to point to the key in the database. It would be much clearer if after
the call it _always_ pointed to the key in the database, even for
operations (like MDB_SET_KEY) where the value of the key would be the
same as the input argument.

The current behavior is precisely documented. MDB_SET does not return the key, MDB_SET_KEY does. If you want the key to always be returned, use MDB_SET_KEY and forget about MDB_SET.
Finally, what is the recommended way of dealing with composite keys?
Suppose I want to look up a value associated to a pair of keys (say, a
directory id and a filename). There are several ways of doing this:

* Encode the pair (key1, key2) into a single key, use that normally.

* Use MDB_DUPSORT, encode each (key2, value) as data for a duplicate key
key1. Make sure the encoding is such that the data are sorted by key2.
Then use MDB_GET_BOTH_RANGE to look up the nearest entry for a given
(key1, key2) pair and check it actually has the correct key2.

* Create a named database for each key1, store its name as the data for
key1. Store normal (key2, value) pairs in the named database,

Each of these has its problems. The first is somewhat wasteful,
especially if, say, key1 is long and often repeated. The second is what
I use, but it feels a bit kludgy: I'm supposed to use a key-value store
yet I still have to manually decode and encode key-value pairs to and
from their constituents. The third has lots of overhead if there are
very few entries for a given key1, and it requires us to manually manage
the lifetimes of the named databases.

My understanding is that MDB_DUPSORT internally does something close to
the third option (but with optimizations to avoid wasting pages for just
a few entries), but just using empty data items in the key-specific
subdatabase. Would it then be feasible to add direct support for
composite keys by allowing non-empty data there?

That was planned in the very beginning, but later dropped because we didn't have a use case for it.

You should just use MDB_DUPSORT and a custom dup comparator that ignores the value part that you appended to the key.

  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/