
Re: Issues arising from creating powerdns backend based on LMDB

Mark Zealey wrote:

Hi Howard, I've now got LMDB working with PowerDNS in place of Kyoto
Cabinet; nice and easy to do, thanks! Maximum DNS query throughput is a
little better, about 10-30% depending on the use case, but for us the main
gain is that you can have a writer running at the same time. I was
struggling a bit with how to push updates from a different process using
Kyoto. There are a few issues and things I'd like to comment on, though:

1) Can you update the documentation to explain what happens when I do an
mdb_cursor_del()? I am assuming it advances the cursor to the next
record (this seems to be the behaviour). However, there is some sort of
bug with this assumption. Basically, I have a loop which jumps
(MDB_SET_RANGE) to a key and then deletes records until the key no longer
matches what I'm looking for. So I do while(..) { mdb_cursor_del();
mdb_cursor_get(..., MDB_GET_CURRENT); }. This mostly works fine, but
roughly 1% of the time I get EINVAL back when I try MDB_GET_CURRENT
after a delete. This always seems to happen on the same records. I'm not
sure about the memory structure, but could it be something to do with
hitting a page boundary somehow invalidating the cursor?

That's exactly what it does, yes.

At the moment I just catch that and then do an MDB_NEXT to skip over
them, but this will be an issue for us in production. This is from Perl,
so it /may/ be that, or the version of LMDB shipped with it; however, the
Perl layer is a very thin wrapper and, looking at the code, I can only
think it comes from LMDB.

2) Currently, because Kyoto Cabinet didn't support multiple identical
keys, we don't use the DUP options. This leads to quite long keys
(1200-1300 bytes in some cases). In the future, it would be nice to have
a run-time key-length setting or something along those lines.

I don't foresee that ever happening. The maximum key size will always be constrained such that two nodes fit on a page. But we've added the mdb_env_get_maxkeysize() function so that in the future we can increase the limit; there's really no technical reason why it needs to be stuck at 511 bytes.

3) Perhaps an mdb_cursor_get_key() function (like Kyoto's) which doesn't
return the data, just the key? As in (2), we store all the data in the
key. I'm not sure how much of a performance difference this would make, though.

Two answers: In mdb_cursor_get, the data param can be NULL if you don't want the data. Also, since LMDB is zero-copy, all it's doing is storing a pointer value anyway, so the cost difference of returning the data is pretty much nil.

4) Creating a database with non-sequential keys is very slow (on 4 GB
databases, 2x slower than Kyoto: about 1h30, and it uses more memory). I
spent quite a bit of time looking at this in Perl and then in C.
Basically, I create a database, open one txn and then insert a bunch of
unordered keys. Up to about 500 MB it's nice and quick: from Perl, about
75k inserts/sec (slow mostly because it's reading from MySQL). However,
after that first 500 MB it starts flushing to disk. In the sequential
insert case the flush is very quick, 100-200 MB/sec or so. With
non-sequential inserts, however, I've seen it drop to around 4 or 5 MB/sec,
as it's writing data all over the disk rather than doing big sequential
writes. iostat shows the same ~200 write tps and 100% utilization, but
only 4-10 MB/sec actually being written.

However, even when it's not flushing (or when storing data on an SSD or
memdisk), after the first 500 MB performance drops off massively, to
perhaps 10-15k inserts/sec. At the same time, looking at `top`, once
resident memory hits about 500 MB, 'shared memory' starts being used and
resident memory just keeps on increasing. I'm not sure if this is some
kind of kernel accounting thing to do with mmap usage, but it doesn't
happen for sequential key inserts (for those, shared memory stays around
0 and resident stays at 500 MB). I'm using CentOS 6 with various kernels,
from the default up to 3.7.5, and the behaviour is the same. I don't
really know how to go about finding the root cause, but I'm pretty sure
that whilst the I/O is crippling it in places, there is something else
going on, of which the shared memory increase is a sign. I've also tried
the MDB_WRITEMAP option, which doesn't significantly affect performance
or memory usage.

None of the memory behavior you just described makes any sense to me. LMDB uses a shared memory map, exclusively. All of the memory growth you see in the process should be shared memory. If it's anywhere else then I'm pretty sure you have a memory leak. With all the valgrind sessions we've run I'm also pretty sure that *we* don't have a memory leak.
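On Linux, the figures top shows as RES and SHR correspond to the resident and shared fields of /proc/<pid>/statm (counted in pages), and resident pages of a file-backed mmap, such as LMDB's data file, count toward both. A small Linux-only sketch for logging the two figures from inside a loader, so growth in the map can be told apart from growth in private memory; read_statm() is a hypothetical helper.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper (Linux only): fill in this process's resident and
 * shared sizes in bytes from /proc/self/statm. RES minus SHR is the
 * resident private (anonymous) memory, e.g. the application's heap.
 * Returns 0 on success, -1 on failure. */
static int read_statm(unsigned long *res_bytes, unsigned long *shr_bytes)
{
    unsigned long resident, shared;
    long page = sysconf(_SC_PAGESIZE);
    FILE *f = fopen("/proc/self/statm", "r");

    if (f == NULL || page <= 0) {
        if (f != NULL)
            fclose(f);
        return -1;
    }
    /* Fields: size resident shared text lib data dt; skip `size`. */
    if (fscanf(f, "%*lu %lu %lu", &resident, &shared) != 2) {
        fclose(f);
        return -1;
    }
    fclose(f);
    *res_bytes = resident * (unsigned long)page;
    *shr_bytes = shared * (unsigned long)page;
    return 0;
}
```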

As for the random I/O, it also seems a bit suspect. Are you doing a commit on every key, or batching multiple keys per commit?

5) pkg-config files/RPMs would be really nice to have. Or do you expect it
to just be bundled with a project, as e.g. the Perl module does?

The OpenLDAP Project releases source code, period. Distros do whatever they do. FreeBSD and Debian have LMDB packages now; if you want RPMs I suggest you ask your distro provider.

  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/