
Re: LMDB questions



William Brown wrote:
Hi there,

I have a few questions about MDB, and I have some things I'd like to
work on.

The current name is LMDB.

The docs mention binary searching in a few places. It's not 100% clear,
but I assume this means a binary search of the keys within a B-tree
node, not that MDB is a BST.

There's no need to assume. https://symas.com/lmdb/technical/#pubs
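
For readers following the thread, the reading in the question (a binary search over the sorted keys inside one B-tree node) can be sketched as below. This is only an illustration under simplifying assumptions: the keys are plain ints in an array, and the name node_search and its signature are hypothetical, not LMDB's internal API.

```c
#include <stddef.h>

/* Binary search over the sorted keys of a single B-tree node.
 * Hypothetical helper for illustration; keys are plain ints here.
 * Returns the index of the first key >= `key` (nkeys if there is none)
 * and sets *exact to 1 when that key is an exact match, else 0. */
size_t node_search(const int *keys, size_t nkeys, int key, int *exact)
{
    size_t lo = 0, hi = nkeys;

    *exact = 0;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (keys[mid] < key)
            lo = mid + 1;   /* target lies to the right of mid */
        else
            hi = mid;       /* target is at mid or to its left */
    }
    if (lo < nkeys && keys[lo] == key)
        *exact = 1;
    return lo;
}
```

In a branch node the returned index would select which child to descend into; in a leaf it is either the matching entry or the insertion point. The tree as a whole stays a B-tree; only the lookup within each node is binary.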

How does MDB provide crash resilience on the free pages?

According to man, free() should only be called on memory from malloc
but I see that you use free on mmaped pages in mdb_dpage_free. There
must be something I'm missing here about this.

We obviously do not call free() on mmap'd pages. Mmap'd pages just sit there.

Anyway, I have two things I want to work on.

The simple one is when pages are moved from the txn free list to the
env free list (I hope that's correct), it would be good to call
madvise(MADV_REMOVE) on the data section.

No, it wouldn't.

The reason for this is that the madvise call will allow supported
filesystems to hole punch the sparse file, allowing space reclamation -
without MDB needing to worry about it!

Free space reclamation is just added overhead. The pages will be reused in a future transaction anyway; hole punching would just make the filesystem do more work reassigning them back to the DB later.

The much more invasive change I want to work on is page checksumming.

This already exists in LMDB 1.0, along with page-level encryption.

Basically there are 4 cases I have in mind

* No checksumming (today)
* Metadata checksumming only
* Metadata and data checksumming

These could be used in these scenarios:

* write checksums but don't verify them at run time
* write checksums, and only verify metadata on read (possibly a good
default option)
* write checksums, and verify metadata and data on read (slowest, but
has strong integrity properties for some applications)

And in all cases I want to add an "mdb_verify" command that would
assert all of these are also correct offline.

There are a few reasons for this

* Hardware is unreliable. RAM, disk, cables, even CPU cache memory can
all exhibit bit flips and other data loss. Changing a bit in a pointer
can cause damage to any data structure, and flows on to crashes or
silent corruption

IMO none of this is relevant. Data centers that require reliability will use ECC and redundant hardware. If you're not using these things, then clearly reliability isn't a high priority for you.

CPU caches are all ECC protected already, as are storage drives. The correct place to check for corruption above the drive is at the filesystem layer.

The main reason we added checksum support is as a side-effect of providing a space for the signature in authenticated encryption.

* Software is never perfect - checksumming allows detection of
overwrites of data from overflow or other mistakes that we as humans
all make.

By default, with a read-only memory map, unintended overwrites of data are not possible.

I'd opt to use something fast like crc32c (Intel provides hardware to
accelerate this with -march=native). The only issue I see is that this
would require an on-disk structure change because the current structs
don't have space for this - and the checksums have to be *first*.

The checksums can live wherever you want to put them. You just skip over a gap where you insert the result later. My inclination is to put them near the page header, but it depends on a few other decisions as well.

http://www.lmdb.tech/doc/group__internal.html#structMDB__page
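
The "skip over a gap" approach described above can be sketched as follows. This is a minimal illustration, not LMDB 1.0's actual layout or code: the page size, the checksum offset, and the names page_checksum, page_seal and page_verify are all assumptions made up for the example, and the CRC32C is a plain bitwise software version rather than a hardware-accelerated one.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PAGE_SIZE   4096u  /* assumed page size */
#define DEMO_CSUM_OFFSET 16u    /* hypothetical offset of the checksum slot */
#define DEMO_CSUM_SIZE   4u     /* 32-bit checksum */

/* Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
 * Takes and returns the raw running state so spans can be chained. */
uint32_t crc32c_update(uint32_t crc, const unsigned char *buf, size_t len)
{
    while (len--) {
        crc ^= *buf++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* One-shot CRC32C of a buffer. */
uint32_t crc32c(const unsigned char *buf, size_t len)
{
    return crc32c_update(0xFFFFFFFFu, buf, len) ^ 0xFFFFFFFFu;
}

/* Checksum the whole page except the slot that stores the checksum. */
uint32_t page_checksum(const unsigned char *page)
{
    uint32_t crc = 0xFFFFFFFFu;

    /* bytes before the gap */
    crc = crc32c_update(crc, page, DEMO_CSUM_OFFSET);
    /* bytes after the gap */
    crc = crc32c_update(crc, page + DEMO_CSUM_OFFSET + DEMO_CSUM_SIZE,
                        DEMO_PAGE_SIZE - DEMO_CSUM_OFFSET - DEMO_CSUM_SIZE);
    return crc ^ 0xFFFFFFFFu;
}

/* Compute the checksum and store it in the gap. */
void page_seal(unsigned char *page)
{
    uint32_t sum = page_checksum(page);
    memcpy(page + DEMO_CSUM_OFFSET, &sum, sizeof sum);
}

/* Return nonzero if the stored checksum matches the page contents. */
int page_verify(const unsigned char *page)
{
    uint32_t stored;
    memcpy(&stored, page + DEMO_CSUM_OFFSET, sizeof stored);
    return stored == page_checksum(page);
}
```

Because the slot itself is excluded from the computation, writing the result into it does not change the checksum, so the slot can sit at any fixed offset in the page.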

The checksum would have to be the first value *or* the last value of
the page header, (so that it can be updated without affecting the
result of the checksum). The checksum for the data would have to be
within the header so that this is asserted as correct.

Is this something I should pursue? Would this require an on-disk format
change? Is there something that could be done to avoid this?

I see no way to do this without a format change. This is why the feature has waited till LMDB 1.0 for rollout. There are multiple format-changing features in LMDB 1.0 and the on-disk format is still in flux there.

If you want to discuss this further, we should use the openldap-devel mailing list instead.


Thanks,

William





--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/