
MDB questions



Hi there,

I have a few questions about MDB, and I have some things I'd like to
work on.

The docs mention binary searching in a few places. It's not 100% clear,
but I assume this means a binary search of the keys within a B-tree
node, not that MDB is a BST.
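
A rough sketch of what "binary search of the keys in a node" could look
like, purely for illustration. It assumes a simplified node that is just a
sorted, flat array of fixed-width keys; KEY_LEN and node_search are
made-up names, and MDB's real page layout and comparison functions are
different.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed key width; real keys are variable length. */
#define KEY_LEN 8

/* Binary search over nkeys sorted keys stored back to back in `keys`.
 * Returns the index of the first key >= target, or nkeys if every key is
 * smaller. *exact is set to 1 if the key at that index equals target. */
size_t node_search(const unsigned char *keys, size_t nkeys,
                   const unsigned char *target, int *exact)
{
    size_t lo = 0, hi = nkeys;

    assert(exact != NULL);
    *exact = 0;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int cmp = memcmp(keys + mid * KEY_LEN, target, KEY_LEN);
        if (cmp < 0) {
            lo = mid + 1;          /* target is to the right of mid */
        } else {
            if (cmp == 0)
                *exact = 1;        /* a match exists at or left of mid */
            hi = mid;
        }
    }
    return lo;
}
```

The tree as a whole stays a B-tree; the binary search only picks a slot
inside one node, which then names either the entry itself or the child to
descend into.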

How does MDB provide crash resilience on the free pages?

According to the man page, free() should only be called on memory
obtained from malloc(), but I see that you call free() on mmapped pages
in mdb_dpage_free. There must be something I'm missing here.

Anyway, I have two things I want to work on.

The simple one is when pages are moved from the txn free list to the
env free list (I hope that's correct), it would be good to call
madvise(MADV_REMOVE) on the data section. 

The reason for this is that the madvise call will allow supported
filesystems to hole punch the sparse file, allowing space reclamation -
without MDB needing to worry about it!
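
A minimal sketch of the madvise idea, assuming Linux and a shared,
writable file mapping (MADV_REMOVE requires both, plus a filesystem that
supports hole punching). punch_free_pages and its parameters are
hypothetical names, not anything in MDB, and psize is assumed to be a
multiple of the OS page size so that the start address stays aligned.

```c
#define _DEFAULT_SOURCE   /* exposes madvise() and MADV_REMOVE on glibc */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: tell the kernel that npages pages, starting at page
 * number pgno of the mapping at map_base, no longer hold useful data, so
 * the backing store for that range may be released.
 * Returns 0 on success, -1 with errno set on failure. */
int punch_free_pages(void *map_base, size_t pgno, size_t npages, size_t psize)
{
    assert(psize != 0);
    if (npages == 0)
        return 0;                  /* nothing to release */
#ifdef MADV_REMOVE
    char *start = (char *)map_base + pgno * psize;
    return madvise(start, npages * psize, MADV_REMOVE);
#else
    (void)map_base;
    (void)pgno;
    errno = ENOSYS;                /* platform without MADV_REMOVE */
    return -1;
#endif
}
```

A failure here would not need to be fatal: the pages remain valid free
pages, they just keep occupying space on disk.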

The much more invasive change I want to work on is page checksumming.
Basically there are three cases I have in mind:

* No checksumming (today)
* Metadata checksumming only
* Metadata and data checksumming

These could be used in these scenarios:

* write checksums but don't verify them at run time
* write checksums, and only verify metadata on read (possibly a good
default option)
* write checksums, and verify metadata and data on read (slowest, but
has strong integrity properties for some applications)
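
The three read-time scenarios above could be expressed as a small policy
check, sketched below. All names are hypothetical, and which pages count
as "metadata" is left open: the caller simply tags each page.

```c
#include <assert.h>

/* Hypothetical verification modes; checksums are written in all of them. */
enum csum_verify {
    CSUM_VERIFY_NONE,   /* never check checksums at run time */
    CSUM_VERIFY_META,   /* check metadata pages only */
    CSUM_VERIFY_ALL     /* check metadata and data pages */
};

/* Hypothetical page classification supplied by the caller. */
enum page_kind { PAGE_KIND_META, PAGE_KIND_DATA };

/* Returns 1 if a page of the given kind should be verified on read. */
int should_verify(enum csum_verify mode, enum page_kind kind)
{
    switch (mode) {
    case CSUM_VERIFY_NONE: return 0;
    case CSUM_VERIFY_META: return kind == PAGE_KIND_META;
    case CSUM_VERIFY_ALL:  return 1;
    }
    assert(!"unknown verify mode");
    return 0;
}
```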

And in all cases I want to add an "mdb_verify" command that would
check all of these checksums offline.

There are a few reasons for this:

* Hardware is unreliable. RAM, disks, cables, and even CPU cache memory
can all exhibit bit flips and other data loss. A flipped bit in a
pointer can damage any data structure, leading to crashes or silent
corruption.
* Software is never perfect. Checksumming allows detection of overwrites
of data from overflows or other mistakes that we as humans all make.

I'd opt to use something fast like crc32c (Intel CPUs provide a hardware
instruction for this, available when building with e.g. -march=native).
The only issue I see is that this would require an on-disk structure
change, because the current structs don't have space for this, and the
checksums have to be *first* (or last) in the header.

http://www.lmdb.tech/doc/group__internal.html#structMDB__page

The checksum would have to be the first value *or* the last value of
the page header (so that it can be updated without affecting the
result of the checksum). The checksum for the data would have to be
within the header so that the header checksum covers it as well.
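
A minimal sketch of the "checksum first" variant, assuming a made-up
layout where the first 4 bytes of a page hold a CRC-32C of everything
after them. page_seal and page_verify are hypothetical names and this is
not MDB's page format. The CRC is a plain bitwise software version; a
hardware-accelerated one would compute the same value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
uint32_t crc32c(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

#define CSUM_LEN sizeof(uint32_t)

/* Compute the checksum over everything after the checksum field and store
 * it at the start of the page (host byte order, for simplicity). */
void page_seal(unsigned char *page, size_t psize)
{
    assert(psize > CSUM_LEN);
    uint32_t sum = crc32c(page + CSUM_LEN, psize - CSUM_LEN);
    memcpy(page, &sum, CSUM_LEN);
}

/* Returns 1 if the stored checksum matches the page contents, else 0. */
int page_verify(const unsigned char *page, size_t psize)
{
    uint32_t stored;
    assert(psize > CSUM_LEN);
    memcpy(&stored, page, CSUM_LEN);
    return stored == crc32c(page + CSUM_LEN, psize - CSUM_LEN);
}
```

Because the checksum field sits outside the range it covers, it can be
rewritten after any change to the page without feeding back into its own
value.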

Is this something I should pursue? Would this require an on-disk format
change? Is there something that could be done to avoid this?


Thanks,

William