[Date Prev][Date Next] [Chronological] [Thread] [Top]

Re: LMDB stuff

Le 5/5/14 1:46 PM, Hallvard Breien Furuseth a Ãcrit :
> On 05/02/2014 11:45 AM, Emmanuel LÃcharny wrote:
>> I don't know what's inside the Free List in LMDB, but I can only assume
>> it's closed to what we have in Mavibot. Let me explain what we do (or
>> will do !) in Mavibot, and what data structure we use for that.
> Thanks.
> > (...)
>> I don't know how it fits with LMDB implementation, I just wanted to give
>> you an insight on what we are working on. Hope it helps.
> LMDB's freeDB is simpler: A mapping
>    {ID of transaction which freed the pages => [page numbers]}.

We used a tree because we need to be able to find fast the pages copied
by a given transaction for a specific BTree. If we are to look for any
older transaction belonging to the same btree, {ID, page numbers} is not

> A write transaction may only use page numbers from freeDB records
> older than the oldest snapshot.
Yes, this is the current state. In Mavibot, we distinguish the elements
in the copiedPagesBTree with the pages in teh freePages list (which is a
simple linked list). The rational is that it's less costly to grap a
page from a simple linked list than from a b-tree, assuming that some
pages will be freed without having to be put into the copiedPagesBtree :
for instance, the root page can always be released if there is no read
transaction using the N-1 version, or the pages used by the b-tree
headers (same reason).

At this point, those 'optimisation' are quite pulled out of thin air, I
must admit.

> It scans the reader table for
> that:  Each read-only txn has a slot where it puts its snapshot's
> txnID.  
W edo have the same thing in memory.

> Also the datafile has metapages for the last 2 snapshots.

We have two pointers in the RecordManager header (RMH), one on the
current BtreeOfBtrees and CopiedPagesBtree, and one for the previous
version of those guys.

We always keep a track to the previous pointers until we have done the
cleanup, then we rewrite the RecordManager header with only the current
If we have a crash before the first RMH write, we restart with the
current version (which is the version before we updated whatever btrees).
If the crash occurs in the middle of the two RMH writes, then at
startup, we discover that we have the previous and current pointer, so
we know that we have to restart the cleanup
Otherwise, we are safe with the new version.
In any case, we are in a state we can restart with no breakage, except
that we may have lost the latest modification, and in a minimal time.

I do think that LMDB and Mavibot are quite similar in the base algorithm
(there is not much room for many smart and different approaches here ;),
but the implementation *is* different, due to the fact that we are in
Java, when LMDB is in C. Typically, serialization/deserialization is a
real burden for us, so is the cache. We do'nt use MemoryMappedFile atm,
but we could.

> There are no concurrent write txns.

Emmanuel LÃcharny