[Date Prev][Date Next] [Chronological] [Thread] [Top]

Re: Freeing unused pages between two revisions...

On 15/12/14 22:26, Howard Chu wrote:
Yes, finding that gap is the tricky part now. I'm guessing we'll need a
freeDB format change to have enough info to do it, though maybe not.
Currently we record that in txn X, some list of pageIDs was freed. In
LMDB 1.0 we will also have per-page txnID stamps, to tell us when a page
was written.

Preserving the range from [txnO,txnL] pretty much means we cannot reuse
any pages up to txnL's free list. I.e., the page's stamp must be > txnL,
it must have been written at a point no txn in [txnO,txnL] could have
seen it. In [txnL+1,txnE] we can reuse all the free pages whose stamp is
 > txnL. Perhaps we don't need a freeDB change.

Sure we do.  The freeDB should have a more efficient format anyway,
for overflow pages.  Peeking at a lot of old pages and rejecting
them can be a cache massacre.  And a file page in the middle of
a multi-page overflow page has no header to peek at anyway.

Also: A transaction which must inspect a page to see if it can be
reused, must not overwrite that info in the page.  The txn might
abort or get rolled back by a system crash.  Instead, it can move
the page to a "reusable" list and commit.  Then we wait one or two
transactions for sync to metapages, and then we can reuse the page.

This may not be needed when the newer txnid in the page header is
sufficient info, but I suggest we don't try to keep that straight
yet: Can't overwrite if the page ends up inside an overflow page.
Also the newer txnid will not always imply it can be reused.

Anyway, consider a page written by transaction ID (revision) W and
freed (no longer seen) by txnid F: Visible to snapshots [W..F-1].
Transaction F writes freeDB record {F: list of pages freed by F}.

To F, oldest snapshot "S" which can see the page = max(W, oldest
current snapshot).  Store (F-S) along with the page number, it
will often fit in the unused bits of the page number.  If it
doesn't, we can fall back to some expanded format, or peek at the
page later, or return to the old way: Wait for all older readers.

Page numbers have 16 unused bits on 64-bit CPUs, log2(page size)
unused bits on 32-bit CPUs.  Use at least one bit to indicate a
multi-page overflow page, or that the expanded format is used and
this includes both age and number of file pages.  (Overflow pages
are used for nodes bigger than half a page.)

Once we have determined that a page can be reused, it can move to
a freeDB record format which doesn't care about old snapshot IDs.
Unless it does get reused at once, of course.

About MDB_RDONLY transactions: They just store their snapshot's
transaction ID in the reader table and clear it on abort/commit.

Write transactions scan the reader table and do all the thinking.
When they scan the table to make a list of all readers, they can
save the youngest ones in a bit-vector indexed by transaction ID
and the ones which do not fit there in a list.  If most readers
are ephemeral, that list will be short.  If we sort the pages in
a freeDB record by (F-S) above, it's easy to check the youngest
freeDB records against the bitmap to look for pages we can use.