Issue 8475 - Feature request: MDB low durability transactions
Summary: Feature request: MDB low durability transactions
Status: UNCONFIRMED
Alias: None
Product: LMDB
Classification: Unclassified
Component: liblmdb (show other issues)
Version: unspecified
Hardware: All
OS: All
Importance: --- normal
Target Milestone: ---
Assignee: OpenLDAP project
Reported: 2016-08-06 15:38 UTC by bentrask@comcast.net
Modified: 2020-03-13 19:16 UTC

Description bentrask@comcast.net 2016-08-06 15:38:02 UTC
Full_Name: Ben Trask


Hi,

Transaction commits are one of the few bottlenecks in MDB, because it has to
fsync twice, sequentially.

I think MDB could support mixed low- and high-durability transactions in the
same database by adding per-page checksums and a third root page. The idea is
that when committing a low-durability transaction, no fsyncs are performed.
One of the three roots is always kept consistent on-disk, to serve as a
"backstop" in the event of power loss or kernel panic. The other two roots
are allowed to alternate, with consistency ensured only by checksums.

During recovery after a crash, the non-durable roots (if any) are compared
against the third, durable root. Any newer pages are checksummed to verify
integrity. In the worst case, the durable root is used.
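
A minimal sketch of that recovery pass, in C. All names here are
hypothetical; none of this is LMDB API:

    #include <stdint.h>

    /* Three candidate roots; roots[durable_idx] is the fsync'd backstop. */
    typedef struct Root {
        uint64_t txnid;   /* commit id of this root */
    } Root;

    /* Hypothetical helper: walk the pages reachable from 'r' that are
     * newer than the backstop and verify each one's checksum. */
    static int verify_new_pages(const Root *r, const Root *backstop)
    {
        (void)r; (void)backstop;
        return 0;   /* stub; returning 0 here models the worst case */
    }

    /* Pick the newest root whose new pages all check out; in the worst
     * case, fall back to the durable backstop. */
    static const Root *recover(const Root roots[3], int durable_idx)
    {
        const Root *best = &roots[durable_idx];
        for (int i = 0; i < 3; i++) {
            if (i == durable_idx)
                continue;
            if (roots[i].txnid > best->txnid &&
                verify_new_pages(&roots[i], &roots[durable_idx]))
                best = &roots[i];
        }
        return best;
    }
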
Comment 1 Hallvard Furuseth 2016-08-06 16:42:14 UTC
On 06. aug. 2016 17:38, bentrask@comcast.net wrote:
> Transaction commits are one of the few bottlenecks in MDB, because it has to
> fsync twice, sequentially.
>
> I think MDB could support mixed low and high durability transactions in the same
> database by adding per-page checksums and a third root page. The idea is that
> when committing a low-durability transaction, no fsyncs are performed. (...)

Yesno.  We can get rid of fsyncs, but not that way.  Checksumming each
page isn't enough.  We must know it's the right version of the page and
not e.g. a similar page from a previous aborted transaction.  To commit
a branch or meta page, we'd need to scan its children and checksum the
page headers (thus including their checksum) of each.  Expensive.

IIRC there are three things we can do:

- Use and fsync a WAL (write-ahead log) instead of the database pages.
   That can be cheaper because it writes one contiguous region instead
   of a lot of random-access pages.  Requires recovery after a crash.

- Volatile metapages which mdb_env_open() _always_ throws away if no
   other environment is already open.  They are lost if the application
   crashes/exits without doing a final checkpoint.

- Improve that a bit: Put them in a shared memory region, since that
   won't survive a system crash (unlike if we put them in the lockfile).
   That way they'll survive application crash provided something does
   a checkpoint before next system crash.

We've discussed these sometimes and there are caveats for some of them,
I don't quite remember.  One issue is that a "system crash" isn't the
only thing which can lose unsynced pages.  Another is unmounting and
re-mounting the disk (e.g. a USB disk).


Comment 2 Hallvard Furuseth 2016-08-06 16:47:42 UTC
moved from Incoming to Software Enhancements
Comment 3 Howard Chu 2016-08-06 16:56:09 UTC
bentrask@comcast.net wrote:
> Transaction commits are one of the few bottlenecks in MDB, because it has to
> fsync twice, sequentially.
>
> I think MDB could support mixed low and high durability transactions in the same (...)

We already have low durability transactions, that's what MDB_NOMETASYNC is 
for. It only does one fsync per commit instead of two.
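
For reference, a minimal use of the existing flag (this one is real LMDB
API; error handling elided):

    #include <lmdb.h>

    /* MDB_NOMETASYNC: skip the metapage fsync.  A system crash can undo
     * the last committed transaction, but the database stays consistent. */
    int open_low_durability(MDB_env **env, const char *path)
    {
        int rc = mdb_env_create(env);
        if (rc)
            return rc;
        return mdb_env_open(*env, path, MDB_NOMETASYNC, 0664);
    }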


Comment 4 Hallvard Furuseth 2016-08-06 19:24:38 UTC
On 06/08/16 18:56, hyc@symas.com wrote:
> We already have low durability transactions, that's what MDB_NOMETASYNC is
> for. It only does one fsync per commit instead of two.

Haha, I didn't notice that since this reminded me of other discussions.
The stuff I was talking about was for fewer fsyncs than one per commit.


Comment 5 bentrask@comcast.net 2016-08-07 20:57:37 UTC
Thanks for the replies, Hallvard and Howard!

I was mistaken in thinking that NOMETASYNC didn't guarantee integrity. 
However, my proposal would allow fsync to be omitted entirely.

I think my approach with three roots is better than a WAL because it 
keeps the read and write paths simpler and more uniform. It also doesn't 
force periodic fsyncs when the log wraps, or consume unbounded space. In 
fact it's very similar to the basic design of MDB.

You're right that you'd actually need to record the page's checksum in 
the parent, rather than in the page itself. I guess this would hurt the 
branching factor.
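
Rough numbers for that cost, with purely illustrative sizes (4 KB pages,
16-byte branch entries, 8-byte checksums - not LMDB's actual layout):

    #include <stdio.h>

    int main(void)
    {
        const int page_size = 4096;  /* assumed page size             */
        const int entry     = 16;    /* pgno + short key, say         */
        const int checksum  = 8;     /* one 64-bit checksum per child */
        printf("fanout, no checksums: %d\n", page_size / entry);              /* 256 */
        printf("fanout, checksums:    %d\n", page_size / (entry + checksum)); /* 170 */
        return 0;
    }

With these made-up sizes the branching factor drops by about a third,
deepening the tree.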

Thanks again,
Ben

Comment 6 Howard Chu 2016-08-07 21:44:33 UTC
bentrask@comcast.net wrote:
> You're right that you'd actually need to record the page's checksum in
> the parent, rather than in the page itself. I guess this would hurt the
> branching factor.

And then it's turtles all the way down.

What you're suggesting won't work. Trust me when I say we have spent far more 
time thinking about this question than you have.

The only way to guarantee integrity is with ordered writes. All SCSI devices 
support this feature, but e.g. the Linux kernel does not (and neither does 
SATA, and no idea about PCIe SSDs...).

Lacking a portable mechanism for ordered writes, you have two choices for 
preserving integrity - append-only operation (which forces ordered writes 
anyway) or at least one synchronous write somewhere.

Whenever you decide to reuse existing pages rather than operating as 
append-only, you create the possibility of overwriting some required data 
before it was safe to do so. Your 3-root checksum scheme *might* let you 
detect that the DB is corrupted, but it *won't* let you recover to a clean 
state. Given that writes occur in unpredictable order, without fsyncs there is 
no way you can guarantee that anything sane is on the disk.


Comment 7 bentrask@comcast.net 2016-08-08 01:13:33 UTC

Consider three roots without any checksums. Each root has a simple flag 
indicating whether it was written durably (fsync write barrier). During 
recovery, non-durable roots are simply ignored/discarded. This is 
equivalent to Hallvard's suggestion for volatile meta-pages. I think 
it's pretty clear this is workable.

From there, checksums just give you slightly stronger guarantees, 
although they might not be worth the overhead (CPU/storage) and recovery 
complexity.
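
Sketched out, the recovery rule is just this (names hypothetical; and as
Howard notes below, a clean root alone says nothing about the data pages
under it):

    #include <stdint.h>
    #include <stddef.h>

    typedef struct Meta {
        uint64_t txnid;
        int      durable;   /* set only once the commit was fsync'd */
    } Meta;

    /* Discard non-durable roots; recover from the newest durable one. */
    static const Meta *pick_root(const Meta metas[3])
    {
        const Meta *best = NULL;
        for (int i = 0; i < 3; i++)
            if (metas[i].durable && (!best || metas[i].txnid > best->txnid))
                best = &metas[i];
        return best;    /* NULL only if nothing was ever synced */
    }
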

Comment 8 Howard Chu 2016-08-08 01:29:43 UTC

Knowing whether or not the root pages are pristine still doesn't tell you 
anything about whether the data pages are intact. The only way to make any of 
these schemes work is to avoid overwriting/reusing any data pages from the 
last N transactions - i.e., reverting to append-only behavior. So the 
underlying question (which we have wrestled with internally for quite some 
time, and which you haven't asked or answered) is: how many of these 
non-durable transactions will you support at any given time?


Comment 9 bentrask@comcast.net 2016-08-08 01:41:21 UTC

The idea was that the two "floating" roots would reuse pages the way MDB 
does now. The 3rd durable root would have its pages preserved 
separately. I can see why this would cause up to a ~2X storage increase 
as the two sets diverged, but I don't see why it would need to grow 
unbounded. Apologies for this stupid question.

Comment 10 Hallvard Furuseth 2016-08-08 09:41:28 UTC

A transaction must not reuse data pages visible in the last snapshot
known to be durable, since that's how far back LMDB may need to revert
after abnormal termination - as a crash after an MDB_NOMETASYNC commit
may require.

Sync the data pages from a txn, write the metapage, eventually sync
that metapage, wait out any older read-only transactions, and *then*
you can reuse the pages the txn freed.  Not before.  So when you don't
sync, or a read-only txn won't die, LMDB degenerates to append-only.

...except if you sync the metapage and exit, the next LMDB run may not
know you synced it and must assume the metapage isn't yet durable.
So it might not reuse pages visible to the _previous_ durable
metapage, until it syncs.  I'm rather losing track at this point,
but I think it may mean twice as many not-yet-reusable pages as one
might expect.
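
The ordering in the second paragraph, spelled out as steps (helper names
invented; only the ordering matters):

    typedef struct Txn Txn;   /* stand-ins for LMDB internals */
    static void sync_data_pages(Txn *t)        { (void)t; /* fdatasync */ }
    static void write_metapage(Txn *t)         { (void)t; }
    static void sync_metapage(Txn *t)          { (void)t; /* fdatasync */ }
    static void wait_for_older_readers(Txn *t) { (void)t; }
    static void release_freed_pages(Txn *t)    { (void)t; }

    /* A txn's freed pages may be recycled only after all five steps run
     * in this order.  Skip step 3 (no-sync commits) or stall step 4
     * (a long-lived reader) and step 5 never happens: append-only. */
    void durable_commit(Txn *txn)
    {
        sync_data_pages(txn);
        write_metapage(txn);
        sync_metapage(txn);
        wait_for_older_readers(txn);
        release_freed_pages(txn);
    }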

Comment 11 bentrask@comcast.net 2016-08-08 16:04:14 UTC

Concretely: say the current write transaction is number 10, and a 
long-lived reader is on number 7. Currently, MDB will be unable to reuse 
any pages used in transactions 7+ until the reader ends.

Now say a 3rd, durable root is added. For the sake of argument, no 
checksums are used and in the event of a crash, only the last durable 
state is recovered. Say the durable transaction is number 2. Pages used 
in transaction 2 need to be preserved, obviously. 7+ still need to be 
preserved for the slow reader. But pages from transactions 3-6 can be 
reused.

Note that the last durable transaction is controlled purely by the 
single writer, so tracking it is actually easier than tracking which 
readers are where.

If a crash happens before a durable root is fully synced, then there 
should be a second, older durable root that hasn't been reused yet. In 
that case MDB recovers the way it does currently.

Does this make sense? Thanks for bearing with me.

Comment 12 Hallvard Furuseth 2016-08-10 12:50:27 UTC
Nope, you're as confused as I was originally :-)  LMDB doesn't know or
care when a page was written.  A page can be reused when the snapshot
which _freed_ it is known to be durable and there are no older readers.
(We could improve that by tracking page history better.  Maybe later.)

"Known to be durable" = sync datapages, write metapage, sync metapage,
note that the metapage was synced.  (We implicitly note that when
writing next txn's metapage, since we must have synced first.)  From
a data safety point of view, txns which do all that are the real txns.
Anything else is fluff, like no-sync txns if we implement them.  Their
metapages must go somewhere they *won't* be confused with durable ones.

Think of such a fluffy commit as saving an intermediate stage of a
real txn.  That's irrelevant to a later write-txn wanting to not touch
the last two durable snapshots.  It's only relevant vs. oldest reader.
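
As a predicate (hypothetical names), the rule in the first paragraph is:

    #include <stdint.h>
    #include <stdbool.h>

    /* A page freed in txn F is reusable only when the snapshot that
     * freed it is durable and no live reader still needs it.  When the
     * page was *written* never enters into it. */
    static bool page_reusable(uint64_t freed_in_txn,
                              uint64_t last_durable_txn,
                              uint64_t oldest_reader_txn)
    {
        return freed_in_txn <= last_durable_txn
            && freed_in_txn <  oldest_reader_txn;
    }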

So.  3rd metapage and variants - I've tried and Howard pointed out
the flaws, Howard tried and I said here we go again.  We do not need
another round, but it's just as well to have it summarized here.

(This discussion ignores MDB_NOSYNC and partly MDB_NOLOCK - if the
user enables either, it's his responsibility to compensate.)


Comment 13 bentrask@comcast.net 2016-08-10 14:36:22 UTC
Okay, thanks for taking the time to discuss, and of course for all your 
work on MDB!

Ben

Comment 14 Hallvard Furuseth 2016-09-17 05:46:25 UTC
I'll use this ITS to summarize details for "volatile commits".
Hopefully I've managed to keep it all straight.

"Volatile" vs. "durable" are the most accurate names I can think of.
Not sure if that's more instructive than simply "soft" / "hard".

Description:

* Volatile commits omit fdatasync() without losing consistency.  To
  survive, they must be checkpointed *before* all processes close the
  env.  Un-checkpointed volatiles are lost when the env closes.

  Thus a separate checkpointing daemon can keep the env open to
  protect volatiles from application crash, at least if Robust locks
  are supported.  (The lmdb.h doc seems a bit unclear about Robust.)

* Checkpointing == committing a durable (non-volatile) write-txn.
  (If there is nothing to do, Commit writes nothing.)

  mdb_env_sync() will not checkpoint volatiles, since existing
  programs do not expect it to wait for the write mutex.  It
  "checkpoints" MDB_NOMETASYNC/MDB_NOSYNC.  Maybe mdb_checkpoint()
  will have a special case which obsoletes mdb_env_sync().

* Volatiles are unsupported with MDB_NOLOCK and pointless with
  MDB_NOSYNC.  OTOH it makes sense to enable MDB_NOMETASYNC.

* Volatiles need a bigger datafile, because it takes two durable
  commits to make a freed page reusable. (Plus awaiting old readers).

Configuration.  Too many options, ouch:

  LMDB can be configured to auto-checkpoint after X volatile commits
  and/or Y written kbytes(pages?).  Programs can also checkpoint every
  Z minutes(seconds?) - configured in LMDB to mimic Berkeley DB's
  "checkpoint <kbytes> <minutes>", but regular LMDB ops ignore that.

  The lockfile gives the current config.  An MDB_env could override
  for its particular process, e.g. with an MDB_NO_VOLATILE flag.
  Maybe resetting the lockfile should keep the previous config?
  OTOH I suppose MDB_meta can have default params the way it has
  mm_mapsize.  That survives a backup/restore.
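
  A purely hypothetical shape for those knobs; no such struct or call
  exists in LMDB:

      /* X/Y/Z from the paragraph above. */
      typedef struct MDB_ckpt_policy {
          unsigned max_volatile_commits;  /* X: auto-checkpoint after this many */
          unsigned max_written_kbytes;    /* Y: ...or this much written data    */
          unsigned interval_minutes;      /* Z: advisory only; regular LMDB ops
                                             ignore it, a la Berkeley DB        */
      } MDB_ckpt_policy;

      /* int mdb_env_set_ckpt_policy(MDB_env *env, const MDB_ckpt_policy *p); */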

Implementation - plain version first:

* Bump MDB_LOCK_FORMAT, MDB_DATA_VERSION (or make MDB_DATA_FORMAT).

* Keep 2 'MDB_meta's in the lockfile, for volatile commits.
  MDB_env.me_metas[] gets 4 elements: durable + volatile.
  
* mdb_env_open() throws away volatiles if it re-inits the lockfile.

* Add field MDB_meta.mm_oldest: 1 + (previous durable meta).mm_txnid
  in durable metas, and (previous meta).mm_oldest in volatile metas.

  Init 'oldest' in mdb_find_oldest() to new field MDB_env.me_oldest,
  which mdb_txn_renew0(write txn) sets to MDB_meta.mm_oldest.

  When there are no volatiles, this ends up initing 'oldest' to the
  same value as today.  Usually we could just have used 1 + (oldest
  durable meta).mm_txnid, but a failed write_meta() may have clobbered
  that.
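
  I.e., durable metas restart the chain and volatile metas carry it
  forward - as a helper (the field and names are the proposal's, not
  current LMDB):

      #include <stdint.h>

      typedef struct Meta {
          uint64_t mm_txnid;
          uint64_t mm_oldest;
      } Meta;

      static uint64_t next_mm_oldest(const Meta *prev_meta,
                                     const Meta *prev_durable,
                                     int new_meta_is_volatile)
      {
          return new_meta_is_volatile
              ? prev_meta->mm_oldest          /* inherit */
              : prev_durable->mm_txnid + 1;   /* restart at last durable */
      }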

* Replace MDB_txninfo.mti_txnid with mti_metaref = txnid*16 + P*4 + M:
    M = index to MDB_env.me_metas[],
    P = previous M during this session, initialized to the same as M,
  so we can get this info atomically.

  P may prove unnecessary, but it's simplest to just include it for
  now.  It's there for when meta M fails a checksum and we want an older
  meta, for mdb_mutex_failed(), and maybe so we can see whether there
  are volatiles yet.
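
  As macros (names invented here), so the whole reference stays one
  atomically readable word:

      #define METAREF(txnid, P, M) (((txnid) << 4) | ((P) << 2) | (M))
      #define METAREF_TXNID(r)     ((r) >> 4)
      #define METAREF_P(r)         (((r) >> 2) & 3)  /* previous meta index */
      #define METAREF_M(r)         ((r) & 3)         /* current meta index  */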

* Never use mdb_env_pick_meta() when the current metapage is known:
  Use it in mdb_env_open(), in mdb_mutex_failed(), and if MDB_NOLOCK.
  Or rather, I guess it gets a "force" param for those cases.

* Add config in the lockfile. Maybe per-env config overriding it and
  defaults in the datafile. Txn flags "prefer volatile", "checkpoint".
  
* Track the number of pages and volatiles since last durable commit.
  write_meta() compares with the config limits and makes the final
  decision of whether the new meta will be volatile.

  Add MDB_pgstate.mf_pgcount with #pages used so far.  The rest goes
  in a lockfile array[4] indexed by mti_metaref % 4, or in MDB_meta.
  That way, switching to next snapshot stays atomic - just update
  mti_metaref.

* txn_begin must verify metas, since we have no fdatasync barriers.
  Re-read and compare, or checksum.

  write_meta() and mutex_failed(): memory barrier between making
  a volatile meta and updating mti_metaref.  Most modern compilers
  have that.  Maybe a fallback implementation is lock;unlock an
  otherwise unused mutex.  Should also include CACHEFLUSH().
  
  It may make sense to have more than 2 volatile metas, so read-only
  txns will have more time to read a meta before it gets overwritten.

  MDB_WRITEMAP (and MDB_VL32?) has non-atomic issues we should deal
  with anyway.
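
  With C11 atomics the publish order might look like this (the variable
  stands in for the lockfile field):

      #include <stdatomic.h>
      #include <stdint.h>

      static _Atomic uint64_t mti_metaref;   /* stand-in for lockfile field */

      /* Writer: finish filling in the volatile meta, THEN publish the
       * reference with release ordering so no reader can observe the new
       * mti_metaref before the meta contents it points to. */
      void publish_meta(uint64_t new_ref /* meta already fully written */)
      {
          atomic_store_explicit(&mti_metaref, new_ref, memory_order_release);
      }

      /* Reader side (e.g. txn_begin) pairs with an acquire load. */
      uint64_t read_metaref(void)
      {
          return atomic_load_explicit(&mti_metaref, memory_order_acquire);
      }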

* We can have (durable metapage).mp_pgno == (txnid & 1) as before:
  mdb_txn_renew0() steps txnid by 2 instead of 1 if 'meta' is volatile.
  But note that the txnid doesn't say if the snapshot is durable.
  
* "Try to checkpoint" feature, which does not await the write mutex:

  Trylock the write mutex in mdb_txn_begin().  If it fails, set a
  lockfile flag "Please checkpoint" and return.  Hopefully someone
  will obey and clear the flag.  mdb_env_commit(writer) does.
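
  With POSIX threads the trylock path might look like this (the flag
  and mutex are stand-ins for the lockfile fields):

      #include <pthread.h>
      #include <stdatomic.h>

      static pthread_mutex_t write_mutex = PTHREAD_MUTEX_INITIALIZER;
      static atomic_int please_checkpoint;   /* "Please checkpoint" flag */

      void try_checkpoint(void)
      {
          if (pthread_mutex_trylock(&write_mutex) != 0) {
              /* Writer is busy: leave a request and return immediately;
               * the current writer clears it when it commits. */
              atomic_store(&please_checkpoint, 1);
              return;
          }
          /* ... commit a durable (checkpoint) write-txn ... */
          pthread_mutex_unlock(&write_mutex);
      }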

Variants:

* Put volatile MDB_metas in the datafile, behind the usual MDB_metas.

  That protects the metas from malicious/broken processes with
  read-only envs.  Otherwise, using the lockfile (or non-file shared
  memory) extends read-only envs' ability to hack/break the DB.
  
  Be careful to not read volatile MDB_metas that are older than last
  lockfile-reset, since the reset did not clear them.  That also means
  this variant does not enable volatiles with MDB_NOLOCK.

* Stay with MDB_DATA_VERSION = 1, no change in datafile format:

  - Volatiles share next durable txn's txnid (for freeDB keys), but
    put last durable txn's txnid in mti_metaref (for mdb_find_oldest).

  - mdb_txn_id() / MDB_envinfo.me_last_txnid can no longer be used to
    distinguish txns.

    Apps doing that could set an env flag "ignore volatiles", or txn
    flag "fail if current snapshot is volatile".

  - Define V = 2-bit sequence number incremented by commit(writer).
    Include V in MDB_txninfo.mti_metaref.  In volatile metas, include
    mm_metaref = copy of mti_metaref.

    This lets mutex_failed() figure out which MDB_meta is most recent:
    abs(V in mm_metaref - V in mti_metaref) is <= 1 whether mm_metaref
    or mti_metaref changes first in the thread.

* Support full 32-bit txnids on 32-bit hosts.

  mti_metaref eats some txnid bits in order to stay atomic, but we can
  get them back:

  On 32-bit hosts, make mti_metaref a 64-bit value - something like 
  ((txnid << 32) | R) where R = 32-bit metaref value described before.

  When reading mti_metaref, R is authoritative for the low txnid bits.
  If (txnid << 32) part does not match, adjust it so it does: That'll
  be + or - a small value.  mti_metaref's high bits vary slowly, so
  this is normally "atomic".  txn_renew0's loop re-reads it to verify.
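
  The adjustment step, concretely (assuming 28 low txnid bits survive
  the 4 metaref bits; all names invented):

      #include <stdint.h>

      #define PERIOD (1u << 28)   /* low-txnid bits left after P and M */

      /* The low word R is authoritative for the low 28 txnid bits; the
       * high word is a slowly-varying full txnid.  Snap the high word's
       * upper bits onto R's low bits, stepping by one period if the two
       * halves straddled a wrap. */
      static uint32_t reconstruct_txnid(uint64_t metaref)
      {
          uint32_t hi  = (uint32_t)(metaref >> 32);
          uint32_t low = ((uint32_t)metaref) >> 4;   /* strip P and M */
          uint32_t txn = (hi & ~(PERIOD - 1)) | low;
          if (txn > hi + PERIOD / 2)
              txn -= PERIOD;        /* high word lagged across a wrap */
          else if (txn + PERIOD / 2 < hi)
              txn += PERIOD;        /* high word ran ahead */
          return txn;
      }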

* We can squeeze away some bytes and mti_metaref bits, but don't get
  sucked into spending time on that before the rest is working.

  E.g. the array[4] counting pages/volatiles can be just [2] if
  Commit always toggles the (mti_metaref & 1) bit.  And we can likely
  include some of the mti_metaref fields in the txnid.

Roads to Hell:

* Support checkpointing without waiting for the write-mutex.

  Programs which have used volatiles may want this so they can exit
  quickly.  But it requires a new shared mutex for write_meta(), or
  some clever tricks I've thought of which would not quite work.

* Rescue volatiles in a dead lockfile when the user "knows" it's safe.

  Easy enough, just don't clear them when resetting the lockfile.
  But users screw up, and may then blame LMDB.  Users who want to
  screw up can use MDB_NOSYNC instead of volatile commits.

  Or if volatile metas are in non-file shared memory which a system
  crash will kill, it's _almost_ safe to not reset them along with the
  lockfile.  Unless someone unmounts/mounts the disk, or replaces the
  DB by overwriting it with another DB file, or who knows what else.


Comment 15 OpenLDAP project 2016-09-17 06:12:07 UTC
Message#13 outlines a "volatile commit" implementation.
Comment 16 Hallvard Furuseth 2016-09-17 06:12:07 UTC
changed notes