
MDB v2: Replace meta pages with "meta position" word



I think MDB v2 should move the variable parts of MDB_meta into the
data pages.  The datafile header would retain a word with the position
of the last *synced* MDB_meta, or of the last meta when MDB_NOSYNC.
The lockfile header would hold the position of the *last* MDB_meta.

All transactions start from the lockfile->metapos commit.  Write txns
do not reuse free pages younger than the datafile->metapos commit.
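
A minimal sketch of that reuse rule, under some assumptions: "younger"
is read as "freed by a txn with a higher txnid", toy_txnid_t and
toy_can_reuse() are hypothetical names, and the usual oldest-reader
restriction is ignored here since it would apply on top of this check.

```c
#include <stdint.h>

typedef uint64_t toy_txnid_t;   /* hypothetical stand-in for txnid_t */

/* Return 1 if pages freed by txn `freed_by` may be reused by a writer,
 * given that `synced` is the txnid of the datafile->metapos commit.
 * Pages freed by a later txn are still part of the synced snapshot,
 * so overwriting them could corrupt what a crash would roll back to. */
int toy_can_reuse(toy_txnid_t freed_by, toy_txnid_t synced)
{
    return freed_by <= synced;
}
```

With the synced commit at txnid 7, pages freed by txns 5 and 7 would be
reusable under this reading, while pages freed by txn 8 would not.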

mdb_env_sync() called by the user does roughly:
  size_t lastpos = lockfile->metapos;
  sync;
# define pos2id(env, pos) (((MDB_meta*)((env)->me_map+(pos)))->mm_txnid)
  if (pos2id(env, lastpos) > pos2id(env, datafile->metapos))
    write lastpos to &datafile->metapos;
Called from mdb_txn_commit(), this may need lastpos as an argument.


Results, if I'm keeping this straight:

Setting the latest commit becomes atomic: Just change metapos.
(Field MDB_txninfo.mti_txnid goes away.)

No sync issues with copying 'MDB_db's from the meta, since the meta
will not be overwritten during the txn.

Users can sync infrequently yet preserve consistency, a generalization
of MDB_NOMETASYNC.  An application crash will then lose unsynced
commits, since resetting the lockfile must reset lockfile->metapos.
MDB cannot know if a system crash left those commits unsynced.
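
To illustrate what gets lost, here is a toy in-memory model of the two
metapos words. It is only a sketch: ToyEnv and the toy_* functions are
made up for this example, positions are array indexes rather than file
offsets, and no real I/O or locking happens.

```c
#include <stddef.h>

enum { TOY_NMETA = 8 };

typedef struct ToyEnv {
    unsigned long txnid_at[TOY_NMETA]; /* txnid of the meta at each position */
    size_t data_metapos;    /* datafile header: last synced meta */
    size_t lock_metapos;    /* lockfile header: last committed meta */
} ToyEnv;

/* Commit: record the meta at `pos`, then publish it with one store. */
void toy_commit(ToyEnv *env, size_t pos, unsigned long txnid)
{
    env->txnid_at[pos] = txnid;
    env->lock_metapos = pos;
}

/* Sync: let the datafile header catch up with the lockfile. */
void toy_sync(ToyEnv *env)
{
    size_t lastpos = env->lock_metapos;
    if (env->txnid_at[lastpos] > env->txnid_at[env->data_metapos])
        env->data_metapos = lastpos;
}

/* Lockfile reset after an application crash: only the datafile
 * header is trusted, so commits made since the last sync are dropped. */
void toy_reset_lockfile(ToyEnv *env)
{
    env->lock_metapos = env->data_metapos;
}

/* The txnid of the commit a new txn would start from. */
unsigned long toy_start_txnid(const ToyEnv *env)
{
    return env->txnid_at[env->lock_metapos];
}
```

If txn 2 is committed and synced, and txns 3 and 4 are committed without
a sync, new txns start from txn 4 until the lockfile is reset, and from
txn 2 afterwards.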

mdb_env_sync() needs a mutex - either its own or the write lock.
(A soft mode could trylock and do nothing if that fails.)
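
The soft mode could look roughly like this with a plain pthread mutex.
All names here (toy_env_sync, toy_meta_mutex, TOY_SYNC_SKIPPED) are
hypothetical, the sync body is a placeholder counter, and a real MDB
mutex would live in the lockfile and be process-shared.

```c
#include <errno.h>
#include <pthread.h>

enum { TOY_SYNC_DONE = 0, TOY_SYNC_SKIPPED = -1 };

pthread_mutex_t toy_meta_mutex = PTHREAD_MUTEX_INITIALIZER;
int toy_sync_calls;     /* counts how often the sync body ran */

/* Placeholder for {sync; set datafile->metapos}. */
static void toy_do_sync(void)
{
    toy_sync_calls++;
}

/* soft != 0: give up at once if another thread holds the mutex.
 * soft == 0: wait for the mutex.
 * Returns TOY_SYNC_DONE, TOY_SYNC_SKIPPED, or an errno value if
 * locking failed for some other reason. */
int toy_env_sync(int soft)
{
    int rc;

    if (soft) {
        rc = pthread_mutex_trylock(&toy_meta_mutex);
        if (rc == EBUSY)
            return TOY_SYNC_SKIPPED;
    } else {
        rc = pthread_mutex_lock(&toy_meta_mutex);
    }
    if (rc != 0)
        return rc;

    toy_do_sync();
    pthread_mutex_unlock(&toy_meta_mutex);
    return TOY_SYNC_DONE;
}
```

A skipped soft sync is harmless as long as whoever holds the mutex is
itself syncing; whether the caller should be told about the skip is a
design choice this sketch makes arbitrarily.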

If mdb_env_sync() gets its own mutex, then mdb_txn_commit() can
announce the commit at lockfile->metapos and unlock the write lock
_before_ doing mdb_env_sync.  With multiple writer threads, that's
like an ACID-safe MDB_MAPASYNC.
However, that has quirks.  I don't know how serious they are:
- mdb_txn_commit() can fail after other txns see the commit, or
  succeed but set a failure flag for other txns to react to.
  A delayed mdb_env_sync can fail today too, but now it will also
  fail if it cannot set datafile->metapos.
- mdb_txn_commit() may not return immediately after the commit
  becomes visible to other txns, unless it is set up to queue the
  {sync; set datafile->metapos} actions for a maintenance thread.


More detailed draft code, still ignoring various flags:

typedef struct MDB_meta {   /* Meta info about a commit */
    MDB_db      mm_dbs[2];
    txnid_t     mm_txnid;
    pgno_t      mm_last_pg;
} MDB_meta;

typedef struct MDB_header { /* Datafile header */
    ...;
    /* Position of last synced meta - or last known if MDB_NOSYNC */
    size_t      mh_metapos;
} MDB_header;

typedef struct MDB_txbody { /* Lockfile header */
    ...;
    /* Position of last meta, possibly not synced. Both read and write
     * txns start at this commit. Replaces the old member mtb_txnid. */
    size_t      mtb_metapos;
} MDB_txbody;

mdb_txn_commit(MDB_txn *txn) {
    ...;
    /* Commit a write txn: */
    pwritev(env->me_fd, <data pages including MDB_meta>);
    /* Make the commit visible to other txns */
    lockfile->mtb_metapos = <offset of MDB_meta in me_map>;
    unlock(write_mutex);
    /* Preserve the commit */
    mdb_env_sync(env, 0);
}

# define pos2id(env, pos) (((MDB_meta*)((env)->me_map+(pos)))->mm_txnid)

mdb_env_sync(MDB_env *env, int force) {
    MDB_txninfo *txns = env->me_txns;
    enum { metapos_pos = offsetof(MDB_header, mh_metapos) };

    lock(meta_mutex);

    /* Positions of meta pages known to datafile and lockfile */
    size_t cur = *(size_t *)(env->me_map + metapos_pos);
    size_t lastpos = txns->mtb_metapos;
    int got_new = pos2id(env, lastpos) > pos2id(env, cur);

    if (force || (got_new && !(env->me_flags & MDB_NOSYNC)))
        fdatasync(env->me_fd);

    /* Make datafile catch up with pre-fdatasync lockfile */
    if (got_new)
        pwrite(env->me_mfd, &lastpos, sizeof(lastpos), metapos_pos);

    unlock(meta_mutex);
}

-- 
Hallvard