Re: MDB v2: Replace meta pages with "meta position" word

To: Hallvard Breien Furuseth <h.b.furuseth@usit.uio.no>
Subject: Re: MDB v2: Replace meta pages with "meta position" word
From: Howard Chu <hyc@symas.com>
Date: Sun, 11 Nov 2012 13:25:40 -0800
Cc: openldap-devel@openldap.org
In-reply-to: <hbf.20121111b9v8@bombur.uio.no>
References: <hbf.20121111b9v8@bombur.uio.no>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:19.0) Gecko/19.0 Firefox/19.0 SeaMonkey/2.16a1

Hallvard Breien Furuseth wrote:

I think MDB v2 should move the variable parts of MDB_meta into the
data pages.  The datafile header would retain a word with the position
of the last *synced* MDB_meta, or of the last meta when MDB_NOSYNC.
The lockfile header would hold the position of the *last* MDB_meta.

All transactions start from the lockfile->metapos commit.  Write txns
do not reuse free pages younger than the datafile->metapos commit.
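
A minimal sketch of that reuse rule, for illustration only. The names
free_page, freed_by and can_reuse are hypothetical, not MDB identifiers,
and the sketch ignores the separate constraint that pages still visible
to active readers must not be reused either.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t txnid_t;   /* stand-in for MDB's txnid_t */

/* Hypothetical record: a page that was freed by commit `freed_by`. */
typedef struct free_page {
    size_t  pgno;
    txnid_t freed_by;
} free_page;

/* A page freed by a commit newer than the last synced commit may still
 * be referenced by the snapshot a crash would fall back to, so it must
 * not be reused yet. `synced_txnid` is the txnid of the meta found at
 * datafile->metapos. */
int can_reuse(const free_page *fp, txnid_t synced_txnid)
{
    assert(fp != NULL);
    return fp->freed_by <= synced_txnid;
}
```

With the synced commit at txnid 5, a page freed by commit 5 is reusable
while a page freed by commit 6 is not.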

I don't see this approach reducing seek overhead. It may be able to reduce
sync overhead, but only if you accept the possibility of delayed syncs
failing. Overall I don't see it as any improvement for ACID compliance.


What is the real benefit?

It may be worthwhile. I just want the actual specific advantages spelled out.
Other DB systems use delayed/group commit to reduce sync overhead. It's worth
doing, when your application can tolerate that type of behavior. But this
can't be the default behavior.

mdb_env_sync() called by the user does roughly:
   size_t lastpos = lockfile->metapos;
   sync;
# define pos2id(env, pos) (((MDB_meta*)((env)->me_map+(pos)))->mm_txnid)
   if (pos2id(env, lastpos) > pos2id(env, datafile->metapos))
     write lastpos to &datafile->metapos;
Called from mdb_txn_commit(), this may need lastpos as an argument.
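
The steps above can be modelled in memory to see why lastpos is read
before the sync. This is only a toy sketch: toy_env, toy_commit and
toy_env_sync are made-up names, a byte array stands in for the map, and
a counter stands in for the real sync call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t txnid_t;
typedef struct toy_meta { txnid_t mm_txnid; } toy_meta;

typedef struct toy_env {
    unsigned char map[4096];  /* stand-in for env->me_map */
    size_t data_metapos;      /* models datafile->metapos */
    size_t lock_metapos;      /* models lockfile->metapos */
    int    syncs;             /* number of simulated syncs */
} toy_env;

static txnid_t toy_pos2id(const toy_env *env, size_t pos)
{
    toy_meta m;
    memcpy(&m, env->map + pos, sizeof m);
    return m.mm_txnid;
}

/* Write a meta at `pos` and announce it in the "lockfile". */
void toy_commit(toy_env *env, size_t pos, txnid_t id)
{
    toy_meta m = { id };
    assert(pos + sizeof m <= sizeof env->map);
    memcpy(env->map + pos, &m, sizeof m);
    env->lock_metapos = pos;
}

/* Returns 1 if the "datafile" position advanced, 0 otherwise. */
int toy_env_sync(toy_env *env)
{
    /* Snapshot first: commits announced during the sync are not
     * guaranteed to be covered by it. */
    size_t lastpos = env->lock_metapos;
    env->syncs++;             /* a real version would sync here */
    if (toy_pos2id(env, lastpos) > toy_pos2id(env, env->data_metapos)) {
        env->data_metapos = lastpos;
        return 1;
    }
    return 0;
}
```

Several commits followed by one toy_env_sync() advance data_metapos
straight to the newest meta; a second call with nothing new leaves it
alone.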


Results, if I'm keeping this straight:

Setting the latest commit becomes atomic: Just change metapos.
(Field MDB_txninfo.mti_txnid goes away.)

No sync issues with copying 'MDB_db's from the meta, since the meta
will not be overwritten during the txn.

Users can sync infrequently yet preserve consistency, a generalization
of MDB_NOMETASYNC.  An application crash will then lose unsynced
commits, since resetting the lockfile must reset lockfile->metapos.
MDB cannot know if a system crash left those commits unsynced.
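
A small sketch of that reset, under the assumption that reinitializing
the lockfile simply copies the datafile's position. toy_positions and
toy_reset_lockfile are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef struct toy_positions {
    size_t data_metapos;   /* models datafile->metapos: last synced meta */
    size_t lock_metapos;   /* models lockfile->metapos: last announced meta */
} toy_positions;

/* Reinitialize the lockfile's position from the datafile.
 * Returns 1 if the two positions differed before the reset, meaning
 * commits past the synced one become unreachable, else 0. */
int toy_reset_lockfile(toy_positions *p)
{
    int dropped;
    assert(p != NULL);
    dropped = (p->lock_metapos != p->data_metapos);
    p->lock_metapos = p->data_metapos;
    return dropped;
}
```

The function cannot tell whether the dropped commits had in fact reached
the disk, which is the limitation described above.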

mdb_env_sync() needs a mutex - either its own or the write lock.
(A soft mode could trylock and do nothing if that fails.)
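
One way the soft mode might look with POSIX mutexes. sync_with_mutex,
sync_fn and count_sync are invented for this sketch; the callback stands
in for whatever the real sync would do.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

typedef int (*sync_fn)(void *ctx);

/* Example callback: counts how often a sync was actually performed. */
int count_sync(void *ctx)
{
    ++*(int *)ctx;
    return 0;
}

/* Run do_sync under the mutex. In soft mode, give up immediately if
 * the mutex is busy and return EBUSY without syncing. Otherwise return
 * the lock error or the callback's result. */
int sync_with_mutex(pthread_mutex_t *m, int soft, sync_fn do_sync, void *ctx)
{
    int rc;
    assert(m != NULL && do_sync != NULL);
    rc = soft ? pthread_mutex_trylock(m) : pthread_mutex_lock(m);
    if (rc != 0)
        return rc;          /* soft mode: EBUSY means someone else has it */
    rc = do_sync(ctx);
    pthread_mutex_unlock(m);
    return rc;
}
```

A caller in soft mode would treat EBUSY as "another sync is already in
progress" rather than as a failure.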

If mdb_env_sync() gets its own mutex, then mdb_txn_commit() can
announce the commit at lockfile->metapos and unlock the write lock
_before_ doing mdb_env_sync.  With multiple writer threads, that's
like an ACID-safe MDB_MAPASYNC.
However, that has quirks.  I don't know how serious they are:
- mdb_txn_commit() can fail after other txns see the commit, or
   succeed but set a failure flag for other txns to react to.
   Delayed mdb_env_sync can fail today too, but it will also
   happen if mdb_env_sync cannot set datafile->metapos.
- mdb_txn_commit() may not return immediately after the commit
   becomes visible to other txns.  Unless it is set up to queue the
   {sync; set datafile->metapos} actions for a maintenance thread.
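
The maintenance-thread variant could be sketched as below, assuming a
single worker and a one-slot queue where a newer request replaces an
older one. All names (sync_queue, sq_request, sq_worker, ...) are
hypothetical, and the actual sync is only marked by a comment.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t txnid_t;

typedef struct sync_queue {
    pthread_mutex_t mu;
    pthread_cond_t  cv;
    int     has_pending, stop, syncs;
    size_t  pending_pos;  txnid_t pending_id;  /* newest request */
    size_t  synced_pos;   txnid_t synced_id;   /* models datafile->metapos */
} sync_queue;

void sq_init(sync_queue *q)
{
    assert(q != NULL);
    pthread_mutex_init(&q->mu, NULL);
    pthread_cond_init(&q->cv, NULL);
    q->has_pending = q->stop = q->syncs = 0;
    q->pending_pos = q->synced_pos = 0;
    q->pending_id = q->synced_id = 0;
}

/* Called by a committer after announcing its commit. Requests may
 * arrive out of order, so only the highest txnid is kept. */
void sq_request(sync_queue *q, size_t pos, txnid_t id)
{
    pthread_mutex_lock(&q->mu);
    if (!q->has_pending || id > q->pending_id) {
        q->pending_pos = pos;
        q->pending_id = id;
        q->has_pending = 1;
    }
    pthread_cond_signal(&q->cv);
    pthread_mutex_unlock(&q->mu);
}

void sq_stop(sync_queue *q)
{
    pthread_mutex_lock(&q->mu);
    q->stop = 1;
    pthread_cond_signal(&q->cv);
    pthread_mutex_unlock(&q->mu);
}

/* Maintenance thread: drains pending requests, then exits on stop. */
void *sq_worker(void *arg)
{
    sync_queue *q = arg;
    pthread_mutex_lock(&q->mu);
    for (;;) {
        while (!q->has_pending && !q->stop)
            pthread_cond_wait(&q->cv, &q->mu);
        if (!q->has_pending)
            break;                      /* stop requested, queue empty */
        size_t pos = q->pending_pos;
        txnid_t id = q->pending_id;
        q->has_pending = 0;
        pthread_mutex_unlock(&q->mu);
        /* a real worker would sync the datafile here, outside the lock */
        pthread_mutex_lock(&q->mu);
        q->syncs++;
        if (id > q->synced_id) {        /* never move backwards */
            q->synced_pos = pos;
            q->synced_id = id;
        }
    }
    pthread_mutex_unlock(&q->mu);
    return NULL;
}
```

With this shape mdb_txn_commit() would return right after sq_request(),
at the cost that a sync failure can only be reported later, which is the
first quirk listed above.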


More detailed draft code, still ignoring various flags:

typedef struct MDB_meta {   /* Meta info about a commit */
     MDB_db      mm_dbs[2];
     txnid_t     mm_txnid;
     pgno_t      mm_last_pg;
} MDB_meta;

typedef struct MDB_header { /* Datafile header */
     ...;
     /* Position of last synced meta - or last known if MDB_NOSYNC */
     size_t      mh_metapos;
} MDB_header;

typedef struct MDB_txbody { /* Lockfile header */
     ...;
     /* Position of last meta, possibly not synced. Both read and write
      * txns start at this commit. Replaces the old member mtb_txnid. */
     size_t      mtb_metapos;
} MDB_txbody;

mdb_txn_commit(MDB_txn *txn) {
     ...;
     /* Commit a write txn: */
     pwritev(env->me_fd, <data pages including MDB_meta>);
     /* Make the commit visible to other txns */
     lockfile->mtb_metapos = <offset of MDB_meta in me_map>;
     unlock(write_mutex);
     /* Preserve the commit */
     mdb_env_sync(env, 0);
}

# define pos2id(env, pos) (((MDB_meta*)((env)->me_map+(pos)))->mm_txnid)

mdb_env_sync(MDB_env *env, int force) {
     MDB_txninfo *txns = env->me_txns;
     enum { metapos_pos = offsetof(MDB_header, mh_metapos) };

     lock(meta_mutex);

     /* Positions of meta pages known to datafile and lockfile */
     size_t cur = *(size_t *)(env->me_map + metapos_pos);
     size_t lastpos = txns->mtb_metapos;
     int got_new = pos2id(env, lastpos) > pos2id(env, cur);

     if (force || (got_new && !(env->me_flags & MDB_NOSYNC)))
         fdatasync(env->me_fd);

     /* Make datafile catch up with pre-fdatasync lockfile */
     if (got_new)
         pwrite(env->me_mfd, &lastpos, sizeof(lastpos), metapos_pos);

     unlock(meta_mutex);
}



--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/

Follow-Ups:
- Re: MDB v2: Replace meta pages with "meta position" word
  - From: Hallvard Breien Furuseth <h.b.furuseth@usit.uio.no>

References:
- MDB v2: Replace meta pages with "meta position" word
  - From: Hallvard Breien Furuseth <h.b.furuseth@usit.uio.no>
