
Re: mdb meta pages



Hallvard Breien Furuseth wrote:
Could writing a word/byte to the current meta page break something?

While I'm asking, why are the metas separate pages, instead of simply
being a fixed 256 or so bytes apart to keep them in separate cache lines?

The only reason I can think of is that if a write gets garbled, the
other meta page is safe - but mdb assumes correct filesystem
operation anyway.

Because the fundamental unit of storage is a page. Writing to anything smaller than a page requires the OS to read the full page and then update the relevant portion of it. Doing so from multiple processes would require file locking to prevent corruption. Writes to separate pages are guaranteed not to interfere with each other.
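
To make the page-granularity point concrete, here is a minimal C sketch, assuming a simplified two-slot layout: each meta record occupies its own page (page 0 or page 1), the slot is chosen by the low bit of the transaction ID, and a reader takes whichever slot has the higher txnid. The struct meta fields and the write_meta()/read_current_meta() helpers are hypothetical and far simpler than the real mdb meta page; pread(), pwrite() and sysconf(_SC_PAGESIZE) are standard POSIX.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical, heavily simplified meta record; not the real mdb layout. */
struct meta {
    uint64_t txnid;     /* transaction that wrote this meta */
    uint64_t root_pgno; /* root page of the B-tree it describes */
};

/* Write `m` as one full page into slot (txnid & 1), i.e. page 0 or page 1.
 * Writing whole, page-aligned pages means the OS never has to
 * read-modify-write a partial page, and the two slots never share a page. */
static int write_meta(int fd, long pagesize, const struct meta *m)
{
    char *page = calloc(1, (size_t)pagesize);
    if (page == NULL)
        return -1;
    memcpy(page, m, sizeof *m);
    off_t off = (off_t)(m->txnid & 1) * pagesize;
    ssize_t n = pwrite(fd, page, (size_t)pagesize, off);
    free(page);
    return n == pagesize ? 0 : -1;
}

/* Read both slots and return the one with the higher txnid. */
static int read_current_meta(int fd, long pagesize, struct meta *out)
{
    struct meta m[2];
    for (int i = 0; i < 2; i++) {
        if (pread(fd, &m[i], sizeof m[i], (off_t)i * pagesize) != (ssize_t)sizeof m[i])
            return -1;
    }
    *out = m[0].txnid > m[1].txnid ? m[0] : m[1];
    return 0;
}
```

Because each write covers a whole page-aligned page, a writer updating one slot never touches the bytes of the other slot, so the two updates cannot interfere at the storage level.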

This is for a "syncdelay <count>" feature to replace "dbnosync".
The latter can break DB consistency after a system crash: without
fdatasync(), the OS can reorder writes, leaving meta pages that
refer to trees whose data pages were never written or were overwritten.
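
The reordering hazard can be shown with a simplified commit sequence. This is a sketch under stated assumptions, not mdb's actual commit path: commit() and write_page() are hypothetical helpers, pages are addressed as page number times page size, and the do_sync flag stands in for the difference between normal operation and dbnosync. fdatasync() serves as an ordering barrier, so data pages are durable before the meta page that refers to them is written.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Write one full page at page number `pgno`. */
static int write_page(int fd, long pagesize, const void *buf, uint64_t pgno)
{
    ssize_t n = pwrite(fd, buf, (size_t)pagesize, (off_t)pgno * pagesize);
    return n == pagesize ? 0 : -1;
}

/* Hypothetical commit: data pages first, then the meta page pointing at them.
 * With do_sync, fdatasync() forces the data pages to disk before the meta
 * page is written. Without it (the dbnosync case), all writes sit in the
 * page cache and the OS may flush the meta page before the data pages. */
static int commit(int fd, long pagesize,
                  const void *const *pages, const uint64_t *pgnos, size_t npages,
                  const void *meta_page, uint64_t meta_pgno, bool do_sync)
{
    for (size_t i = 0; i < npages; i++)
        if (write_page(fd, pagesize, pages[i], pgnos[i]) != 0)
            return -1;
    if (do_sync && fdatasync(fd) != 0) /* barrier: data before meta */
        return -1;
    if (write_page(fd, pagesize, meta_page, meta_pgno) != 0)
        return -1;
    if (do_sync && fdatasync(fd) != 0) /* make the new meta durable */
        return -1;
    return 0;
}
```

With do_sync false, nothing forces the data-page writes to reach disk before the meta page, which is the window a delayed-sync scheme has to bound.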

This should not be a new keyword. Just implement the <size> feature of the checkpoint keyword.

syncdelay <count> will only sync every <count> or maybe <count>/2
commits. It'll need 4 meta pages, of which 2 may refer to unsynced
data pages. mdb_env_sync() may then need to write a "synced" flag
to the current meta page, or do a dummy write transaction which
sets a "synced data pages" flag in its meta page. The latter
would have to wait out any existing/pending write transactions.
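
The recovery rule implied by the 4-meta-page proposal can be sketched as follows, assuming each meta carries a hypothetical synced flag that is set only after an fdatasync() has covered its data pages. newest_synced() picks the meta that is safe to open after a crash; next_slot() is one possible slot policy (not necessarily the one the proposal intends) that never overwrites that meta before the next sync. Neither function exists in mdb, and free-page reuse is not modeled.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_METAS 4 /* the proposal: 4 meta pages */

/* Hypothetical meta record for the proposed syncdelay scheme. */
struct meta {
    uint64_t txnid;
    bool synced; /* hypothetical "synced data pages" flag */
};

/* Newest meta whose data pages are known to be on disk: the only one that
 * is safe to open after a crash. Returns its slot, or -1 if none. */
static int newest_synced(const struct meta metas[NUM_METAS])
{
    int best = -1;
    for (int i = 0; i < NUM_METAS; i++)
        if (metas[i].synced && (best < 0 || metas[i].txnid > metas[best].txnid))
            best = i;
    return best;
}

/* Slot for the next commit: the oldest meta, skipping the newest synced one,
 * so a crash before the next sync still leaves a safe meta to fall back to. */
static int next_slot(const struct meta metas[NUM_METAS])
{
    int keep = newest_synced(metas), pick = -1;
    for (int i = 0; i < NUM_METAS; i++) {
        if (i == keep)
            continue;
        if (pick < 0 || metas[i].txnid < metas[pick].txnid)
            pick = i;
    }
    return pick;
}
```

After a crash, any meta newer than the one newest_synced() returns may point at data pages that never reached disk, so it is ignored. Losing the flag write itself only makes a meta look unsynced, which costs recent commits but not consistency.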

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/