LMDB crash consistency, again

This paper https://www.usenix.org/conference/osdi14/technical-sessions/presentation/zheng_mai describes a potential crash vulnerability in LMDB due to its use of fdatasync instead of fsync when syncing writes to the data file. The vulnerability exists because fdatasync omits syncs of the file metadata; if the data file needed to grow as a result of any writes then this requires a metadata update.

This is a well-understood issue in LMDB; we briefly touched on it in this earlier email thread http://www.openldap.org/lists/openldap-technical/201402/msg00111.html and it's been a topic of discussion on IRC ever since the first multi-FS microbenchmarks we conducted back in 2012. http://symas.com/mdb/microbench/july/

It's worth noting that this vulnerability doesn't exist on Windows, MacOSX, Android, or *BSD, because none of these OSs have a function equivalent to fdatasync in the first place - they always use fsync (or the Windows equivalent). (Android is an oddball; the underlying Linux kernel of course supports fdatasync, but the C library, bionic, does not.)

We have a few possible approaches for Linux:
1) provide an option to preallocate the file, using fallocate(). Unfortunately this doesn't completely eliminate metadata updates - filesystem drivers tend to try to be "smart" and make fallocate cheap; they allocate the space in the FS metadata but they also mark it as "unseen." The first time a process accesses an unseen page, it gets zeroed out. Up until that point, whatever old contents the disk page held are still present on disk. Changing a page from "unseen" to "seen" requires a metadata update of its own.

We had a discussion of this FS mis-feature a while ago, but it was fruitless.

2) preallocate the file by explicitly writing zeros to it. This has a couple of disadvantages of its own: a) on SSDs, doing such a write needlessly contributes to wearout of the flash. b) Windows detects all-zero writes and compresses them out, creating a sparse file, thus defeating the attempt at preallocation.

3) track the allocated size of the file, and toggle between fsync and fdatasync depending on whether the allocated size actually grows or not. This is the approach I'm currently taking in a development branch. Whether we add this to a new 0.9.x release, or just in 1.0, I haven't yet decided.
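
For illustration only, here is a minimal sketch of the idea in 3). It is not the code in the development branch. The names sync_tracker, tracker_init, tracker_sync and demo_sync_choice are hypothetical, and it uses fstat()'s st_size as a stand-in for "allocated size", which assumes the file is not sparse. A real implementation would more likely track the size internally than call fstat() on every commit.

```c
#define _POSIX_C_SOURCE 200809L

#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper state; these are NOT LMDB's own structures. */
typedef struct {
    int   fd;
    off_t synced_size;  /* file size as of the last full fsync() */
} sync_tracker;

/* Record the current size and make its metadata durable once. */
int tracker_init(sync_tracker *t, int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0 || fsync(fd) != 0)
        return -1;
    t->fd = fd;
    t->synced_size = st.st_size;
    return 0;
}

/*
 * Sync the file, choosing the call by whether the file grew.
 * Returns 1 if fsync() was used, 0 if fdatasync() was used,
 * and -1 on error.
 */
int tracker_sync(sync_tracker *t)
{
    struct stat st;

    if (fstat(t->fd, &st) != 0)
        return -1;
    if (st.st_size > t->synced_size) {
        /* The file grew, so its metadata changed: full fsync(). */
        if (fsync(t->fd) != 0)
            return -1;
        t->synced_size = st.st_size;
        return 1;
    }
    /* Size unchanged: the cheaper fdatasync() is enough here. */
    if (fdatasync(t->fd) != 0)
        return -1;
    return 0;
}

/*
 * Usage sketch: write one page (the file grows), then overwrite it
 * in place (the file does not grow). Returns 0 if the first sync
 * chose fsync() and the second chose fdatasync(), -1 otherwise.
 */
int demo_sync_choice(const char *path)
{
    char page[4096];
    sync_tracker t;
    int grew, steady;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    memset(page, 0xAB, sizeof page);
    if (tracker_init(&t, fd) != 0)
        goto fail;
    if (pwrite(fd, page, sizeof page, 0) != (ssize_t)sizeof page)
        goto fail;
    grew = tracker_sync(&t);
    if (pwrite(fd, page, sizeof page, 0) != (ssize_t)sizeof page)
        goto fail;
    steady = tracker_sync(&t);
    close(fd);
    unlink(path);
    return (grew == 1 && steady == 0) ? 0 : -1;
fail:
    close(fd);
    unlink(path);
    return -1;
}
```

The point of the split is that the common case, overwriting pages inside the already-allocated file, keeps the cheaper fdatasync, and only commits that extend the file pay for the full fsync.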

As another footnote, I plan to add support for LMDB on a raw partition in 1.x. Naturally, fsync vs fdatasync will be irrelevant in that case.

  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/