Re: MDB microbenchmark
Howard Chu wrote:
> Another update - http://highlandsun.com/hyc/mdb/microbench/MDB-fs.ods is an
> OpenOffice spreadsheet tabulating the results from running the benchmarks
> across many different filesystems. You can compare btrfs, ext2, ext3, ext4,
> jfs, ntfs, reiserfs, xfs, and zfs to see which is best for the database
> workloads being tested. In addition, ext3, ext4, jfs, reiserfs, and xfs are
> tested in a 2nd configuration, with the journal stored on a tmpfs device, to
> show how much overhead the filesystem's journaling mechanism imposes.
> The hard drive used is the same as in the main benchmark document, attached
> via eSATA to my laptop. The filesystems were created fresh for each test. The
> tests are only run once each due to the great length of time needed to collect
> all of the data. (It takes several minutes just to run mkfs for some of these
> filesystems...) You will probably want to toggle through the tests in cell B13
> of the spreadsheet to get the best view of the results.
> With this drive, jfs with an external journal is the clear winner when you
> need fully synchronous transactions. If you can tolerate some degree of
> asynchronous operation, plain old ext2 is still the fastest for writes.
If you're dedicating an entire filesystem to an MDB database, it may make
sense to just use ext2 (or turn off metadata journaling in ext3/4). In that
case, you would want to preallocate all of the disk space for the DB. Once all
of the space has been allocated and the FS has been cleanly sync'd, there
would be no further structural metadata updates to worry about. I.e., in a
subsequent unclean shutdown, fsck would have no work to do.
From that point on, the only metadata updates would be updating the inode
mtime on write operations.
Note that just setting the file size (using ftruncate()) is inadequate, since
that would just create a sparse file. Using fallocate() would also only partly
serve the purpose (assuming it's even implemented on ext2), because fallocate()
marks the allocated space as unused. (So the first time a page is referenced,
the FS still needs to perform a metadata update to note that the page is now
in use.)
The only suitable approach here is actually writing data to fill out the size
of the file. (Which is also rather unfortunate, particularly if you're using
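To illustrate the "actually writing data" approach described above, here is a minimal C sketch, not taken from MDB itself. The function names prealloc_by_writing() and file_usage(), the 1 MiB chunk size, and the use of O_EXCL (so an existing file is never overwritten with zeros) are all assumptions made for this example. It only covers the filesystem-level step for a brand-new file; how MDB would treat a file prepared this way is not addressed here.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create `path` and fill it with `size` zero bytes using
 * real write() calls, so the filesystem must allocate every block (unlike
 * ftruncate(), which would leave a sparse file). Returns 0 on success, -1 on
 * error with errno set. */
static int prealloc_by_writing(const char *path, off_t size)
{
    enum { CHUNK = 1 << 20 };          /* 1 MiB per write; arbitrary choice */
    char *buf;
    off_t done = 0;
    int fd, rc = -1;

    if (size < 0) { errno = EINVAL; return -1; }
    buf = calloc(1, CHUNK);            /* zero-filled write buffer */
    if (buf == NULL) return -1;

    /* O_EXCL: refuse to touch a file that already exists. */
    fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0) { free(buf); return -1; }

    while (done < size) {
        size_t want = (size - done) < CHUNK ? (size_t)(size - done) : CHUNK;
        ssize_t n = write(fd, buf, want);
        if (n < 0) {
            if (errno == EINTR) continue;   /* interrupted; retry */
            goto out;
        }
        if (n == 0) { errno = EIO; goto out; }
        done += n;                          /* short writes are fine */
    }
    /* Flush data and the resulting allocation metadata to disk. */
    if (fsync(fd) != 0) goto out;
    rc = 0;
out:
    if (close(fd) != 0) rc = -1;
    free(buf);
    return rc;
}

/* Hypothetical helper: report the logical size and the space actually
 * allocated. st_blocks is counted in 512-byte units on Linux. */
static int file_usage(const char *path, off_t *logical, off_t *allocated)
{
    struct stat st;
    if (stat(path, &st) != 0) return -1;
    *logical = st.st_size;
    *allocated = (off_t)st.st_blocks * 512;
    return 0;
}
```

Comparing the two values from file_usage() is a rough way to confirm the file is not sparse: for a file created with ftruncate() alone, the allocated figure would typically be far below the logical size. Filesystems that compress or deduplicate data may report something different, so treat it as a sanity check only.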
> MDB read speed is largely independent of FS type. I believe any variation in
> the reported speeds here is just measurement noise.
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/