
libmdb freelist and overflow pages



Quanah Gibson-Mount wrote:
--On Monday, November 05, 2012 10:10 AM -0800 Howard Chu <hyc@symas.com> wrote:
So the issue is how to find a contiguous run of pages large enough to
satisfy the overflow page, in the current freelist. This takes us into
the realm of malloc algorithms, first-fit/best-fit/..., etc.

I think first we scan whatever freelist we have in memory, to see if a
suitable run of pages is already present.

If not, and there are additional freelists still available:
    1) we could just merge all of them, and then search again
or
    2) merge one at a time, and search again

Leaning toward #2, I suspect we don't need to coalesce all freelists all
the time.
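As a rough illustration of option #2, here is a minimal sketch, not actual libmdb code. It assumes each freelist is a plain array of page numbers sorted in ascending order with no duplicates, and that `cur` and `scratch` are large enough to hold every list merged together. The names `pgno_t`, `find_run`, `merge_lists` and `alloc_run` are hypothetical, chosen for this example only. The search is first-fit; a best-fit variant would keep scanning for the smallest adequate run.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef unsigned long pgno_t;   /* hypothetical page-number type */

/* First-fit search: return the index in `pages` (ascending, no duplicates)
 * where a run of `want` consecutive page numbers starts, or -1 if none. */
static long find_run(const pgno_t *pages, size_t n, size_t want)
{
    size_t start = 0;
    assert(want > 0);
    for (size_t i = 0; i < n; i++) {
        if (i > 0 && pages[i] != pages[i - 1] + 1)
            start = i;                  /* run broken, restart here */
        if (i - start + 1 >= want)
            return (long)start;
    }
    return -1;
}

/* Merge two ascending lists into `out` (capacity na + nb); returns count. */
static size_t merge_lists(const pgno_t *a, size_t na,
                          const pgno_t *b, size_t nb, pgno_t *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (a[i] < b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
    return k;
}

/* Option #2: search the in-memory list first; on a miss, merge in one
 * more saved freelist and search again, stopping at the first hit.
 * Returns the start index in `cur` or -1; `*merged` reports how many
 * extra lists had to be pulled in. */
static long alloc_run(pgno_t *cur, size_t *ncur, pgno_t *scratch,
                      const pgno_t *const *lists, const size_t *lens,
                      size_t nlists, size_t want, size_t *merged)
{
    long at = find_run(cur, *ncur, want);
    size_t used = 0;
    while (at < 0 && used < nlists) {
        size_t n = merge_lists(cur, *ncur, lists[used], lens[used], scratch);
        memcpy(cur, scratch, n * sizeof *cur);
        *ncur = n;
        used++;
        at = find_run(cur, *ncur, want);
    }
    if (merged)
        *merged = used;
    return at;
}
```

Option #1 would be the same loop with the search moved after all the merges. The point of merging one list at a time is that the loop can stop as soon as a run turns up.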

I like the sound of #2 as well.  If you come up with a patch, I can test. ;)

--Quanah

Well, just like last LinuxCon, we've had some new input from this LinuxCon.
Theodore Ts'o (ext4 lead developer) raised the topic of Erase Blocks on flash-based storage devices. If we can ensure that our page allocations are aligned with the Erase Block size of the data store, we'll get higher write throughput on SSDs, MMC cards, etc. Erase Blocks are commonly 32KB or 64KB today, with 128KB coming soon.

So, we may want to think about chunking up our page allocations into power-of-two chunks. Perhaps as a separate environment flag setting.
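To make the arithmetic concrete, here is a small sketch of the chunk calculation. It is not libmdb code. It assumes a 4KB page size for the example figures and requires both sizes to be powers of two. The names `pgno_t`, `pages_per_chunk`, `align_up` and `chunks_for` are hypothetical. It also only aligns page numbers within the data file; whether those file offsets line up with the device's physical erase blocks depends on the filesystem and partition layout, which this does not address.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long pgno_t;   /* hypothetical page-number type */

/* Pages per erase block, e.g. 64KB / 4KB = 16.
 * Both sizes must be powers of two. */
static size_t pages_per_chunk(size_t erase_block, size_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    assert(erase_block >= page_size &&
           (erase_block & (erase_block - 1)) == 0);
    return erase_block / page_size;
}

/* Round a page number up to the next chunk boundary.
 * `chunk` must be a power of two. */
static pgno_t align_up(pgno_t pg, size_t chunk)
{
    return (pg + chunk - 1) & ~((pgno_t)chunk - 1);
}

/* Number of whole chunks needed to hold `npages` pages. */
static size_t chunks_for(size_t npages, size_t chunk)
{
    return (npages + chunk - 1) / chunk;
}
```

With 64KB erase blocks and 4KB pages, a chunk is 16 pages, so a 17-page overflow request would take two chunks and start on a page number that is a multiple of 16.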

Even if we don't explicitly try to form 64K chunks all the time, it may be best for us to fully coalesce all available freelists whenever a request for an overflow page arrives. Our default case of single-page requests will continue as before. Overflow pages will have some chance of reusing old pages; when no suitable run is found, they'll just use new pages, as they currently do.

Interestingly, the Red Hat folks expressed a desire to adopt MDB in RPM, which currently uses BerkeleyDB. Ironically, they'd like an option to run in pure append-only mode, to allow rolling back to previous state if one package of a large upgrade fails, and the user decides to abandon the entire upgrade.

It would be simple enough to add an environment flag for append-only mode, which would skip all of the freelist management entirely. Not interested in looking at that yet, can address it later if/when movement happens on the RPM project.

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/