
Re: Dynamic Re: commit: ldap/servers/slapd entry.c proto-slap.h slap.h zn_malloc.c



> This sounds interesting. Problem I've always found with schemes
> like this is the lack of a system-wide policy for memory usage.
> e.g. your mechanism appears to have slapd say 'hey, seems
> to be paging going on, let's shrink my caches so that paging stops'.
> For some situations this would be the right thing to do, but
> in other cases it wouldn't (e.g. when the 'right thing' is for
> slapd to have all the memory it wants, but the filesystem cache
> should shrink or Oracle should shrink).

The adaptive caching consists of two parts: mechanism and policy.
The zone-based allocator provides the mechanism by which slapd can
resize its entry cache. In the previous design, where entry cache
entries were allocated with malloc, the pages backing the entry cache
could not be detached from slapd once allocated, and the page LRU
scheme was not very effective because data structures with different
referential-locality characteristics (Entry / EntryInfo) were
collocated on the same pages.
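
To make the mechanism concrete, here is a minimal sketch of the idea
(hypothetical names and structure, not the actual zn_malloc.c
interface): Entry/DBT data live in a dedicated mmap'ed zone, so whole
pages can be handed back to the kernel when the cache shrinks, which
is impossible for memory that malloc merely returns to its free list.

    #include <sys/mman.h>
    #include <stddef.h>

    typedef struct zone_heap {
        char   *base;   /* start of the mmap'ed region */
        size_t  size;   /* current size in bytes, page-aligned */
    } zone_heap;

    static int zone_init(zone_heap *z, size_t size)
    {
        z->base = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (z->base == MAP_FAILED)
            return -1;
        z->size = size;
        return 0;
    }

    /* Shrink the zone: unmapping the tail truly detaches the pages
     * from the process instead of leaving them on a malloc free
     * list where they still count against the address space. */
    static int zone_shrink(zone_heap *z, size_t new_size)
    {
        if (new_size >= z->size)
            return 0;
        if (munmap(z->base + new_size, z->size - new_size) != 0)
            return -1;
        z->size = new_size;
        return 0;
    }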

Different policies should be implemented for different situations. The
policy implemented in this commit takes into account the double
buffering between the entry and database caches and treats the entry
cache as a faster access method that is mainly exploitable when free
memory is available. Hence it does not simply give up all of its
memory to other applications. For system-wide policy enforcement, we
can rely on kernel mechanisms such as CKRM (Class-based Kernel
Resource Manager; http://ckrm.sf.net), which enables proportional
sharing of system resources, including memory pages on the LRU list,
among multiple classes on Linux.

> Mostly I've encountered deployments where the server
> I develop is the only major application running on the
> machine, and in that case the filesystem cache is enemy #1
> (on some OS'es it'll do crazy things like cache the contents
> of the transaction log files, which by definition will never be
> read ever again). Battling the filesystem cache, and the
> VM system (e.g. Solaris will engage in a frenzy of
> write-back from the BDB page pool region to the backing
> file unless you park it in a tmpfs filesystem) for control of
> physical memory has always been a pain.
>
> However, in the scenarios that you're designing for, where
> there are many applications and possibly many OS instances
> on the same machine, the lack of system-wide enforceable
> memory policy seems to be even more problematic.

Even with system-wide policy enforcement, it is still essential to
make applications aware of their memory resource usage for improved
performance.

> I guess I wondered if you'd considered asking for kernel
> enhancements to 'fix' this problem ? I'm thinking of a mechanism
> where applications are registered with a central memory
> policy arbiter. The sysadmin configures that thing: filesystem
> cache is allowed 1/2 of all physical memory, and more if
> no other application wants to use it; slapd is allocated a
> minimum of 1Gbyte, and more if it's free, split 50:50 with
> the filesystem cache; Oracle gets 500Meg and no more; etc etc.

In CKRM, we can specify a guarantee and a limit on the memory resource
for each class. The guarantee is the minimum number of physical pages
a class is guaranteed to get, and the limit is the maximum number of
pages it can get.
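
For illustration only (I am recalling CKRM's rcfs interface from
memory, so the exact file names and attribute syntax may differ),
creating a class for slapd with a memory guarantee and limit would
look something like:

    mkdir /rcfs/taskclass/slapd
    echo "res=mem,guarantee=250000,limit=500000" > /rcfs/taskclass/slapd/shares
    echo $SLAPD_PID > /rcfs/taskclass/slapd/target

with the values following the guarantee/limit semantics in pages
described above.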

> Then the application can interact with the kernel such that
> the policy is enforced: it can request memory with some
> flags indicating whether this memory is permanent or shrinkable,
> and then if shrinkage is required the kernel can inform the application
> that it needs to free up some non-permanent memory.

The current adaptive caching monitors sudden increases in latency and
in system swapping activity to decide whether to resize the entry
cache. Another way of doing this would be to monitor the ratio between
the address space size and the RSS.
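
A rough sketch of that alternative heuristic on Linux (this is not
the committed code, and the 0.9 threshold is just an assumption):
read /proc/self/statm, whose first two fields are the address space
size and the RSS in pages, and shrink the cache when the ratio drops.

    #include <stdio.h>

    /* Returns RSS/VSZ in (0,1]; when it falls well below 1.0
     * (say under 0.9), part of the address space has been paged
     * out and the entry cache is a candidate for shrinking. */
    static double rss_ratio(void)
    {
        unsigned long vsz, rss;
        FILE *f = fopen("/proc/self/statm", "r");
        if (f == NULL)
            return 1.0;            /* assume no pressure if unknown */
        if (fscanf(f, "%lu %lu", &vsz, &rss) != 2 || vsz == 0) {
            fclose(f);
            return 1.0;
        }
        fclose(f);
        return (double)rss / (double)vsz;
    }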

> Mechanisms like this already exist inside OS'es (primarily for
> filesystem cache dynamic sizing), but I've never seen this done
> in userland. Perhaps these things are available in mainframe OS'es ?

Well... applications can benefit from knowledge about physical memory
resources. I believe the use of madvise()-style hints is nothing new
in large-scale databases. Also, as you can see in the code, the
dynamically resizable zone-based memory allocator does not add
complexity or overhead. (In fact, there is one additional memory copy
of the DBT returned from BDB, but it is only needed on entry cache
misses, so it should not affect hit performance.)
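
As a sketch of such a hint (illustrative, not taken from the commit),
the allocator can tell the kernel that a page-aligned tail of a zone
is disposable, so its pages become reclaimable without even unmapping
the address range:

    #include <sys/mman.h>
    #include <stddef.h>

    /* addr must be page-aligned; after this call the kernel may
     * reclaim the pages immediately, and an anonymous mapping
     * reads back as zero-filled if it is touched again. */
    static int zone_release_pages(void *addr, size_t len)
    {
        return madvise(addr, len, MADV_DONTNEED);
    }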

> BTW, I'm not sure I understand the distinction you draw between
> the entry cache and the BDB page pool with respect to VM and paging
> in your paper: they're basically both the same --- physical memory
> that user-space code allocates. The page pool can be configured
> such that it's backed by a region file as opposed to a system page file.
> Is that the difference ? I guess I'm not sure why this is significant
> because
> write-back happens in both cases, doesn't it ? e.g. when a page is read
> by BDB from a data file into the page pool, it'll get written back to the
> backing file (the region file in this case) just the same as when slapd
> malloc's the memory for a new entry in the entry cache.
> Under memory pressure, the physical pages occupied by mpool
> and by entry cache are treated identically, no ? (they'd get paged out
> and a recently created page in the mpool would be just as dirty
> as a recently created entry in the entry cache, so they'd both need
> written back). Were you thinking that the mpool maps the BDB
> data files directly ? It doesn't work that way, except in certain
> limited cases, when the database is marked as read-only.
> The difference between slapd's AVL-tree access method and
> BDB's hashing in the mpool probably will help reduce the
> number of pages touched per access though.
> Anyway, I was just curious as to what I was missing here,
> because I've sure seen plenty of paging caused by too-large
> mpool configured ;)

The experiments show that the BDB cache performs significantly better
than the entry cache once swapping occurs. It is not easy to pinpoint
the exact reasons, because I'm not a BDB expert, but the main factor I
attribute the performance gap to is the difference in the locality of
page references between the BDB cache and the entry cache: the actual
working set of the BDB cache appears to be much smaller than that of
the entry cache. The new entry cache design in the adaptive caching
also reduces the working set of the entry cache, since the zone memory
heap contains only the Entry and DBT structures, separating them from
the EntryInfo AVL tree in the normal heap. And since resizing the
entry cache becomes efficient with the zone-based adaptive cache, one
can rely more heavily on the entry cache, shifting memory from the BDB
cache to the entry cache, in configurations where swapping is hard to
avoid.
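
As an illustration of that shift (the numbers are made up), one would
cap the BDB cache in DB_CONFIG and raise the entry cache limit in the
back-bdb section of slapd.conf:

    # DB_CONFIG: shrink the BDB cache to 64MB (0 GB + 67108864 bytes,
    # 1 cache segment)
    set_cachesize 0 67108864 1

    # slapd.conf, back-bdb database section: cache more entries in
    # the entry cache instead
    cachesize 100000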

- Jong-Hyuk