Re: better malloc strategies
- To: david_list@boreham.org
- Subject: Re: better malloc strategies
- From: Howard Chu <hyc@symas.com>
- Date: Thu, 28 Dec 2006 11:49:38 -0800
- Cc: openldap-devel@openldap.org
- In-reply-to: <45650A4A.90301@boreham.org>
- References: <200608282343.k7SNhOjt061559@cantor.openldap.org> <44F3896A.9080002@symas.com> <Pine.SOC.4.64.0608282050580.7225@toolbox.rutgers.edu> <44F64AB0.7080007@symas.com> <4563F8F2.8090109@symas.com> <4564DBF8.8010809@symas.com> <4564EA75.8000108@boreham.org> <4564F2C8.9020105@symas.com> <45650A4A.90301@boreham.org>
- User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a1) Gecko/20061216 Netscape/7.2 (ax) Firefox/1.5 SeaMonkey/1.5a
David Boreham wrote:
> Howard Chu wrote:
>> What's also interesting is that for Hoard, Umem, and Tcmalloc, the
>> multi-threaded query times are consistently about 2x slower than the
>> single-threaded case. The 2x slowdown makes sense since it's only a
>> dual-core CPU and it's doing 4x as much work. This kinda says that the
>> cost of malloc is overshadowed by the overhead of thread scheduling.
> Is it possible that the block stride in the addresses returned by
> malloc() is affecting cache performance in the glibc case?
> If they are too close I think it is possible to thrash cache lines
> between cores.
I've been tinkering with oprofile and some of the performance counters
etc... I see that with the current entry_free() that returns an entry to
the head of the free list, the same structs get re-used over and over.
This is cache-friendly on a single-core machine but causes cache
contention on a multi-core machine (because a just-freed entry tends to
get reused in a different thread). Putting freed entries at the tail of
the list avoids the contention in this case, but it sort of makes things
equally bad for all the cores. (I.e., everyone has to go out to main
memory for the structures, nobody gets any benefit from the cache.) For
the moment I'm going to leave it with entry_free() returning entries to
the head of the list.
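To make the head-versus-tail tradeoff concrete, here is a minimal single-threaded sketch of the two return policies. The `node` and `freelist` types and the function names are hypothetical stand-ins, not the actual slapd Entry or entry_free() code, and a real shared list would also need a lock around each operation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a pooled structure; not the real slapd Entry. */
typedef struct node {
    struct node *next;
    int id;
} node;

typedef struct freelist {
    node *head;   /* next node to be handed out */
    node *tail;   /* last node in the list */
} freelist;

/* LIFO policy: the most recently freed node is the next one reused,
 * so it is probably still in the cache of whichever core freed it. */
void free_to_head(freelist *fl, node *n)
{
    assert(fl != NULL && n != NULL);
    n->next = fl->head;
    fl->head = n;
    if (fl->tail == NULL)
        fl->tail = n;
}

/* FIFO policy: a freed node waits behind every other free node,
 * so by the time it is reused it has likely left every core's cache. */
void free_to_tail(freelist *fl, node *n)
{
    assert(fl != NULL && n != NULL);
    n->next = NULL;
    if (fl->tail != NULL)
        fl->tail->next = n;
    else
        fl->head = n;
    fl->tail = n;
}

/* Allocation always takes from the head; returns NULL when empty. */
node *alloc_from_head(freelist *fl)
{
    assert(fl != NULL);
    node *n = fl->head;
    if (n != NULL) {
        fl->head = n->next;
        if (fl->head == NULL)
            fl->tail = NULL;
        n->next = NULL;
    }
    return n;
}
```

With free_to_head() a node freed by one thread is the very next one handed to whichever thread allocates, which is the reuse pattern described above. With free_to_tail() the reuse is delayed by the length of the list.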
Our current Entry structure is 80 bytes on a 64-bit machine. (Only 32
bytes on a 32-bit machine.) That's definitely not doing us any favors; I
may try padding it up to 128 bytes to see how that affects things.
Unfortunately while it may be more CPU cache-friendly, it will
definitely cost us as far as how many entries we can keep cached in RAM.
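As a rough illustration of that cost, the sketch below pads an 80-byte structure out to 128 bytes and counts how many fit in a fixed budget. The `raw_entry` and `padded_entry` types are hypothetical placeholders (a plain byte array, not the real Entry layout), and the count covers only the structure itself, not the attribute data hanging off it.

```c
#include <assert.h>
#include <stddef.h>

#define ENTRY_BYTES 80    /* size quoted above for a 64-bit build */
#define TARGET_BYTES 128  /* padded size being considered */

/* Hypothetical placeholder with the same size as the 64-bit Entry. */
typedef struct {
    unsigned char payload[ENTRY_BYTES];
} raw_entry;

/* Same payload, padded so each element occupies exactly 128 bytes. */
typedef struct {
    raw_entry e;
    unsigned char pad[TARGET_BYTES - ENTRY_BYTES];
} padded_entry;

static_assert(sizeof(raw_entry) == ENTRY_BYTES, "placeholder must be 80 bytes");
static_assert(sizeof(padded_entry) == TARGET_BYTES, "padding must reach 128 bytes");

/* Number of whole structures of entry_size that fit in budget bytes. */
size_t entries_that_fit(size_t budget, size_t entry_size)
{
    return budget / entry_size;
}
```

For a 1 MiB budget this gives 13107 unpadded structures against 8192 padded ones, i.e. the padded form uses 1.6 times the memory per structure. Padding alone only helps if the array itself starts on a 128-byte boundary.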
Another possibility would be to interleave the prealloc list. (E.g., 5
stripes of stride 8 would keep everything on 128 byte boundaries.)
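The arithmetic behind those numbers can be checked with a short sketch: 8 entries of 80 bytes occupy 640 bytes, which is exactly 5 blocks of 128 bytes, so in a packed array that starts on a 128-byte boundary every 8th entry also starts on one. The helper names are hypothetical, and this only verifies the stride arithmetic; it does not implement the interleaved prealloc list itself.

```c
#include <assert.h>
#include <stddef.h>

/* Greatest common divisor (Euclid). */
size_t gcd(size_t a, size_t b)
{
    while (b != 0) {
        size_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Number of consecutive entries after which the start offset is
 * again a multiple of block bytes. */
size_t aligned_stride(size_t entry_size, size_t block)
{
    return block / gcd(entry_size, block);
}

/* Number of block-sized units covered by one such group of entries. */
size_t blocks_per_group(size_t entry_size, size_t block)
{
    return entry_size / gcd(entry_size, block);
}

/* Nonzero if entry index of a packed, block-aligned array starts
 * on a block boundary. */
int starts_on_boundary(size_t index, size_t entry_size, size_t block)
{
    return (index * entry_size) % block == 0;
}

/* Checks the figures from the message: 80-byte entries, 128-byte blocks. */
void check_message_numbers(void)
{
    assert(aligned_stride(80, 128) == 8);
    assert(blocks_per_group(80, 128) == 5);
}
```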
--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc
OpenLDAP Core Team http://www.openldap.org/project/