
Re: better malloc strategies?

Howard Chu wrote:
Running some other benchmarks today I saw something rather odd - after slapd had been running for a while, and the entry cache had been churned around, slapd would sit at 100% CPU on a new incoming connection for several minutes, inside the malloc() function, before eventually progressing. At a guess, memory has become badly fragmented due to so many entries being added and freed from the entry cache, and allocating small blocks (only 18 bytes, for the connection peername in this case) gets to be a real problem.

I've played with libhoard in the past and gotten mixed results. I wonder if this is just a particularly bad version of glibc, or something we really have to worry about. (RHEL4 based system, kernel 2.6.9-22, glibc 2.3.4, AMD64 machine, 6GB of RAM free out of 32GB at the time.)

As a first cut, I plan to recycle Entry and Attribute structures on our own free lists. That ought to reduce some of the general malloc contention, as well as some degree of the churn. Will be testing this in the next few days.

This approach helped a fair amount. Pre-allocating large chunks of memory to divvy up into the Entry and Attribute lists eliminates the per-alloc malloc library overhead for these structures. Since glibc's malloc performance decreases as the number of allocated objects increases, this turns out to be an important win. But over the course of hundreds of runs, the slapd process size continues to grow. Of course, things as innocuous as syslog() also contribute to the problem, since they malloc stdio buffers for formatting their messages.

One downside is that right now it's a very simple-minded list with a single mutex protecting the list head. So while malloc may have some measure of thread scalability, this approach doesn't really. I guess the saving grace here is that allocs and frees are extremely simple, so the locks won't be held for long.

The simplicity of the code has helped boost performance a few percent. It remains to be seen whether this will scale beyond a few CPUs.
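
For anyone following along, the free-list scheme described above can be sketched roughly as follows. This is only an illustration and not the actual slapd code: the `entry` struct, `ENTRY_CHUNK`, and the function names are all made up for the example, and the real Entry structure has different fields. It assumes POSIX threads.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical: number of entries carved out of each malloc'd chunk. */
#define ENTRY_CHUNK 1024

/* Hypothetical stand-in for slapd's Entry; only the link field matters here. */
typedef struct entry {
    struct entry *next_free;   /* links free entries; unused while allocated */
    char payload[64];          /* placeholder for the real fields */
} entry;

/* Single list head guarded by a single mutex, as described in the text. */
static entry *entry_free_head = NULL;
static pthread_mutex_t entry_free_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Carve one large chunk into entries and push them all onto the free list.
 * Caller must hold entry_free_mutex. Returns 0 on success, -1 on failure. */
static int entry_refill_locked(void)
{
    entry *chunk = malloc(ENTRY_CHUNK * sizeof(*chunk));
    if (chunk == NULL)
        return -1;
    for (size_t i = 0; i < ENTRY_CHUNK; i++) {
        chunk[i].next_free = entry_free_head;
        entry_free_head = &chunk[i];
    }
    return 0;
}

/* Pop an entry off the free list, refilling from malloc only when empty. */
entry *entry_alloc(void)
{
    entry *e = NULL;
    pthread_mutex_lock(&entry_free_mutex);
    if (entry_free_head != NULL || entry_refill_locked() == 0) {
        e = entry_free_head;
        entry_free_head = e->next_free;
    }
    pthread_mutex_unlock(&entry_free_mutex);
    return e;
}

/* Push an entry back onto the free list; it is never handed back to malloc. */
void entry_release(entry *e)
{
    assert(e != NULL);
    pthread_mutex_lock(&entry_free_mutex);
    e->next_free = entry_free_head;
    entry_free_head = e;
    pthread_mutex_unlock(&entry_free_mutex);
}
```

The lock is held only for a couple of pointer assignments on the common path, which is the "saving grace" mentioned above. Note that in this sketch the chunks are never returned to the system, so the lists only ever grow to the high-water mark.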

Another alternative that looks very promising is Sun's libumem, which has been ported to Linux and Windows (http://sourceforge.net/projects/umem/). Unfortunately the code there is not packaged and ready to use. It has some autoconf machinery, but none of it bootstraps cleanly; it takes a lot of manual intervention to even get automake through it. But the fair amount of hacking that's required appears to be worth it; the library seems to suffer no degradation through continuous querying over long periods of time. If only it didn't rely on so many deep-system and CPU-dependent features; as it stands, porting to anything non-x86 will be a pain.

Comparing what the authors have accomplished here with the goals Jong had for zone-malloc, it's very tempting to think about adopting the library and using the umem-specific APIs for managing our object caches. But given the porting issues I guess it's not realistic to consider that any time soon.

 -- Howard Chu
 Chief Architect, Symas Corp.  http://www.symas.com
 Director, Highland Sun        http://highlandsun.com/hyc
 OpenLDAP Core Team            http://www.openldap.org/project/