
Re: OpenLDAP high CPU usage when performing mass changes



Maucci, Cyrille wrote:
Hi Howard,

At a guess, based on the minimal amount of information here, you've run into the glibc malloc fragmentation issue,
and switching to tcmalloc might avoid the problem.
What's the quickest way to validate this on the running-at-99%-slapd, prior to falling back on tcmalloc?

Get the gdb stack trace.

Can the process's smaps reveal this? For example, if we're seeing lots of 64MB regions?

Get the gdb stack trace.

Don't guess.

Get the gdb stack trace.

Don't bother with other ineffective diagnostic tools.

Get the gdb stack trace.

Don't google for the symptoms.

Get the gdb stack trace.

Whatever is going on, if the CPU is near 100%, then most likely whatever non-idle thread you see in the stack trace is going to show you the location of the problem.

malloc is normally a fast operation. The chance of you catching slapd inside malloc on any random stack trace is usually near zero, when all's well. If you catch slapd inside glibc malloc during one of these 100% CPU instances, then that's a fair indication. If you resume and then get another trace a few seconds later, and the trace looks the same, then that's pretty conclusive.
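
As a rough illustration of the two-sample check described above, here is a minimal Python sketch. It assumes gdb is installed, that you have permission to attach to the slapd process, and that gdb prints its usual "Thread N (... (LWP n)):" headers and "#N ... in func ()" frame lines. The helper names (capture_backtrace, allocator_threads, sample_twice) and the list of allocator symbols are illustrative choices, not part of OpenLDAP or gdb. Note that attaching gdb pauses the process briefly.

```python
import re
import subprocess
import time

# Illustrative (not exhaustive) set of glibc allocator entry points and internals.
ALLOCATOR_FRAMES = {
    "malloc", "calloc", "realloc", "free",
    "__libc_malloc", "_int_malloc", "_int_free", "malloc_consolidate",
}

# Assumed gdb output shapes: thread headers and backtrace frame lines.
THREAD_RE = re.compile(r"^Thread \d+ \(.*LWP (\d+)\)")
FRAME_RE = re.compile(r"^#\d+\s+(?:0x[0-9a-fA-F]+\s+in\s+)?([^\s(]+)")


def capture_backtrace(pid):
    """Attach gdb to pid, dump every thread's stack, then detach."""
    result = subprocess.run(
        ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


def parse_threads(trace):
    """Map each thread's LWP id to its function names, innermost frame first."""
    threads = {}
    current = None
    for raw in trace.splitlines():
        line = raw.strip()
        header = THREAD_RE.match(line)
        if header:
            current = header.group(1)
            threads[current] = []
            continue
        frame = FRAME_RE.match(line)
        if frame and current is not None:
            threads[current].append(frame.group(1))
    return threads


def allocator_threads(trace, depth=3):
    """Return sorted LWP ids whose innermost `depth` frames include an allocator symbol."""
    return sorted(
        lwp for lwp, frames in parse_threads(trace).items()
        if any(name in ALLOCATOR_FRAMES for name in frames[:depth])
    )


def sample_twice(pid, delay=5.0):
    """Take two traces `delay` seconds apart; return threads in the allocator both times."""
    first = set(allocator_threads(capture_backtrace(pid)))
    time.sleep(delay)
    second = set(allocator_threads(capture_backtrace(pid)))
    return sorted(first & second)
```

Calling sample_twice with the slapd pid during one of the high-CPU episodes returns the threads that were inside the allocator in both traces. A non-empty result matches the "pretty conclusive" case; an empty one means the raw traces should be read to see where the busy threads actually are.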



Thanks
++Cyrille

-----Original Message-----
From: openldap-technical-bounces@OpenLDAP.org [mailto:openldap-technical-bounces@OpenLDAP.org] On Behalf Of Howard Chu
Sent: Friday, March 16, 2012 8:32 AM
To: Jeffrey Crawford
Cc: OpenLDAP technical list
Subject: Re: OpenLDAP high CPU usage when performing mass changes

Jeffrey Crawford wrote:
We are using openldap 2.4.26 with BDB 4.8 and have replication set up
in mirror mode for our main ldap database. There are a couple of other
replicas that have a subset of the data that the main cluster has but
we are seeing the following behavior on all of them.

When performing mass updates via LDAP, let's say on the order of 30,000
entries being added to existing entries, we've noticed that the CPU
use of the slapd instances goes through the roof (between 65% and 95%
continuously), and seems to stay there until they are restarted.

When the CPU usage goes high like that it should be pretty easy to see where it's going, by getting a gdb stack trace of the running process.

At a guess, based on the minimal amount of information here, you've run into the glibc malloc fragmentation issue, and switching to tcmalloc might avoid the problem.

The problem is that this system has to be highly available, even for
writing, and when these updates "shock" the system, the response time
degrades badly while the processes are churning like that. I don't think
they are trying to catch up to the data changes, because if I let them
run a while after the updates are done (about 1 hour) and then
restart the instances, they go back to their normal state.

If you have the SYNC loglevel enabled, it should be obvious whether update traffic is the cause or not.

So far the only way I've been able to mitigate the issues is to
reconfigure our ldap proxy instances to a machine that is having less
trouble, restart the instances that are chugging along, then repoint
the proxies back to the one just started, and start the others. Not exactly a quick operation.

I've played with cache settings for both OpenLDAP and BDB and have
gotten the frequency of this issue reduced but I can't seem to get rid
of it completely and it shows up quite often after large data
manipulations. I'm at a loss as to how to debug this since nothing is
crashing. Any suggestions on how to find out what's causing this would
be very helpful. The logs are not throwing any warnings or posting
messages that would seem out of the ordinary and I have played with
the log settings but nothing seems to relate to anything that might explain why we are seeing CPU usage go so high.

I would suggest you try out back-mdb in RE24. MDB uses 1/4 the total memory of BDB and it performs far fewer mallocs, so glibc malloc fragmentation should not be a problem. (I would have suggested 2.4.30, but the ITS#7190 fix is rather important if you have large volumes of delete operations. The other MDB-related ITSs, #7191 and #7196, are only crucial for non-X86 and non-Linux platforms.)



--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/