
Re: ch_malloc of 8388608 bytes failed (ITS#2270)



Howard,

I lowered the threads back down to a sane level (64) and got the same
ch_malloc failure the following morning.  When I did the math (for 64
threads), I should have had free address space left, even with the db cache
(which I also set back to the default by commenting out the DB_CONFIG).  I
don't think the thread count was causing the problem directly.  Raising it
was a workaround of mine, because I felt that the daemon stayed operational
longer with max threads set way up there.  I saw it crash with as few as 10
active threads.  The behavior was erratic.
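
The math referred to above can be sketched roughly as follows, using the
figures from Howard's message quoted below: 4MB of stack per thread, a 1GB
BDB cache, and about 2GB of usable address space for a 32-bit process on
Solaris.  The function name and its defaults are illustrative assumptions,
not anything taken from slapd, and the 64-thread case keeps the 1GB cache
as a pessimistic upper bound.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def address_space_budget(threads, stack_mib=4, cache_bytes=GIB,
                         usable_bytes=2 * GIB):
    """Return (bytes reserved, bytes left) for thread stacks plus BDB cache.

    Defaults are the figures quoted in the thread (assumed, not measured):
    4 MiB per thread stack, 1 GiB cache, 2 GiB usable address space.
    """
    reserved = threads * stack_mib * MIB + cache_bytes
    return reserved, usable_bytes - reserved


if __name__ == "__main__":
    failed_request = 8388608  # size from the "ch_malloc ... failed" message
    for threads in (232, 64):
        reserved, left = address_space_budget(threads)
        print(f"{threads} threads: {reserved // MIB} MiB reserved, "
              f"{left // MIB} MiB left, "
              f"room for {left // failed_request} requests of 8 MiB")
```

With 232 threads almost nothing is left over, while 64 threads should
leave several hundred megabytes, which is why the failure at 64 threads
points away from simple address-space exhaustion.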

Subsequently, I've patched the system with the following two Sun patches.

109147-21
108827-40

Then, I recompiled everything (with gcc 3.2.2 instead of 2.95.3) and linked
it using the new (via the patch) Sun ccs binaries instead of the old
GNU binutils that I was using.  This appears to have done the trick.  The
daemon hasn't crashed in over 4 days now.  It's breaking the corporate
record every minute.  The max threads are still set at 64 and, to follow
through on my testing, I plan on slowly increasing them to see ch_malloc fail
properly, as you describe, so I can see if it's the same number of bytes.  I
think this problem had something to do with the binutils that I was using
and/or the compiler on this particular architecture.

Thanks a lot for the help and encouragement.  It is very much appreciated.
Since you mentioned it, I've been playing around with a 64bit compile of
everything (just db 4.1.25 and openldap HEAD) and have successfully built
the binaries, but I'm having a problem that I'll open another case for.  It
fails the concurrency test...

This case is closed.

Thanks Again,

Joseph

----- Original Message -----
From: "Howard Chu" <hyc@highlandsun.com>
To: "'Joseph Tingiris'" <joseph.tingiris@cox.net>;
<openldap-its@OpenLDAP.org>
Sent: Monday, February 17, 2003 3:56 PM
Subject: RE: ch_malloc of 8388608 bytes failed (ITS#2270)


> Ok. Please drop your max threads parameter back down to a sane level before
> pursuing this further, because it is a fact that with the numbers you show
> your application has definitely run out of free memory. Even though your
> machine has 8GB of RAM, your process only has 2GB of usable address space;
> the other 6GB aren't helping it at all. Let's eliminate that issue so we can
> focus on the real problem.
>
> You need GCC 3.x to build usable 64-bit Solaris binaries. (I have tested
> successfully with GCC 3.1, after tweaking the GCC specs file.) But again,
> going there will only further obscure the issue. Stick with the current
> configuration. Any further changes you make will only make it harder to
> decipher what is really going on.
>
> libc_psr is the processor-specific runtime library; there is a different
> version for each type of Sparc architecture to handle any quirks in the
> different CPU implementations. That's why you only see that specific libc_psr
> being used on that machine. Do not mess with it.
>
> Leave max threads at the default of 32. Run slapd under gdb. When it aborts,
> get a full back trace of all threads.
>
>   -- Howard Chu
>   Chief Architect, Symas Corp.       Director, Highland Sun
>   http://www.symas.com               http://highlandsun.com/hyc
>   Symas: Premier OpenSource Development and Support
>
> > -----Original Message-----
> > From: Joseph Tingiris [mailto:joseph.tingiris@cox.net]
> > Sent: Monday, February 17, 2003 7:07 AM
> > To: hyc@highlandsun.com; openldap-its@OpenLDAP.org
> > Subject: Re: ch_malloc of 8388608 bytes failed (ITS#2270)
> >
> >
> > Update.  I applied the patch Kurt recommended, to no avail.  Once again, I
> > came to work this morning to my very familiar ch_malloc error.  I've
> > suspected all along this may have something to do with the fact that I had
> > built the binary using the 32 bit libraries.  But, I kind of ruled that out
> > because I don't (ever) get the ch_malloc errors on other 64bit Suns (280R,
> > for example).  It's just this one 3800 that's giving me grief.  I've played
> > around with the number of threads, DB_CONFIG parameters, and most blatantly
> > configurable options.  The reason there are so many threads is because I've
> > found that the more I allow, the longer it runs without aborting.  This
> > machine is configured with 8G real and 14G swap.  It has plenty of RAM to
> > spare.  This problem has persisted (on this machine) since its inception.
> > I've stayed current with the HEAD, here, and I'm only using BDB 4.1.25
> > compiled in to reduce the dependencies while I'm troubleshooting.  Bleh ...
> >
> > I've tried compiling HEAD and linking with Solaris' 64bit libraries but I'm
> > having issues getting it to produce a binary with gcc 2.95.3 ...  I think I
> > need to upgrade my compiler.  I'm really trying to avoid doing anything
> > radical like that until I'm sure what is causing the problem.  I haven't
> > completely ruled out OpenLDAP on very large machines like this (12CPU and
> > +20G available memory) and I'm wondering if the OS is returning (what it
> > considers) a valid pointer but it is somehow being considered out of range
> > in the code.  On the other hand, it could be the compiler or a bug in
> > Solaris on this architecture.  I've forwarded this issue (and others
> > directly related to *only* 3800s) to Sun and they assure me I am at the
> > latest revision of patches and these are a "3rd party application" issue
> > ...
> >
> > I've compiled slapd a variety of ways.  With and without mtmalloc, openssl,
> > sasl, kerberos, zlib, etc., it still produces the ch_malloc abort message.
> > I keep wondering about this one library it seems to only get linked with on
> > the 3800.  That is /usr/platform/sun4u-us3/lib/libc_psr.so.1 and I'm not
> > really sure what that does.  I've read some stuff on sunsolve about other
> > architectures having problems with their counterpart
> > (/usr/platform/Ultra-80/lib/libc_psr.so.1, for example) and some people have
> > suggested just renaming this file so it doesn't get loaded on startup.  I
> > may try that, too, just to see what happens, if nothing else.
> >
> > Today, I plan on getting a more detailed bt full on the process and possibly
> > stepping through a caught failure (it happens about every hour during peak
> > usage) to see if I can determine what function is aborting.  Maybe that'll
> > shed some light ....
> >
> > Still determined,
> >
> > Joseph
> >
> >
> > ----- Original Message -----
> > From: <hyc@highlandsun.com>
> > To: <openldap-its@OpenLDAP.org>
> > Sent: Saturday, February 15, 2003 8:14 PM
> > Subject: RE: ch_malloc of 8388608 bytes failed (ITS#2270)
> >
> >
> > > When ch_malloc fails it calls abort() to kill the process.  In your stack
> > > back trace, there are 232 threads but none of them is in the abort()
> > > routine, which I find very odd. Regardless, your problem is not due to any
> > > bug in OpenLDAP. The fact is, even though you have a 64 bit machine, you
> > > have built a 32 bit binary. So, it is limited to a 32 bit address space,
> > > and in Solaris, not all of that 32 bit space is available for user memory,
> > > only about half of it (31 bits, 2GB) is available. The default size of a
> > > thread stack has grown in OpenLDAP 2.1, but even in OpenLDAP 2.0 it was 2MB
> > > per thread. With the current 4MB per thread, times 232 threads, you have
> > > used 928MB of RAM. You are also using 1GB for your BDB cache. This alone
> > > (1.9GB) leaves practically nothing left for slapd to run with.
> > >
> > > You should decrease the maximum number of threads; creating more beyond a
> > > certain limit does not enhance concurrency anyway. You can increase your
> > > available address space by building as a pure 64 bit executable but that
> > > doesn't change the fact that having too many threads will slow you down.
> > >
> > >   -- Howard Chu
> > >   Chief Architect, Symas Corp.       Director, Highland Sun
> > >   http://www.symas.com               http://highlandsun.com/hyc
> > >   Symas: Premier OpenSource Development and Support
> > >
> > > > -----Original Message-----
> > > > From: owner-openldap-bugs@OpenLDAP.org
> > > > [mailto:owner-openldap-bugs@OpenLDAP.org]On Behalf Of
> > > > joseph.tingiris@cox.net
> > > > Sent: Wednesday, January 15, 2003 9:27 AM
> > > > To: openldap-its@OpenLDAP.org
> > > > Subject: ch_malloc of 8388608 bytes failed (ITS#2270)
> > > >
> > > >
> > > > Full_Name: Joseph Tingiris
> > > > Version: 2.1.12
> > > > OS: Solaris 8
> > > > URL: ftp://ftp.openldap.org/incoming/
> > > > Submission from: (NULL) (206.157.224.254)
> > > >
> > > >
> > > > I've read reports from other folks using Solaris who have had similar
> > > > problems, and I've tried almost everything I could find short of actually
> > > > modifying ch_malloc.c myself. It appears to be specific to multiprocessor
> > > > (3+) Sun installations.  The binaries have been compiled with -lmtmalloc
> > > > and the latest versions of all OpenLDAP dependent packages are used.  The
> > > > primary authentication mechanism is cleartext.
> > > >
> > > > Some key points:
> > > >
> > > > * This server is a replica.
> > > > * BDB-4.1 with 3.4 million DNs, 6 indexes (eq,sub)
> > > > * process stack 32k (plimit -s), DB cache 1G (via DB_CONFIG)
> > > > * this problem has persisted, on the same hardware, since openldap 2.0.12
> > > > * slapd fails at least once a day with the same error every time,
> > > >   "ch_malloc of 8388608 bytes failed"; it's always the same amount of bytes
> > > > * it appears to happen during a wildcard search, although it may be during
> > > >   some type of replication event
> > > >
> > > > Here is some info on the build environment:
> > > >
> > > > Application - OpenLdap and Dependencies:
> > > >
> > > > openldap-2.1.12
> > > > openssl-0.9.7
> > > > krb5-1.2.7
> > > > cyrus-sasl-2.1.10
> > > > db-4.1.25
> > > >
> > > > Compiler/Dev Tools:
> > > >
> > > > autoconf-2.57
> > > > automake-1.7.2
> > > > binutils-2.11.2
> > > > bison-1.75
> > > > fileutils-4.1
> > > > gawk-3.1.0
> > > > gcc-2.95.3
> > > > gdb-5.0
> > > > gdbm-1.8.0
> > > > gettext-0.10.37
> > > > glib-1.2.10
> > > > gtk+-1.2.10
> > > > libgcc-3.2
> > > > libiconv-1.6.1
> > > > libnet-1.0.2a
> > > > libpcap-0.7.1
> > > > libtool-1.4
> > > > m4-1.4
> > > > make-3.80
> > > > ncurses-5.2
> > > > slang-1.4.4
> > > > tcl-8.4.1
> > > > termcap-1.3
> > > > textutils-2.0
> > > > tk-8.4.1
> > > > zlib-1.1.4
> > > >
> > > > Here's the system info:
> > > >
> > > > System Configuration:  Sun Microsystems  sun4u Sun Fire 3800
> > > > System clock frequency: 150 MHz
> > > > Memory size: 8192 Megabytes
> > > >
> > > > ========================= CPUs ===============================================
> > > >
> > > >             Port  Run    E$   CPU      CPU
> > > > FRU Name     ID   MHz    MB   Impl.    Mask
> > > > ----------  ----  ----  ----  -------  ----
> > > > /N0/SB0/P0    0    750   8.0  US-III   3.4
> > > > /N0/SB0/P1    1    750   8.0  US-III   3.4
> > > > /N0/SB0/P2    2    750   8.0  US-III   3.4
> > > > /N0/SB0/P3    3    750   8.0  US-III   3.4
> > > > /N0/SB2/P0    8    750   8.0  US-III   3.4
> > > > /N0/SB2/P1    9    750   8.0  US-III   3.4
> > > > /N0/SB2/P2   10    750   8.0  US-III   3.4
> > > > /N0/SB2/P3   11    750   8.0  US-III   3.4
> > > >
> > > > ========================= Memory Configuration ===============================
> > > >
> > > >                      Logical  Logical  Logical
> > > >                Port  Bank     Bank     Bank         DIMM    Interleave  Interleave
> > > > FRU Name        ID   Num      Size     Status       Size    Factor      Segment
> > > > -------------  ----  ----     ------   -----------  ------  ----------  ----------
> > > > /N0/SB0/P0/B0    0    0       512MB    pass          256MB   8-way       0
> > > > /N0/SB0/P0/B0    0    2       512MB    pass          256MB   8-way       0
> > > > /N0/SB0/P1/B0    1    0       512MB    pass          256MB   8-way       0
> > > > /N0/SB0/P1/B0    1    2       512MB    pass          256MB   8-way       0
> > > > /N0/SB0/P2/B0    2    0       512MB    pass          256MB   8-way       0
> > > > /N0/SB0/P2/B0    2    2       512MB    pass          256MB   8-way       0
> > > > /N0/SB0/P3/B0    3    0       512MB    pass          256MB   8-way       0
> > > > /N0/SB0/P3/B0    3    2       512MB    pass          256MB   8-way       0
> > > > /N0/SB2/P0/B0    8    0       512MB    pass          256MB   8-way       1
> > > > /N0/SB2/P0/B0    8    2       512MB    pass          256MB   8-way       1
> > > > /N0/SB2/P1/B0    9    0       512MB    pass          256MB   8-way       1
> > > > /N0/SB2/P1/B0    9    2       512MB    pass          256MB   8-way       1
> > > > /N0/SB2/P2/B0   10    0       512MB    pass          256MB   8-way       1
> > > > /N0/SB2/P2/B0   10    2       512MB    pass          256MB   8-way       1
> > > > /N0/SB2/P3/B0   11    0       512MB    pass          256MB   8-way       1
> > > > /N0/SB2/P3/B0   11    2       512MB    pass          256MB   8-way       1
> > > >
> > > > ========================= IO Cards =========================
> > > >
> > > >                                 Bus  Max
> > > >             IO   Port Bus       Freq Bus  Dev,
> > > > FRU Name    Type  ID  Side Slot MHz  Freq Func State Name                              Model
> > > > ----------  ---- ---- ---- ---- ---- ---- ---- ----- --------------------------------  ----------------------
> > > > /N0/IB6/P0  cPCI  24   B    2    33   33  1,0  ok    pci-pci1011,46.1/pci108e,1000     pci-bridge
> > > > /N0/IB6/P0  cPCI  24   B    2    33   33  0,0  ok    pci108e,1000-pci108e,1000.1
> > > > /N0/IB6/P0  cPCI  24   B    2    33   33  0,1  ok    SUNW,hme-pci108e,1001             SUNW,cheerio
> > > > /N0/IB6/P0  cPCI  24   B    2    33   33  4,0  ok    SUNW,isptwo-pci1077,1020/sd (blo+ QLGC,ISP1040B
> > > > /N0/IB6/P0  cPCI  24   B    3    33   33  2,0  ok    network-pci108e,abba.11           SUNW,cpci-ce
> > > > /N0/IB6/P1  cPCI  25   B    4    33   33  1,0  ok    pci-pci1011,46.1/pci108e,1000     pci-bridge
> > > > /N0/IB6/P1  cPCI  25   B    4    33   33  0,0  ok    pci108e,1000-pci108e,1000.1
> > > > /N0/IB6/P1  cPCI  25   B    4    33   33  0,1  ok    SUNW,qfe-pci108e,1001             SUNW,cpci-qfe
> > > > /N0/IB6/P1  cPCI  25   B    4    33   33  1,0  ok    pci108e,1000-pci108e,1000.1
> > > > /N0/IB6/P1  cPCI  25   B    4    33   33  1,1  ok    SUNW,qfe-pci108e,1001             SUNW,cpci-qfe
> > > > /N0/IB6/P1  cPCI  25   B    4    33   33  2,0  ok    pci108e,1000-pci108e,1000.1
> > > > /N0/IB6/P1  cPCI  25   B    4    33   33  2,1  ok    SUNW,qfe-pci108e,1001             SUNW,cpci-qfe
> > > > /N0/IB6/P1  cPCI  25   B    4    33   33  3,0  ok    pci108e,1000-pci108e,1000.1
> > > > /N0/IB6/P1  cPCI  25   B    4    33   33  3,1  ok    SUNW,qfe-pci108e,1001             SUNW,cpci-qfe
> > > > /N0/IB6/P1  cPCI  25   A    1    66   66  1,0  ok    fibre-channel-pci10df,f900.10df.+
> > > > /N0/IB8/P0  cPCI  28   B    2    33   33  1,0  ok    network-pci108e,abba.11           SUNW,cpci-ce
> > > > /N0/IB8/P1  cPCI  29   B    4    33   33  1,0  ok    pci-pci1011,46.1/pci108e,1000     pci-bridge
> > > > /N0/IB8/P1  cPCI  29   B    4    33   33  0,0  ok    pci108e,1000-pci108e,1000.1
> > > > /N0/IB8/P1  cPCI  29   B    4    33   33  0,1  ok    SUNW,qfe-pci108e,1001             SUNW,cpci-qfe
> > > > /N0/IB8/P1  cPCI  29   B    4    33   33  1,0  ok    pci108e,1000-pci108e,1000.1
> > > > /N0/IB8/P1  cPCI  29   B    4    33   33  1,1  ok    SUNW,qfe-pci108e,1001             SUNW,cpci-qfe
> > > > /N0/IB8/P1  cPCI  29   B    4    33   33  2,0  ok    pci108e,1000-pci108e,1000.1
> > > > /N0/IB8/P1  cPCI  29   B    4    33   33  2,1  ok    SUNW,qfe-pci108e,1001             SUNW,cpci-qfe
> > > > /N0/IB8/P1  cPCI  29   B    4    33   33  3,0  ok    pci108e,1000-pci108e,1000.1
> > > > /N0/IB8/P1  cPCI  29   B    4    33   33  3,1  ok    SUNW,qfe-pci108e,1001             SUNW,cpci-qfe
> > > > /N0/IB8/P1  cPCI  29   A    1    66   66  1,0  ok    fibre-channel-pci10df,f900.10df.+
> > > >
> > > > ========================= Active Boards for Domain ===========================
> > > >
> > > >           Power  Fault  HotPlug  Board
> > > > FRU Name   LED    LED     LED    Cond.
> > > > --------  -----  -----  -------  -------
> > > > /N0/SB0   on     off    off      ok
> > > > /N0/SB2   on     off    off      ok
> > > > /N0/IB6   on     off    off      ok
> > > > /N0/IB8   on     off    off      ok
> > > >
> > > > ========================= Available Boards/Slots for Domain ==================
> > > >
> > > >           Power  Fault  HotPlug  Board/Slot  Board/Slot
> > > > FRU Name   LED    LED     LED    Condition   Assigned
> > > > --------  -----  -----  -------  ----------  ----------
> > > > There are currently no Boards/Slots available to this Domain
> > > >
> > > > ========================= Hardware Failures ==================================
> > > > No Hardware failures found in System
> > > >
> > > > Need any more info?  I still have pmap, lsof, truss, cores, and additional
> > > > debug data.  Anyone have any ideas?
> > > >
> > > > Any help would be greatly appreciated.
> > > >
> > > > Thanks!
> > > >
> > > >
> > > >
> > >
> > >
> >
>