
Re: ch_malloc of 8388608 bytes failed (ITS#2270)



Update.  I applied the patch Kurt recommended, to no avail.  Once again, I
came to work this morning to my very familiar ch_malloc error.  I've
suspected all along this may have something to do with the fact that I had
built the binary using the 32 bit libraries.  But, I kind of ruled that out
because I don't (ever) get the ch_malloc errors on other 64bit Suns (280R,
for example).  It's just this one 3800 that's giving me grief.  I've played
around with the number of threads, DB_CONFIG parameters, and most of the
other obviously configurable options.  The reason there are so many threads
is that I've found that the more I allow, the longer it runs without
aborting.  This machine is configured with 8G real and 14G swap.  It has
plenty of RAM to spare.  This
problem has persisted (on this machine) since its inception.  I've stayed
current with HEAD here, and I've compiled in only BDB 4.1.25 to
reduce the dependencies while I'm troubleshooting.  Bleh ...

I've tried compiling HEAD and linking with Solaris' 64bit libraries but I'm
having issues getting it to produce a binary with gcc 2.95.3 ...  I think I
need to upgrade my compiler.  I'm really trying to avoid doing anything
radical like that until I'm sure what is causing the problem.  I haven't
completely ruled out a problem in OpenLDAP on very large machines like this
(12 CPUs and 20G+ available memory) and I'm wondering if the OS is returning (what it
considers) a valid pointer but it is somehow being considered out of range
in the code.  On the other hand, it could be the compiler or a bug in
Solaris on this architecture.  I've forwarded this issue (and others
directly related to *only* 3800s) to Sun and they assure me I am at the
latest revision of patches and that these are "3rd party application" issues
...
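
Since so much hinges on whether slapd really is a 32 bit or a 64 bit binary,
here is a minimal sketch (Python, purely illustrative) of checking that
directly from the file.  The function names and the example path are
hypothetical; the only facts it relies on are the ELF magic number and the
EI_CLASS byte at offset 4 of the header (1 = 32 bit, 2 = 64 bit).

```python
# Sketch: report whether an ELF binary is 32 bit or 64 bit.
# Function names and the example path below are illustrative assumptions.

ELF_MAGIC = b"\x7fELF"          # first four bytes of every ELF file
EI_CLASS_OFFSET = 4             # index of the class byte in e_ident
ELF_CLASSES = {1: "32-bit", 2: "64-bit"}


def elf_class(header: bytes) -> str:
    """Return "32-bit", "64-bit" or "unknown" for the given header bytes."""
    if len(header) <= EI_CLASS_OFFSET or header[:4] != ELF_MAGIC:
        raise ValueError("not an ELF file")
    return ELF_CLASSES.get(header[EI_CLASS_OFFSET], "unknown")


def elf_class_of_file(path: str) -> str:
    """Read just enough of the file at `path` to classify it."""
    with open(path, "rb") as f:
        return elf_class(f.read(EI_CLASS_OFFSET + 1))


# Example (hypothetical install path):
#   elf_class_of_file("/usr/local/libexec/slapd")
```

The stock file(1) command should report the same information; the sketch just
shows where that answer comes from.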

I've compiled slapd a variety of ways.  Building with or without mtmalloc,
openssl, sasl, kerberos, zlib, etc. still produces the ch_malloc abort
message.  I keep wondering about this one library that only seems to get
linked in on
the 3800.  That is /usr/platform/sun4u-us3/lib/libc_psr.so.1 and I'm not
really sure what that does.  I've read some stuff on sunsolve about other
architectures having problems with their counterpart
(/usr/platform/Ultra-80/lib/libc_psr.so.1, for example) and some people have
suggested just renaming this file so it doesn't get loaded on startup.  I
may try that, too, just to see what happens, if nothing else.

Today, I plan on getting a more detailed bt full on the process and possibly
stepping through a caught failure (it happens about every hour during peak
usage) to see if I can determine what function is aborting.  Maybe that'll
shed some light ....

Still determined,

Joseph


----- Original Message -----
From: <hyc@highlandsun.com>
To: <openldap-its@OpenLDAP.org>
Sent: Saturday, February 15, 2003 8:14 PM
Subject: RE: ch_malloc of 8388608 bytes failed (ITS#2270)


> When ch_malloc fails it calls abort() to kill the process. In your stack
> back trace, there are 232 threads but none of them is in the abort()
> routine, which I find very odd. Regardless, your problem is not due to any
> bug in OpenLDAP. The fact is, even though you have a 64 bit machine, you
> have built a 32 bit binary. So, it is limited to a 32 bit address space,
> and in Solaris, not all of that 32 bit space is available for user memory,
> only about half of it (31 bits, 2GB) is available. The default size of a
> thread stack has grown in OpenLDAP 2.1, but even in OpenLDAP 2.0 it was 2MB
> per thread. With the current 4MB per thread, times 232 threads, you have
> used 928MB of RAM. You are also using 1GB for your BDB cache. This alone
> (1.9GB) leaves practically nothing left for slapd to run with.
>
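
To make the numbers above concrete, here is a rough sketch of the budget
(Python, illustrative only).  It assumes the figures quoted above (2GB of
user address space, 4MB per thread stack, 232 threads) plus the 1G DB cache
from the original report, and it ignores everything else mapped into the
process (the binary itself, shared libraries, the heap).  Variable and
function names are made up for the example.

```python
# Sketch: address-space budget for a 32 bit slapd, using the quoted figures.
MIB = 1024 * 1024
GIB = 1024 * MIB

USER_ADDRESS_SPACE = 2 * GIB   # "about half of it (31 bits, 2GB)"
THREADS = 232                  # thread count seen in the stack back trace
STACK_PER_THREAD = 4 * MIB     # per-thread stack size quoted for OpenLDAP 2.1
BDB_CACHE = 1 * GIB            # DB cache size set via DB_CONFIG
FAILED_REQUEST = 8388608       # size from the ch_malloc error message


def address_space_budget(total, n_threads, stack_bytes, cache_bytes):
    """Return (bytes reserved for stacks, bytes left for everything else)."""
    stacks = n_threads * stack_bytes
    remaining = total - stacks - cache_bytes
    return stacks, remaining


stacks, remaining = address_space_budget(
    USER_ADDRESS_SPACE, THREADS, STACK_PER_THREAD, BDB_CACHE
)

print("thread stacks:  %d MB" % (stacks // MIB))          # 928 MB
print("left over:      %d MB" % (remaining // MIB))       # 96 MB
print("failed request: %d MB" % (FAILED_REQUEST // MIB))  # 8 MB
```

If those figures hold, roughly 96MB of address space is left for everything
else slapd needs, so a single 8MB request failing would not be surprising.
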
> You should decrease the maximum number of threads; creating more beyond a
> certain limit does not enhance concurrency anyway. You can increase your
> available address space by building as a pure 64 bit executable but that
> doesn't change the fact that having too many threads will slow you down.
>
>   -- Howard Chu
>   Chief Architect, Symas Corp.       Director, Highland Sun
>   http://www.symas.com               http://highlandsun.com/hyc
>   Symas: Premier OpenSource Development and Support
>
> > -----Original Message-----
> > From: owner-openldap-bugs@OpenLDAP.org
> > [mailto:owner-openldap-bugs@OpenLDAP.org]On Behalf Of
> > joseph.tingiris@cox.net
> > Sent: Wednesday, January 15, 2003 9:27 AM
> > To: openldap-its@OpenLDAP.org
> > Subject: ch_malloc of 8388608 bytes failed (ITS#2270)
> >
> >
> > Full_Name: Joseph Tingiris
> > Version: 2.1.12
> > OS: Solaris 8
> > URL: ftp://ftp.openldap.org/incoming/
> > Submission from: (NULL) (206.157.224.254)
> >
> >
> > I've read some of the other folks, using Solaris, having similar problems
> > and I've tried almost everything I could find short of actually modifying
> > ch_malloc.c myself. It appears to be specific to multiprocessor (3+) Sun
> > installations.  The binaries have been compiled with -lmtmalloc and the
> > latest versions of all Openldap dependent packages are used.  The primary
> > authentication mechanism is cleartext.
> >
> > Some key points:
> >
> > * This server is a replica.
> > * BDB-4.1 with 3.4 million DNs, 6 indexes (eq,sub)
> > * process stack 32k (plimit -s), DB cache 1G (via DB_CONFIG)
> > * this problem has persisted, on the same hardware, since openldap 2.0.12
> > * slapd fails at least once a day with the same error every time,
> >   "ch_malloc of 8388608 bytes failed"; it's always the same amount of bytes
> > * it appears to happen during a wildcard search, although it may be during
> >   some type of replication event
> >
> > Here is some info on the build environment:
> >
> > Application - OpenLdap and Dependencies:
> >
> > openldap-2.1.12
> > openssl-0.9.7
> > krb5-1.2.7
> > cyrus-sasl-2.1.10
> > db-4.1.25
> >
> > Compiler/Dev Tools:
> >
> > autoconf-2.57
> > automake-1.7.2
> > binutils-2.11.2
> > bison-1.75
> > fileutils-4.1
> > gawk-3.1.0
> > gcc-2.95.3
> > gdb-5.0
> > gdbm-1.8.0
> > gettext-0.10.37
> > glib-1.2.10
> > gtk+-1.2.10
> > libgcc-3.2
> > libiconv-1.6.1
> > libnet-1.0.2a
> > libpcap-0.7.1
> > libtool-1.4
> > m4-1.4
> > make-3.80
> > ncurses-5.2
> > slang-1.4.4
> > tcl-8.4.1
> > termcap-1.3
> > textutils-2.0
> > tk-8.4.1
> > zlib-1.1.4
> >
> > Here's the system info:
> >
> > System Configuration:  Sun Microsystems  sun4u Sun Fire 3800
> > System clock frequency: 150 MHz
> > Memory size: 8192 Megabytes
> >
> > ========================= CPUs ===============================================
> >
> >             Port  Run    E$   CPU      CPU
> > FRU Name     ID   MHz    MB   Impl.    Mask
> > ----------  ----  ----  ----  -------  ----
> > /N0/SB0/P0    0    750   8.0  US-III   3.4
> > /N0/SB0/P1    1    750   8.0  US-III   3.4
> > /N0/SB0/P2    2    750   8.0  US-III   3.4
> > /N0/SB0/P3    3    750   8.0  US-III   3.4
> > /N0/SB2/P0    8    750   8.0  US-III   3.4
> > /N0/SB2/P1    9    750   8.0  US-III   3.4
> > /N0/SB2/P2   10    750   8.0  US-III   3.4
> > /N0/SB2/P3   11    750   8.0  US-III   3.4
> >
> > ========================= Memory Configuration ===============================
> >
> >                      Logical  Logical  Logical
> >                Port  Bank     Bank     Bank         DIMM    Interleave  Interleave
> > FRU Name        ID   Num      Size     Status       Size    Factor      Segment
> > -------------  ----  ----     ------   -----------  ------  ----------  ----------
> > /N0/SB0/P0/B0    0    0       512MB    pass          256MB   8-way       0
> > /N0/SB0/P0/B0    0    2       512MB    pass          256MB   8-way       0
> > /N0/SB0/P1/B0    1    0       512MB    pass          256MB   8-way       0
> > /N0/SB0/P1/B0    1    2       512MB    pass          256MB   8-way       0
> > /N0/SB0/P2/B0    2    0       512MB    pass          256MB   8-way       0
> > /N0/SB0/P2/B0    2    2       512MB    pass          256MB   8-way       0
> > /N0/SB0/P3/B0    3    0       512MB    pass          256MB   8-way       0
> > /N0/SB0/P3/B0    3    2       512MB    pass          256MB   8-way       0
> > /N0/SB2/P0/B0    8    0       512MB    pass          256MB   8-way       1
> > /N0/SB2/P0/B0    8    2       512MB    pass          256MB   8-way       1
> > /N0/SB2/P1/B0    9    0       512MB    pass          256MB   8-way       1
> > /N0/SB2/P1/B0    9    2       512MB    pass          256MB   8-way       1
> > /N0/SB2/P2/B0   10    0       512MB    pass          256MB   8-way       1
> > /N0/SB2/P2/B0   10    2       512MB    pass          256MB   8-way       1
> > /N0/SB2/P3/B0   11    0       512MB    pass          256MB   8-way       1
> > /N0/SB2/P3/B0   11    2       512MB    pass          256MB   8-way       1
> >
> > ========================= IO Cards =========================
> >
> >                                 Bus  Max
> >             IO   Port Bus       Freq Bus  Dev,
> > FRU Name    Type  ID  Side Slot MHz  Freq Func State Name                              Model
> > ----------  ---- ---- ---- ---- ---- ---- ---- ----- --------------------------------  ----------------------
> > /N0/IB6/P0  cPCI  24   B    2    33   33  1,0  ok    pci-pci1011,46.1/pci108e,1000     pci-bridge
> > /N0/IB6/P0  cPCI  24   B    2    33   33  0,0  ok    pci108e,1000-pci108e,1000.1
> > /N0/IB6/P0  cPCI  24   B    2    33   33  0,1  ok    SUNW,hme-pci108e,1001             SUNW,cheerio
> > /N0/IB6/P0  cPCI  24   B    2    33   33  4,0  ok    SUNW,isptwo-pci1077,1020/sd (blo+ QLGC,ISP1040B
> > /N0/IB6/P0  cPCI  24   B    3    33   33  2,0  ok    network-pci108e,abba.11           SUNW,cpci-ce
> > /N0/IB6/P1  cPCI  25   B    4    33   33  1,0  ok    pci-pci1011,46.1/pci108e,1000     pci-bridge
> > /N0/IB6/P1  cPCI  25   B    4    33   33  0,0  ok    pci108e,1000-pci108e,1000.1
> > /N0/IB6/P1  cPCI  25   B    4    33   33  0,1  ok    SUNW,qfe-pci108e,1001             SUNW,cpci-qfe
> > /N0/IB6/P1  cPCI  25   B    4    33   33  1,0  ok    pci108e,1000-pci108e,1000.1
> > /N0/IB6/P1  cPCI  25   B    4    33   33  1,1  ok    SUNW,qfe-pci108e,1001             SUNW,cpci-qfe
> > /N0/IB6/P1  cPCI  25   B    4    33   33  2,0  ok    pci108e,1000-pci108e,1000.1
> > /N0/IB6/P1  cPCI  25   B    4    33   33  2,1  ok    SUNW,qfe-pci108e,1001             SUNW,cpci-qfe
> > /N0/IB6/P1  cPCI  25   B    4    33   33  3,0  ok    pci108e,1000-pci108e,1000.1
> > /N0/IB6/P1  cPCI  25   B    4    33   33  3,1  ok    SUNW,qfe-pci108e,1001             SUNW,cpci-qfe
> > /N0/IB6/P1  cPCI  25   A    1    66   66  1,0  ok    fibre-channel-pci10df,f900.10df.+
> > /N0/IB8/P0  cPCI  28   B    2    33   33  1,0  ok    network-pci108e,abba.11           SUNW,cpci-ce
> > /N0/IB8/P1  cPCI  29   B    4    33   33  1,0  ok    pci-pci1011,46.1/pci108e,1000     pci-bridge
> > /N0/IB8/P1  cPCI  29   B    4    33   33  0,0  ok    pci108e,1000-pci108e,1000.1
> > /N0/IB8/P1  cPCI  29   B    4    33   33  0,1  ok    SUNW,qfe-pci108e,1001             SUNW,cpci-qfe
> > /N0/IB8/P1  cPCI  29   B    4    33   33  1,0  ok    pci108e,1000-pci108e,1000.1
> > /N0/IB8/P1  cPCI  29   B    4    33   33  1,1  ok    SUNW,qfe-pci108e,1001             SUNW,cpci-qfe
> > /N0/IB8/P1  cPCI  29   B    4    33   33  2,0  ok    pci108e,1000-pci108e,1000.1
> > /N0/IB8/P1  cPCI  29   B    4    33   33  2,1  ok    SUNW,qfe-pci108e,1001             SUNW,cpci-qfe
> > /N0/IB8/P1  cPCI  29   B    4    33   33  3,0  ok    pci108e,1000-pci108e,1000.1
> > /N0/IB8/P1  cPCI  29   B    4    33   33  3,1  ok    SUNW,qfe-pci108e,1001             SUNW,cpci-qfe
> > /N0/IB8/P1  cPCI  29   A    1    66   66  1,0  ok    fibre-channel-pci10df,f900.10df.+
> >
> > ========================= Active Boards for Domain ===========================
> >
> >           Power  Fault  HotPlug  Board
> > FRU Name   LED    LED     LED    Cond.
> > --------  -----  -----  -------  -------
> > /N0/SB0   on     off    off      ok
> > /N0/SB2   on     off    off      ok
> > /N0/IB6   on     off    off      ok
> > /N0/IB8   on     off    off      ok
> >
> > ========================= Available Boards/Slots for Domain ==================
> >
> >           Power  Fault  HotPlug  Board/Slot  Board/Slot
> > FRU Name   LED    LED     LED    Condition   Assigned
> > --------  -----  -----  -------  ----------  ----------
> > There are currently no Boards/Slots available to this Domain
> >
> > ========================= Hardware Failures ==================================
> > No Hardware failures found in System
> >
> > Need any more info?  I still have pmap, lsof, truss, cores, and additional
> > debug data.  Anyone have any ideas?
> >
> > Any help would be greatly appreciated.
> >
> > Thanks!
> >
> >
> >
>
>