[Date Prev][Date Next] [Chronological] [Thread] [Top]

Re: slapd leaking memory on i386?



On Tue, Apr 14, 2009 at 02:55:29PM -0400, John Morrissey wrote:
> On Mon, Apr 13, 2009 at 04:28:58PM -0700, Howard Chu wrote:
> > John Morrissey wrote:
> > > 2.4.16 (with BDB4.7 + patches and the POSIX threads build options) has
> > > been stable on amd64/x86_64 for us, but I'm encountering problems with
> > > 2.4.16 on i386.
> >
> > On what OS?
> 
> Linux.
> 
> > I've been seeing similar issues on Linux, but using heap profilers / leak 
> > checkers don't turn up any results. At least in the Linux case, it 
> > appears to be general heap fragmentation caused by continual mmap/munmap 
> > of malloc'd memory blocks.
> 
> ugh. Finally caught this machine in the act. By the time I got to it, it
> had ~200 file descriptors left open (each of our consumers normally has
> 4k+), most of which were:
> 
> slapd   32667 openldap 5466u  sock        0,4            113001170 can't identify protocol
> 
> slapd accepts connections but returns no results for any operations. For
> example, strace(1) shows ldapsearch(1) connecting successfully, sending a
> bind request, and then waiting forever on the result.
> 
> Each worker thread was consuming a few percent of CPU and was in a tight
> loop failing to read from connections:
> 
> futex(0xb7c5dc4c, FUTEX_WAIT_PRIVATE, 2, NULL) = 0
> futex(0xb7c5dc4c, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
> send(4, "<167>Apr 14 16:59:34 slapd[32667]"..., 74, MSG_NOSIGNAL) = 74
>  | 00000  3c 31 36 37 3e 41 70 72  20 31 34 20 31 36 3a 35  <167>Apr  14 16:5 |
>  | 00010  39 3a 33 34 20 73 6c 61  70 64 5b 33 32 36 36 37  9:34 sla pd[32667 |
>  | 00020  5d 3a 20 63 6f 6e 6e 65  63 74 69 6f 6e 5f 72 65  ]: conne ction_re |
>  | 00030  61 64 28 34 35 35 30 29  3a 20 6e 6f 20 63 6f 6e  ad(4550) : no con |
>  | 00040  6e 65 63 74 69 6f 6e 21  0a 00                    nection! ..       |
> futex(0xb7c5dc4c, FUTEX_WAKE_PRIVATE, 1) = 0
> futex(0x895f1b4, FUTEX_WAIT_PRIVATE, 2, NULL) = 0
> futex(0x895f1b4, FUTEX_WAKE_PRIVATE, 1) = 0
> time(NULL)                              = 1239728375
> futex(0xb7c5d80c, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0xb7c5d80c, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
> futex(0xb7c5d80c, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
> futex(0xb7c5d80c, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
> futex(0xb7c5d80c, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
> futex(0xb7c5d80c, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
> futex(0xb7c5d80c, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
> stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
> futex(0xb7c5d80c, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0xb7c5dc4c, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> For reference, slapd has been emitting lots of these errors, but their
> frequency increases by more than an order of magnitude when slapd begins
> to wedge:
> 
> # sed -n '/no connection/s/^Apr 14 \(..\).*/\1/p' slapd.20090414|sort|uniq -c
> [...]
>   46916 10
>   48511 11
>   50828 12
>   53492 13
>  229907 14
> 1040360 15
>  808565 16
> 
> At this point, slapd's VSIZE had increased to ~2.6GB, which was consistent
> with /proc/PID/smaps. Most of its mappings were 1Mbyte or 2Mbytes, with a
> 1GB region for its SHM cache:
> 
> # awk '/Size/ {print $2}' smaps |sort|uniq -c|sort -rn|head
>     307 2048
>     149 1024
>      67 4
>      18 8
>      13 1020
> [...]
> 
> I saved a core and backtrace of this slapd image. I blocked off all LDAP
> connections to this host, and slapd's memory consumption seems stable for
> the moment, but is still not sending responses to any LDAP operations. Is
> there anything else I can poke at to help troubleshoot this? Howard, do
> you still think this is due to churn-related heap fragmentation?

Any response to this, Howard? slapd finally consumed enough memory on this
machine that the kernel OOM killer terminated it, but this problem is
trivial for us to reproduce (happens after a few days of slapd uptime).

john
-- 
John Morrissey          _o            /\         ----  __o
jwm@horde.net        _-< \_          /  \       ----  <  \,
www.horde.net/    __(_)/_(_)________/    \_______(_) /_(_)__