Issue 7292 - Memory leak with MMR + delta syncrepl
Summary: Memory leak with MMR + delta syncrepl
Status: VERIFIED FIXED
Alias: None
Product: OpenLDAP
Classification: Unclassified
Component: slapd
Version: 2.4.31
Hardware: All
OS: All
Importance: --- normal
Target Milestone: ---
Assignee: OpenLDAP project
 
Reported: 2012-06-06 14:33 UTC by brandon.hume@dal.ca
Modified: 2014-08-01 21:04 UTC

Description brandon.hume@dal.ca 2012-06-06 14:33:16 UTC
Full_Name: Brandon Hume
Version: 2.4.31
OS: RHEL EL6.1, kernel 2.6.32-131.12.1.el6.x86_64
URL: http://den.bofh.ca/~hume/ol-2.4.31_memleak.tar.gz
Submission from: (NULL) (2001:410:a010:2:223:aeff:fe74:400e)


OpenLDAP 2.4.31 compiled in 64-bit with BerkeleyDB 5.3.15 appears to exhibit a
memory leak while replicating the full database from another node in MMR.

A two-node MMR configuration has been set up.  Node 1 is fully populated with
data, approximately 338k DNs, which occupies around 1G on-disk (including bdb
__db.* and log.* files).  Node 1 is brought up and on a 64-bit system occupies
around 5.5G VM and 4.7G RSS.

Node 2 is initialized with a copy of cn=config (slapcat/slapadd method) and
brought up with an empty database to begin replication.  Over the course of the
replication, node 2's slapd will grow continuously.  On the one occasion it
managed to "finish" the replication (with the test database), node 2's slapd
occupied 14G VM and approximately 6G RSS.

I've included a link to the test kit I put together.  This includes a fairly
large, anonymized database, as well as a simplified copy of the configuration.
I've left in the sendmail and misc schemas but removed irrelevant local schemas.
Also included are the DB_CONFIGs used for the main database and accesslog, and
the configuration scripts used for compiling both bdb and OpenLDAP.

Steps to reproduce (a consolidated shell sketch follows this list):
    - Compile and install bdb and OpenLDAP with the same options as in the
      config-db.sh and config-ldap.sh scripts.
    - Initialize the configuration on nodes 1 and 2 using
      "slapadd -F etc/slapd.d -b cn=config -l slapd-conf.ldif".
    - Initialize the main DB on node 1 using "slapadd -l test_dit.ldif".
    - Start node 1.  The slapd process should stabilize at around 5G VM in use.
    - Start node 2 and allow it to begin replication.
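
A minimal sketch of the above as shell commands (the LDIF and config paths are
the ones named in the steps; the slapd binary path is an assumption for a
default-prefix build):

    # Both nodes: load the shared cn=config
    slapadd -F etc/slapd.d -b cn=config -l slapd-conf.ldif

    # Node 1 only: bulk-load the test DIT into the main database
    slapadd -l test_dit.ldif

    # Start node 1 first; once it stabilizes at ~5G VM, start node 2
    # (binary path is a guess, adjust for your install prefix)
    /usr/local/libexec/slapd -F etc/slapd.d -h "ldap:///"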

I've tested with node 2 on both RHEL6 and on Solaris 10.  In both cases, node
2's slapd became extremely bloated over the course of several hours.  Only the
Solaris SPARC box was able to complete the replication, stabilizing at 14G VM
used.  The Redhat x86 box continued to grow far beyond the 16G swap limit and
was killed by the OS.

I've attempted to use the Solaris libumem tools to trace the memory leak, using
gcore on the running process and "::findleaks -dv" within mdb running on the
core.  I've included the report generated in case it provides any useful
information as "mdb_findleaks_analysis.txt".  Disregard if you wish.

(I apologize for the large test LDIF.  I wanted something to definitively show
the problem so didn't want to trim it too much...)
Comment 1 Quanah Gibson-Mount 2012-06-06 15:57:49 UTC
--On Wednesday, June 06, 2012 2:33 PM +0000 brandon.hume@dal.ca wrote:

> Full_Name: Brandon Hume
> Version: 2.4.31
> OS: RHEL EL6.1, kernel 2.6.32-131.12.1.el6.x86_64
> URL: http://den.bofh.ca/~hume/ol-2.4.31_memleak.tar.gz
> Submission from: (NULL) (2001:410:a010:2:223:aeff:fe74:400e)
>
>
> OpenLDAP 2.4.31 compiled in 64-bit with BerkeleyDB 5.3.15 appears to
> exhibit a memory leak while replicating the full database from another
> node in MMR.

There are definite errors in your cn=config configuration.

a) You have multiple databases numbered "1":
dn: olcDatabase={1}hdb,cn=config
dn: olcDatabase={1}monitor,cn=config

b) Syncprov overlay for accesslog:
dn: olcOverlay={0}syncprov,olcDatabase={1}hdb,cn=config
Remove the checkpoint and sessionlog settings.

c) There should be no sessionlog on the primary DB with delta-syncrepl MMR:
dn: olcOverlay={0}syncprov,olcDatabase={2}hdb,cn=config
Remove olcSpSessionlog: 10000

These may not be causing the issue you are seeing, but they should be fixed 
and then the setup retested.  Of particular concern to me is item (a).  I 
would make cn=monitor be olcDatabase {3}.
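
For illustration, items (b) and (c) as ldapmodify LDIF against cn=config
(entry DNs taken from the posted config; attribute names per
slapo-syncprov(5)).  Item (a) generally means regenerating the config via
slapcat/slapadd, since database entries cannot simply be renumbered in place:

    # (b) accesslog syncprov: drop checkpoint and sessionlog
    dn: olcOverlay={0}syncprov,olcDatabase={1}hdb,cn=config
    changetype: modify
    delete: olcSpCheckpoint
    -
    delete: olcSpSessionlog

    # (c) primary DB syncprov: drop the sessionlog
    dn: olcOverlay={0}syncprov,olcDatabase={2}hdb,cn=config
    changetype: modify
    delete: olcSpSessionlog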

--Quanah

--

Quanah Gibson-Mount
Sr. Member of Technical Staff
Zimbra, Inc
A Division of VMware, Inc.
--------------------
Zimbra ::  the leader in open source messaging and collaboration

Comment 2 brandon.hume@dal.ca 2012-06-06 17:44:01 UTC
  On 06/ 6/12 12:57 PM, Quanah Gibson-Mount wrote:
>
> These may not be causing the issue you are seeing, but they should be 
> fixed and then the setup retested.  Of particular concern to me is 
> item (a).  I would make cn=monitor be olcDatabase {3}.

Done, thanks for pointing out the problems.  I think I introduced them 
accidentally while backend-hopping during testing, but I'll check my 
prod setup as well.

I've made the changes and retested.  The new node is still replicating, 
but after 50 cpu-minutes the process is at 10.2G and still going.  I 
believe the leak is still present.

Comment 3 Howard Chu 2012-06-07 19:16:06 UTC
changed state Open to Active
Comment 4 Howard Chu 2012-06-08 14:35:56 UTC
changed notes
changed state Active to Test
moved from Incoming to Software Bugs
Comment 5 Howard Chu 2012-06-08 14:43:43 UTC
brandon.hume@dal.ca wrote:
> Full_Name: Brandon Hume
> Version: 2.4.31
> OS: RHEL EL6.1, kernel 2.6.32-131.12.1.el6.x86_64
>
> OpenLDAP 2.4.31 compiled in 64-bit with BerkeleyDB 5.3.15 appears to exhibit a
> memory leak while replicating the full database from another node in MMR.
> [remainder of the original report, quoted in full in the Description above]

Thanks for the detailed report; your test revealed several bugs. The leaks are
now fixed in git master.

There's still another issue where node 2 starts sending the received changes 
back to node 1, even though they came from node 1 originally. This is 
triggered because most of your entries were created with sid=0, and syncprov 
doesn't know that they actually originated from node 1 (sid=1). That wastes a 
lot of CPU/network while it sends over a bunch of data that isn't needed, but 
that's all a separate issue from the memory leaks.
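
For context, the sid embedded in each entryCSN comes from the serverID
configuration. A sketch of the MMR setting that ties CSNs to each node (the
URLs below are placeholders, not from the reporter's config); entries loaded
with slapadd before a serverID is in effect typically carry sid=000:

    # Each node matches its own listener URL and uses that sid for
    # CSNs it generates.
    dn: cn=config
    changetype: modify
    add: olcServerID
    olcServerID: 1 ldap://node1.example.com
    olcServerID: 2 ldap://node2.example.com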

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/


Comment 6 Quanah Gibson-Mount 2012-06-08 22:00:32 UTC
changed notes
changed state Test to Release
Comment 7 Quanah Gibson-Mount 2012-08-17 01:37:30 UTC
changed notes
changed state Release to Closed
Comment 8 OpenLDAP project 2014-08-01 21:04:43 UTC
fixed in master
fixed in RE24