
Re: (ITS#7292) Memory leak with MMR + delta syncrepl



brandon.hume@dal.ca wrote:
> Full_Name: Brandon Hume
> Version: 2.4.31
> OS: RHEL 6.1, kernel 2.6.32-131.12.1.el6.x86_64
> URL: http://den.bofh.ca/~hume/ol-2.4.31_memleak.tar.gz
> Submission from: (NULL) (2001:410:a010:2:223:aeff:fe74:400e)
>
>
> OpenLDAP 2.4.31 compiled in 64-bit with BerkeleyDB 5.3.15 appears to exhibit a
> memory leak while replicating the full database from another node in MMR.
>
> A two-node MMR configuration has been set up.  Node 1 is fully populated with
> data, approximately 338k DNs, which occupy around 1G on disk (including bdb
> __db.* and log.* files).  When brought up on a 64-bit system, node 1 occupies
> around 5.5G VM and 4.7G RSS.
>
> Node 2 is initialized with a copy of cn=config (slapcat/slapadd method) and
> brought up with an empty database to begin replication.  Over the course of the
> replication, node 2's slapd will grow continuously.  On the one occasion it
> managed to "finish" the replication (with the test database), node 2's slapd
> occupied 14G VM and approximately 6G RSS.
>
> I've included a link to the test kit I put together.  It includes a fairly
> large, anonymized database, as well as a simplified copy of the configuration.
> I've left in the sendmail and misc schemas but removed irrelevant local schemas.
> Also included are the DB_CONFIGs used for the main database and accesslog, and
> the configuration scripts used to compile both bdb and OpenLDAP.
>
> Steps to reproduce:
>      - Compile and install bdb and OpenLDAP using the same options as in the
> config-db.sh and config-ldap.sh scripts.
>      - Initialize configuration on node 1 and 2 using "slapadd -F etc/slapd.d -b
> cn=config -l slapd-conf.ldif".
>      - Initialize main DB on node 1 using "slapadd -l test_dit.ldif"
>      - Start node 1.  The slapd process should stabilize at around 5G VM in use.
>      - Start node 2 and allow it to begin replication.
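(For anyone following along without the test kit: a two-node MMR + delta-syncrepl
setup like the one described is typically wired up with cn=config attributes
along these lines. The hostnames, suffix, and credentials below are placeholders
rather than the test kit's actual values, and the olcSyncrepl values are wrapped
for readability:

    dn: cn=config
    olcServerID: 1 ldap://node1.example.com
    olcServerID: 2 ldap://node2.example.com

    dn: olcDatabase={2}bdb,cn=config
    olcSuffix: dc=example,dc=com
    olcMirrorMode: TRUE
    olcSyncrepl: rid=001 provider=ldap://node1.example.com
      bindmethod=simple binddn="cn=repl,dc=example,dc=com" credentials=secret
      searchbase="dc=example,dc=com" type=refreshAndPersist retry="5 +"
      syncdata=accesslog logbase="cn=accesslog"
      logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"
    olcSyncrepl: rid=002 provider=ldap://node2.example.com ...

with the syncprov overlay configured on both the main database and the accesslog
database, so that both nodes consume each other's changes through the accesslog.)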
>
> I've tested with node 2 on both RHEL6 and on Solaris 10.  In both cases, node
> 2's slapd became extremely bloated over the course of several hours.  Only the
> Solaris SPARC box was able to complete the replication, stabilizing at 14G VM
> used.  The Red Hat x86 box continued to grow far beyond the 16G swap limit and
> was killed by the OS.
>
> I've attempted to use the Solaris libumem tools to trace the memory leak, using
> gcore on the running process and "::findleaks -dv" within mdb running on the
> core.  I've included the report it generated, as "mdb_findleaks_analysis.txt",
> in case it provides any useful information.  Disregard if you wish.
>
> (I apologize for the large test LDIF.  I wanted something to definitively show
> the problem, so I didn't want to trim it too much...)

Thanks for the detailed report; your test revealed several bugs. The leaks are 
now fixed in git master.

There's still another issue where node 2 starts sending the received changes 
back to node 1, even though they originally came from node 1. This is 
triggered because most of your entries were created with sid=0, so syncprov 
doesn't know that they actually originated from node 1 (sid=1). That wastes a 
lot of CPU and network bandwidth sending data that isn't needed, but it's a 
separate issue from the memory leaks.
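To illustrate (the CSN values below are made up, but the format is real): an
entry loaded with a plain slapadd carries an entryCSN whose serverID field is
000, e.g.

    entryCSN: 20120403123456.123456Z#000000#000#000000

whereas a write performed through slapd on node 1 is stamped with that node's
sid:

    entryCSN: 20120403123456.123456Z#000000#001#000000

Loading the initial database with "slapadd -S 1" (which sets the server ID used
in generated entryCSNs) should avoid this.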

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/