Full_Name: Brandon Hume
Version: 2.4.31
OS: RHEL EL6.1, kernel 2.6.32-131.12.1.el6.x86_64
URL: http://den.bofh.ca/~hume/ol-2.4.31_memleak.tar.gz
Submission from: (NULL) (2001:410:a010:2:223:aeff:fe74:400e)


OpenLDAP 2.4.31, compiled 64-bit with BerkeleyDB 5.3.15, appears to exhibit
a memory leak while replicating the full database from another node in MMR.

A two-node MMR configuration has been set up. Node 1 is fully populated with
data, approximately 338k DNs, which occupies around 1G on disk (including the
bdb __db.* and log.* files). Node 1 is brought up and, on a 64-bit system,
occupies around 5.5G VM and 4.7G RSS.

Node 2 is initialized with a copy of cn=config (slapcat/slapadd method) and
brought up with an empty database to begin replication. Over the course of
the replication, node 2's slapd grows continuously. On the one occasion it
managed to "finish" the replication (with the test database), node 2's slapd
occupied 14G VM and approximately 6G RSS.

I've included a link to the test kit I put together. It contains a fairly
large, anonymized database as well as a simplified copy of the configuration.
I've left in the sendmail and misc schemas but removed irrelevant local
schemas. Also included are the DB_CONFIGs used for the main database and
accesslog, and the configure scripts used for compiling both bdb and OpenLDAP.

Steps to reproduce:
- Compile and install bdb and OpenLDAP with the same options as in the
  config-db.sh and config-ldap.sh scripts.
- Initialize the configuration on nodes 1 and 2 using
  "slapadd -F etc/slapd.d -b cn=config -l slapd-conf.ldif".
- Initialize the main DB on node 1 using "slapadd -l test_dit.ldif".
- Start node 1. The slapd process should stabilize at around 5G VM in use.
- Start node 2 and allow it to begin replication.

I've tested with node 2 on both RHEL6 and Solaris 10. In both cases, node 2's
slapd became extremely bloated over the course of several hours.
Only the Solaris SPARC box was able to complete the replication, stabilizing
at 14G VM used. The Red Hat x86 box continued to grow far beyond the 16G swap
limit and was killed by the OS.

I've attempted to trace the memory leak with the Solaris libumem tools, using
gcore on the running process and "::findleaks -dv" within mdb running against
the core. I've included the generated report as "mdb_findleaks_analysis.txt"
in case it provides any useful information. Disregard it if you wish.

(I apologize for the large test LDIF. I wanted something that definitively
shows the problem, so I didn't want to trim it too much...)
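For orientation, the reproduction steps above might be scripted roughly as
follows. This is a sketch, not part of the test kit: the relative paths and
the slapd listener URL are assumptions; adjust them to the local install.

```shell
#!/bin/sh
# Sketch of the reproduction steps from the report (paths are assumptions).
set -e

# Both nodes: load the shared configuration into slapd.d.
slapadd -F etc/slapd.d -b cn=config -l slapd-conf.ldif

# Node 1 only: load the ~338k-DN test DIT into the main database.
slapadd -l test_dit.ldif

# Start slapd on each node; node 2 starts with an empty main database
# and begins pulling the full DB from node 1 via syncrepl.
slapd -F etc/slapd.d -h 'ldap:///'
```

Node 2 runs the same script minus the test_dit.ldif step, so its first
sync is a full refresh of the whole database.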
--On Wednesday, June 06, 2012 2:33 PM +0000 brandon.hume@dal.ca wrote:

> Full_Name: Brandon Hume
> Version: 2.4.31
> OS: RHEL EL6.1, kernel 2.6.32-131.12.1.el6.x86_64
> URL: http://den.bofh.ca/~hume/ol-2.4.31_memleak.tar.gz
> Submission from: (NULL) (2001:410:a010:2:223:aeff:fe74:400e)
>
> OpenLDAP 2.4.31 compiled in 64-bit with BerkeleyDB 5.3.15 appears to
> exhibit a memory leak while replicating the full database from another
> node in MMR.

There are definite errors in your cn=config configuration.

a) You have multiple databases numbered "1":

     dn: olcDatabase={1}hdb,cn=config
     dn: olcDatabase={1}monitor,cn=config

b) Syncprov overlay for accesslog:

     dn: olcOverlay={0}syncprov,olcDatabase={1}hdb,cn=config

   Remove the checkpoint and sessionlog settings.

c) There should be no sessionlog on the primary DB with delta-syncrepl MMR:

     dn: olcOverlay={0}syncprov,olcDatabase={2}hdb,cn=config

   Remove olcSpSessionlog: 10000

These may not be causing the issue you are seeing, but they should be fixed
and then the setup retested. Of particular concern to me is item (a). I would
make cn=monitor be olcDatabase {3}.

--Quanah

--
Quanah Gibson-Mount
Sr. Member of Technical Staff
Zimbra, Inc
A Division of VMware, Inc.
--------------------
Zimbra ::  the leader in open source messaging and collaboration
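Items (b) and (c) could be applied online with ldapmodify; a hedged sketch of
the LDIF follows. The olcSpCheckpoint delete in the first change is an
assumption about which "checkpoint setting" is present — confirm the actual
attributes on each overlay entry before applying.

```ldif
# Sketch only: strip checkpoint/sessionlog from the accesslog syncprov (b)
dn: olcOverlay={0}syncprov,olcDatabase={1}hdb,cn=config
changetype: modify
delete: olcSpCheckpoint
-
delete: olcSpSessionlog

# ...and the sessionlog from the primary DB's syncprov (c)
dn: olcOverlay={0}syncprov,olcDatabase={2}hdb,cn=config
changetype: modify
delete: olcSpSessionlog
```

Renumbering olcDatabase={1}monitor to {3} for item (a) is usually easiest
done by editing the slapd.d files offline, since renames under cn=config are
awkward to do over the wire.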
On 06/ 6/12 12:57 PM, Quanah Gibson-Mount wrote:
>
> These may not be causing the issue you are seeing, but they should be
> fixed and then the setup retested.  Of particular concern to me is
> item (a).  I would make cn=monitor be olcDatabase {3}.

Done; thanks for pointing out the problems. I think I introduced them
accidentally while backend-hopping during testing, but I'll check my prod
setup as well.

I've made the changes and retested. The new node is still replicating, but
after 50 CPU-minutes the process is at 10.2G and still growing. I believe the
leak is still present.
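Growth numbers like "10.2G after 50 CPU-minutes" can be tracked during a
retest by sampling the process with ps; a minimal sketch (the one-minute
interval in the commented loop is arbitrary):

```shell
# Print VSZ and RSS (both in KB) for a given PID, as reported by ps.
mem_kb() {
    ps -o vsz=,rss= -p "$1"
}

# Example: sample the slapd process once a minute and timestamp each line;
# redirect to a file to chart growth over the replication run.
# while sleep 60; do
#     printf '%s %s\n' "$(date +%T)" "$(mem_kb "$(pgrep -x slapd)")"
# done
```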
changed state Open to Active
changed notes
changed state Active to Test
moved from Incoming to Software Bugs
brandon.hume@dal.ca wrote:
> Full_Name: Brandon Hume
> Version: 2.4.31
> OS: RHEL EL6.1, kernel 2.6.32-131.12.1.el6.x86_64
> URL: http://den.bofh.ca/~hume/ol-2.4.31_memleak.tar.gz
> Submission from: (NULL) (2001:410:a010:2:223:aeff:fe74:400e)
>
> OpenLDAP 2.4.31 compiled in 64-bit with BerkeleyDB 5.3.15 appears to
> exhibit a memory leak while replicating the full database from another
> node in MMR.

Thanks for the detailed report; your test revealed several bugs. The leaks
are now fixed in git master.

There's still another issue, where node 2 starts sending the received changes
back to node 1, even though they came from node 1 originally. This is
triggered because most of your entries were created with sid=0, and syncprov
doesn't know that they actually originated from node 1 (sid=1). That wastes a
lot of CPU/network while it sends over a bunch of data that isn't needed, but
that's all a separate issue from the memory leaks.

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
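On the sid=0 point: the originating server ID is embedded in each entryCSN as
the third '#'-separated field, so it can be checked directly on the entries.
A minimal sketch — the sample CSN values below are illustrative, not taken
from the test kit:

```shell
# Extract the replica SID field from a CSN value.
# CSN layout: <timestamp>#<count>#<sid>#<mod> -- the SID is field 3.
csn_sid() {
    printf '%s\n' "$1" | awk -F'#' '{print $3}'
}

csn_sid '20120606123456.123456Z#000000#000#000000'   # prints 000 (sid=0)
csn_sid '20120606123456.123456Z#000000#001#000000'   # prints 001 (node 1)
```

Entries loaded with slapadd before a serverID was in effect end up with
sid 000 in their CSNs, which matches the situation described above.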
changed notes
changed state Test to Release
changed notes
changed state Release to Closed
fixed in master
fixed in RE24