Full_Name: Henrik Bohnenkamp
Version: >= 2.4.44, HEAD
OS: Ubuntu 18.04, Coreos 7.5
URL: https://github.com/hbo/openldap-mdb-deref-problem
Submission from: (NULL) (77.176.95.241)

This is a followup to http://www.openldap.org/lists/openldap-technical/201805/msg00065.html

When MDB is used as the backend database, with large DITs (O(10^6) entries) containing many alias entries (O(10^5)), search requests with deref=always and scope=sub take prohibitively long. Servers with a high request rate might become utterly unresponsive. This problem is not present in the HDB backend.

With this issue I want to contribute two things:

- a means to demonstrate the problem, in the form of two scripts (bash/python) which set up a large test DIT and start two slapds (one HDB, one MDB) to allow easy performance comparisons
- a patch to fix the problem

The patch is certainly not production ready (or, if it is, it still needs exhaustive testing to inspire confidence); however, I think it is far enough along to at least discuss the approach. Both the scripts and the patch, together with instructions on how to use the former, can be found here: https://github.com/hbo/openldap-mdb-deref-problem

I look forward to discussing the patch.
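[To make the cost concrete, here is a purely illustrative Python sketch — not OpenLDAP code, all names invented — of what deref=always with scope=sub has to do: every alias encountered in scope restarts the subtree expansion at the alias target, so O(10^5) aliases multiply the traversal work.]

```python
# Toy model of a subtree search with alias dereferencing (deref=always,
# scope=sub). Illustrative only; this is NOT how back-mdb is implemented.
# The DIT is a dict mapping a DN to its list of child DNs; aliases map
# their own DN to the DN they point at.

def search_sub_deref(children, aliases, base):
    """Return the set of non-alias DNs reachable from `base`, dereferencing
    every alias found in scope. Each alias spawns a new expansion at its
    target, which is why many aliases make the search expensive."""
    results, seen = set(), set()
    stack = [base]
    while stack:
        dn = stack.pop()
        if dn in seen:                      # guard against alias loops
            continue
        seen.add(dn)
        if dn in aliases:
            stack.append(aliases[dn])       # alias: continue at its target
        else:
            results.add(dn)
            stack.extend(children.get(dn, ()))
    return results
```

[Dereferenced entries appear in the result set while the alias entries themselves do not, matching LDAP alias semantics.]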
--On Friday, July 06, 2018 11:20 PM +0000 hbohnenkamp@united-internet.de wrote:

> Full_Name: Henrik Bohnenkamp
> Version: >= 2.4.44, HEAD
> OS: Ubuntu 18.04, Coreos 7.5
> URL: https://github.com/hbo/openldap-mdb-deref-problem
> Submission from: (NULL) (77.176.95.241)
>
> This is a followup to
> http://www.openldap.org/lists/openldap-technical/201805/msg00065.html

Note that this is a duplicate of ITS#7657.

--Quanah

--
Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
<http://www.symas.com>
changed notes
quanah@symas.com wrote:
> --On Friday, July 06, 2018 11:20 PM +0000 hbohnenkamp@united-internet.de wrote:
>> This is a followup to
>> http://www.openldap.org/lists/openldap-technical/201805/msg00065.html
>
> Note that this is duplicate of ITS#7657.

We may have to consider this for 2.5; since it changes the format of the dn2id index, it's not something we can easily put in at this late stage of 2.4.

--
Howard Chu
CTO, Symas Corp.           http://www.symas.com
Director, Highland Sun     http://highlandsun.com/hyc/
Chief Architect, OpenLDAP  http://www.openldap.org/project/
Some months have passed, but I finally found the time to set up a test rig for regression testing of my patch under LIVE conditions. I'll describe the setup below.

The good news is that today I found the first bug in my patch (handling of scope=one searches is incorrect). I say "good" news because it shows that the testing setup is doing something right.

In related news, it turns out that the implementation of alias dereferencing in the hdb backend apparently is faulty, too... not wrt performance, but purely from a functional point of view. We found a search request which should return around 6700 entries, but only delivers around 6150. My patched mdb implementation returns closer to 6700 entries, but for obvious reasons that might still be wrong as well. I have not yet had the time to find out what goes wrong in the hdb backend. I'll open a bug if I manage to reproduce it somehow under lab conditions.

My test setup is like this: I use the production LDAP content of my company, about 2.4 million entries slapcat'ed from the LDAP master, to test against. All user password hashes are replaced with a default hash so that the test tool can bind successfully. The slapd configuration is essentially the same as for our production slaves (esp. the ACLs). As test cases I use slapd logfiles from our production clusters (approx. 24 million log lines (stats) per server per 24h), mostly binds and searches. Those are directed against an hdb-configured slapd and a patched-mdb-configured slapd (v 2.4.47), and the results are compared: whether binds succeed or fail, and the returned search results. I will extend the tester to also include other operations, but this is what is relevant for testing the patch, I guess. In the taxonomy of test methods I would classify this as "advanced monkey testing".

While writing the test scripts I was wondering whether somebody else has ever done something similar.
There are of course tests in the test directory of the OpenLDAP sources, but are there any (regression) tests (e.g. for OpenLDAP release candidates) which try to simulate large production setups?

--
Henrik Bohnenkamp
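[A log-replay harness like the one described above could be sketched as follows. This is not Henrik's actual tool (which is not part of the ITS); it only illustrates the idea: extract SRCH operations from slapd stats-level log lines and diff the results returned by the two servers. The regex assumes the common slapd stats line format, and `compare_results` is a hypothetical helper.]

```python
import re

# slapd loglevel "stats" lines look roughly like:
#   conn=1002 op=1 SRCH base="dc=example,dc=com" scope=2 deref=3 filter="(uid=jdoe)"
# (format assumed from typical slapd output; adjust for your version/config)
SRCH_RE = re.compile(
    r'conn=(?P<conn>\d+) op=(?P<op>\d+) SRCH '
    r'base="(?P<base>[^"]*)" scope=(?P<scope>\d+) deref=(?P<deref>\d+) '
    r'filter="(?P<filter>[^"]*)"')

def parse_searches(lines):
    """Yield one dict per SRCH log line; all other lines are ignored."""
    for line in lines:
        m = SRCH_RE.search(line)
        if m:
            yield m.groupdict()

def compare_results(hdb_entries, mdb_entries):
    """Compare the DN sets returned by the two servers for one search.
    Returns (missing, extra) from the mdb server's point of view."""
    hdb, mdb = set(hdb_entries), set(mdb_entries)
    return sorted(hdb - mdb), sorted(mdb - hdb)
```

[Replaying each parsed search against both servers and asserting that `compare_results` returns two empty lists is the essence of the "advanced monkey testing" approach.]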
--On Friday, January 25, 2019 1:51 PM +0000 hbohnenkamp@united-internet.de wrote:

> While writing the test scripts I was wondering if somebody else did
> ever do something similar. There are of course tests in the test
> directory of the Openldap sources, but are there any (regression)
> tests made (e.g. for Openldap release candidates) which try to
> simulate large productive setups?

Hi Henrik,

I've been expanding the regression test cases (see tests/data/regressions) and made it possible to run "make its" to trigger executing them prior to releases. However, for obvious reasons that clearly isn't going to include large datasets.

I've been quite interested in adding tests for large data sets that are stored and executed outside of the regular set of OpenLDAP tests and source tree, but so far haven't had a specific case that required such a thing. So if this issue turns out to require testing against a large data set, it could be the first item for that work.

Warm regards,
Quanah
On Fri, Jan 25, 2019 at 06:31:41AM -0800, Quanah Gibson-Mount wrote:

Hi,

I have now done quite a lot of testing in the way described in my email from January (Followup 3). I used version 2.4.47, not the HEAD of the openldap git master branch.

I have found one bug, the one already mentioned in the same email. I have added a fix for it to the patch directory in https://github.com/hbo/openldap-mdb-deref-problem

I have come across the regression with the back-hdb backend again, which I also described in my previous email. I am now convinced that this regression really is a problem with the back-hdb implementation, and I can show it without using my patched implementation of back-mdb. The demonstration requires a bit more work to be really presentable and should in any case go into a different bug report.

Apart from these two issues I have not seen any regressions, and I am now quite confident that my patch would work reliably in the environment of my company... no guarantees for anything else :-)

I have seen that Quanah has suggested looking at ITS#8875 for the 2.4.48 release. Whether there will be a fix in 2.4.48, and whether it will be based on my patch, is of course up to the lead developers (the change in the DB schema is certainly a strong point against), but this patch can perhaps serve as something to test the actual fix against.

Best regards
Henrik

--
Henrik Bohnenkamp
henrik.bohnenkamp@ionos.com wrote:

Hi, thanks for the report and investigation. I've briefly reviewed your patches.

Please squash your #3 into #2; there's no reason to preserve that.

We don't use "inline" in this code. If the compiler is smart enough it will inline automatically anyway. Please drop that from your patch.

No need to keep the old mdb_idscope invocation commented out in patch #2. Just delete it; we have the git history.

In patch #1, delete.c, your patch line #145 is incorrect. You should be using is_entry_alias(e) instead of (op->ora_e). The entry in question is not part of the op request; only its DN is in the Delete request.

Not sure why you dropped nsubs from the diskNode definition. Are you sure you haven't broken the nsubs processing in this patch?

> I have now done quite a lot of testing in the way described in my
> email from January (Followup 3). I used version 2.4.47, not the HEAD
> of the openldap git master branch.
> [...]
> Apart from these two issues I have not seen any regressions, and I am
> now quite confident that my patch would work reliable in the
> environment of my company... no guarantees for anything else :-)

--
Howard Chu
CTO, Symas Corp.
On Thu, Apr 04, 2019 at 05:32:16PM +0100, Howard Chu wrote:

Hi Howard,

thanks for your comments. There is now a new version of the patches available in my github repository. Apart from addressing your comments, I have rebased the patches against the current master branch (quite a lot of activity in the last two months, I have noticed). In particular, the new function mdb_get_aliases now also uses the global size variables for IDL dimensions, rather than the previous CPP constants.

As for your comments:

> Hi, thanks for the report and investigation. I've briefly reviewed your
> patches. Please squash your #3 into #2, there's no reason to preserve that.

done

> We don't use "inline" in this code. If the compiler is smart enough it will
> inline automatically anyway. Please drop that from your patch.

done

> No need to keep the old mdb_idscope invocation commented out in patch #2.
> Just delete it, we have the git history.

done

> In patch #1 delete.c your patch line #145 is incorrect. You should be
> using is_entry_alias(e) instead of (op->ora_e). The entry in question is not
> part of the op request, only its DN is in the Delete request.

Ah, that was a mistake when squashing my commits down to the two patches... patch 1 introduced the faulty line with op->ora_e, patch 2 changed op->ora_e to e. I have cleaned that up.

> Not sure why you dropped nsubs from the diskNode definition. Are you sure you
> haven't broken the nsubs processing in this patch?

Note that the nsubs declaration was always commented out and thus had only a decorative function. Anyway, I have added the comment back to reduce the footprint of the patches.

The current master branch with the two patches builds without problems. I have run the mdb tests in the tests directory; the last test was not applied due to missing software. All other 74 tests passed (Ubuntu 18.04).

Best regards
Henrik

--
Henrik Bohnenkamp
has patch
See also ITS#7657
--On Sunday, April 07, 2019 10:15 AM +0000 henrik.bohnenkamp@ionos.com wrote:

> On Thu, Apr 04, 2019 at 05:32:16PM +0100, Howard Chu wrote:
>
> thanks for your comments. There is now a new version of patches available
> in my github repository. Apart from addressing your comments, I have
> rebased the patches against the current master branch.

Had a customer who was hitting this issue try out these patches -- it greatly decreases the search time (from unknown/infinite to 1 minute). Unfortunately slapd then segv's. Working on getting a test database with which to reproduce the issue for a good backtrace.

--Quanah
On Thu, Jun 27, 2019 at 10:34:19PM -0700, Quanah Gibson-Mount wrote:
> Had a customer who was hitting this issue try out these patches -- It
> greatly decreases the search time (from unknown/infinite to 1 minute).
> Unfortunately slapd then segv's. Working on getting a test database to
> reproduce the issue with for a good backtrace.

Oh, that's too bad. I'd be happy to participate in the bug hunt. May I ask: which version did your customer use? I have modified the patch to work with the recent changes on master, but I did not dig in very deeply to understand those changes. So I must admit that these modifications to the patch are completely untested (except for the tests in the tests subdir).

Things might work better with the 2.4.46/47 patch, if that is an option for your customer.

Wrt getting a test tree, maybe the python script I have in the github repo for the patch can provide one: https://github.com/hbo/openldap-mdb-deref-problem/blob/master/scripts/write_tree.py

I'm of course very interested in keeping this patch alive, so if you have details on the segv, please let me know.

Henrik

--
Dr. rer. nat. Henrik Bohnenkamp
Senior Sysadmin
IT Operations Monitoring & Infrastructure
1&1 IONOS SE | Brauerstraße 48 | 76135 Karlsruhe | Germany
Phone: +49 721 91374 - 4159
E-mail: henrik.bohnenkamp@ionos.com | Web: www.ionos.de
henrik.bohnenkamp@ionos.com wrote:
> Oh, that's too bad. I'd be happy to participate in the bug hunt. May I
> ask: which version did your customer use?
> [...]
> I'm of course very interested to keep this patch alive, so if you have
> details on the segv, please let me know.

Fyi, on our problematic test database with 11M entries and 3.7M aliases, a search with -a always, starting from the DB suffix, took 4 minutes without this patch, and 1235 minutes with this patch.

Needless to say, that's not looking good. Still checking other test cases.

--
Howard Chu
CTO, Symas Corp.
On Mon, Jul 15, 2019 at 02:26:59PM +0100, Howard Chu wrote:
> Fyi, on our problematic test database with 11M entries and 3.7M aliases, a search with -a always, starting from the
> DB suffix, took 4 minutes without this patch, and 1235 minutes with this patch.
>
> Needless to say, that's not looking good. Still checking other test cases.

Interesting, so the behavior is reversed now :-). I assume you have found an alternative approach to solving the problem. That's fine with me: I want the problem solved, not my patch integrated. I'm of course interested in how you did it. Surely you did not get the 4 minutes with a stock 2.4.48 slapd?
Henrik Bohnenkamp wrote:
> Interesting, so the behavior is reversed now :-). I assume you have
> found an alternative approach to solve the problem. That's fine with
> me, I want the problem solved, not my patch integrated. I'm of course
> interested in how you do it. Surely you did not get the 4 minutes with
> a stock 2.4.48 slapd?

For this size of DB we needed the ITS#8977 patches to accommodate larger IDLs. (I used 24 bits for IDLs, 16.7M slots.)

Also at this size, the IDL processing itself is the main bottleneck now. We would need to switch to bitmaps or trees to avoid this bottleneck, but that's also a much larger change than we can consider for this release.

--
Howard Chu
CTO, Symas Corp.
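[The IDL bottleneck mentioned above, and why bitmaps would help, can be illustrated with a small sketch (Python standing in for the C code; none of this is actual back-mdb source). An IDL is a sorted array of entry IDs; intersecting two of them is a linear merge touching every slot, whereas a bitmap representation intersects 64 candidate IDs per machine word with a single AND.]

```python
def idl_intersect(a, b):
    """Intersection of two sorted ID lists via linear merge: O(len(a)+len(b)),
    touching every element of both lists -- the cost that dominates at 16.7M slots."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def to_bitmap(idl):
    """Pack an ID list into a bit set (a Python int models an array of words)."""
    bits = 0
    for eid in idl:
        bits |= 1 << eid
    return bits

def bitmap_intersect(a_bits, b_bits):
    """The same operation on bitmaps: one AND per word of 64 IDs."""
    return a_bits & b_bits
```

[The trade-off is density: bitmaps win when the ID space is compact relative to the list sizes, which is one reason switching representations is a larger design change than a point fix.]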
Howard Chu wrote:
> For this size of DB we needed the ITS#8977 patches to accommodate larger IDLs.
> (I used 24 bits for IDLs, 16.7M slots)
> Also at this size, the IDL processing itself is the main bottleneck now. We would
> need to switch to bitmaps or trees to avoid this bottleneck, but that's also a
> much larger change than we can consider for this release.

I've set up a more modest test database along the lines of ITS#7657. It has 500,000 users, 30,000 aliases total, and 435 in ou=alias2 (all the rest under ou=alias1).
For unpatched back-mdb:

time ../clients/tools/ldapsearch -x -H ldap://:9012 -D cn=manager,dc=example,dc=com -w secret -b ou=alias1,dc=example,dc=com -a always
# search result
search: 2
result: 0 Success
# numResponses: 29567
# numEntries: 29566

real    0m42.504s
user    0m1.344s
sys     0m2.996s

time ../clients/tools/ldapsearch -x -H ldap://:9012 -D cn=manager,dc=example,dc=com -w secret -b ou=alias2,dc=example,dc=com -a always
# search result
search: 2
result: 0 Success
# numResponses: 437
# numEntries: 436

real    0m48.406s
user    0m0.040s
sys     0m0.076s

For back-mdb with e90e8c7d3c12d897bb0584ba04dc519d4f23acf9:

time ../clients/tools/ldapsearch -x -H ldap://:9012 -D cn=manager,dc=example,dc=com -w secret -b ou=alias1,dc=example,dc=com -a always
# search result
search: 2
result: 0 Success
# numResponses: 29567
# numEntries: 29566

real    0m5.500s
user    0m1.516s
sys     0m2.944s

time ../clients/tools/ldapsearch -x -H ldap://:9012 -D cn=manager,dc=example,dc=com -w secret -b ou=alias2,dc=example,dc=com -a always
# search result
search: 2
result: 0 Success
# numResponses: 437
# numEntries: 436

real    0m0.399s
user    0m0.048s
sys     0m0.060s

For back-mdb with this ITS#8875 patch:

time ../clients/tools/ldapsearch -x -H ldap://:9012 -D cn=manager,dc=example,dc=com -w secret -b ou=alias1,dc=example,dc=com -a always
# search result
search: 2
result: 0 Success
# numResponses: 29567
# numEntries: 29566

real    0m6.020s
user    0m1.640s
sys     0m3.372s

time ../clients/tools/ldapsearch -x -H ldap://:9012 -D cn=manager,dc=example,dc=com -w secret -b ou=alias2,dc=example,dc=com -a always
# search result
search: 2
result: 0 Success
# numResponses: 437
# numEntries: 436

real    0m0.203s
user    0m0.052s
sys     0m0.048s

It seems close enough in this case (I didn't do enough repeated runs to average out any measurement error), while the committed patch performs better on the really ugly test case.

The tool to generate the test LDIF is attached.
It reads an LDIF containing 500,000 users on stdin, and outputs the same LDIF, with aliases interspersed, on stdout.

--
Howard Chu
CTO, Symas Corp.
hyc@symas.com wrote:
> The tool to generate the test LDIF is attached. It reads an LDIF containing 500,000 users on stdin, and outputs the same LDIF,
> with aliases interspersed, on stdout.

Slightly tweaked: this version creates the alias after the target entry, in case the server does referential-integrity checking on loading.

--
Howard Chu
CTO, Symas Corp.
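[The attached tool itself is not reproduced in the ITS text, but its described behavior — copy LDIF through, emitting an alias entry right after its target entry — can be approximated like this. A sketch only: the alias OU, the naming scheme, and the interspersal interval are assumptions, not the tool's actual parameters.]

```python
def alias_for(entry_lines, alias_ou="ou=alias1,dc=example,dc=com"):
    """Build an alias entry (as LDIF lines) pointing at the entry whose
    LDIF lines are given. Assumes the first line is 'dn: ...'."""
    target_dn = entry_lines[0][len("dn: "):].strip()
    rdn = target_dn.split(",", 1)[0]          # e.g. "uid=user0001"
    return [
        f"dn: {rdn},{alias_ou}",
        "objectClass: alias",
        "objectClass: extensibleObject",
        f"aliasedObjectName: {target_dn}",
    ]

def intersperse(ldif_lines, every=17):
    """Copy blank-line-separated LDIF entries through, appending an alias
    entry after every `every`-th user entry (interval chosen arbitrarily).
    Emitting the alias after its target keeps referential integrity happy."""
    out, entry, count = [], [], 0
    for line in list(ldif_lines) + [""]:      # trailing sentinel flushes last entry
        if line.strip():
            entry.append(line.rstrip("\n"))
            continue
        if entry:
            out.extend(entry)
            out.append("")
            count += 1
            if count % every == 0 and entry[0].startswith("dn: uid="):
                out.extend(alias_for(entry))
                out.append("")
            entry = []
    return out
```

[Wiring `sys.stdin` and `sys.stdout` around `intersperse` gives the stdin-to-stdout filter shape the attached tool is described as having.]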
On Mon, Jul 15, 2019 at 07:44:05PM +0100, Howard Chu wrote:
>>> Fyi, on our problematic test database with 11M entries and 3.7M aliases, a search with -a always, starting from the
>>> DB suffix, took 4 minutes without this patch, and 1235 minutes with this patch.
>>>
>>> Needless to say, that's not looking good. Still checking other test cases.

I start to understand why this is the case. I believe that our two approaches to solving the problem each have their nemesis...

I have constructed a test case where the patched slapd needs 2 minutes (1 minute on a repeat) for a search for people objects, but the master branch has yet to finish after 30 minutes (and might not finish today).

The test case is modeled on what is done in my company: group objects have not only member entries for people; each member also has an alias to the corresponding person object, these aliases being children of the respective group object. Furthermore, subgroups/group inclusion is modeled by adding aliases (also as children) to the group object, pointing to the group objects representing the included groups. So there is some recursive descent to get all person objects of all subgroups. In total I have 5 million entries, 3 million of which are aliases. There are no cycles, but in this test case the depth of the recursion is absolutely unrealistic.

I believe that in the master slapd, each step down in the recursion needs a call to mdb_idscope, which in turn goes through all 3 million aliases to compute the intersection with the scope of the group object (that's what I remember, but I have not looked at the algorithm for a year now). The patch, in this case, traverses just very small subtrees (although many of them) to collect the relevant aliases and continues from there. That seems to be much more efficient here.
If I understand your description above correctly, my patch suffers with your test case because it does a full traversal of the LDAP tree to collect all aliases... with 11 million entries that takes a while. The master branch, however, computes the intersection with the scope just once and continues from there, which must be much faster.

I actually do not know what to conclude from all this. I wonder, however, whether a combination of both approaches could be beneficial. With the patch, the diskNode of an entry contains not only the number of children in its scope, but also the number of aliases. If the total number of aliases in the database is large but the search scope is small, the tree traversal might be faster. If the number of entries in the scope is big and there are aliases in the search scope, the intersection in mdb_idscope might be more efficient. If there are no aliases in the search scope, nothing has to be done at all.
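[The closing suggestion — use the per-entry alias counts that the patched diskNode would store to pick a strategy per search — could be sketched as a simple cost model. Entirely hypothetical: the decision rule and the proxy costs are illustrative, not measured, and nothing like this exists in either patch.]

```python
def choose_deref_strategy(aliases_in_scope, total_aliases, entries_in_scope):
    """Pick, per search, between the two approaches discussed in this ITS.

    - 'none':      no aliases under the search base -> no deref work at all
    - 'traverse':  walk the scope collecting its aliases (the patch's approach;
                   cost roughly proportional to entries_in_scope)
    - 'intersect': intersect the global alias set with the scope (the
                   mdb_idscope approach; cost roughly proportional to
                   total_aliases)
    The cost proxies are stand-ins for real measurements.
    """
    if aliases_in_scope == 0:
        return "none"
    if entries_in_scope < total_aliases:
        return "traverse"
    return "intersect"
```

[With numbers from the thread: a small group subtree inside a 3-million-alias database would pick 'traverse', while a suffix-wide search over 11 million entries with 3.7 million aliases would pick 'intersect' — matching which approach won in each reported test.]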