Full_Name: Ryan Steele Version: 2.4.23 OS: Ubuntu Server URL: ftp://ftp.openldap.org/incoming/ryan-steele-110215.proxycache-failure.log Submission from: (NULL) (207.106.239.81) I use back-ldap + proxycache on many of my servers to reduce network traffic and to alleviate load on the masters, as well as to maintain service continuity in the event of a network failure. However, we have recently been noticing an issue where the proxycache database claims that it has the data and that the query is answerable, but fails to read data from any of the indices it thinks the data is at. It happens randomly, and to random entries. We do not cache negative search results, so the cache should never return nentries=0 authoritatively. I can temporarily fix it for some broken users by restarting slapd and clearing the cache (i.e., pcachePersist is set to FALSE), but inevitably others stop working (or sometimes, the same users stop working). When this happens, most entries are still served from the cache just fine, but the entries that aren't never will unless slapd is restarted. I have tested this with 2.4.17, 2.4.21, and 2.4.23, using the amd64 architecture, and with libdb4.6 and libdb4.7. Included below is the slapd.conf I use on my back-ldap + proxycache nodes, an example of the behavior using ldapsearch, and the log messages during a failed search in which using log level 16383: ## Proxycache slapd configuration # Schema include /etc/ldap/schema/core.schema include /etc/ldap/schema/collective.schema include /etc/ldap/schema/corba.schema include /etc/ldap/schema/cosine.schema include /etc/ldap/schema/duaconf.schema include /etc/ldap/schema/dyngroup.schema include /etc/ldap/schema/inetorgperson.schema include /etc/ldap/schema/java.schema include /etc/ldap/schema/misc.schema include /etc/ldap/schema/openldap.schema include /etc/ldap/schema/ppolicy.schema include /etc/ldap/schema/examplecom.schema include /etc/ldap/schema/rfc2307bis.schema include /etc/ldap/schema/samba.schema include /etc/ldap/schema/apple_auxiliary.schema include /etc/ldap/schema/apple.schema # System pidfile /var/run/slapd/slapd.pid argsfile /var/run/slapd/slapd.args loglevel stats TLSCACertificateFile /etc/ldap/ssl/certs/cacert.pem TLSCertificateFile /etc/ldap/ssl/certs/openldap.cert.pem TLSCertificateKeyFile /etc/ldap/ssl/keys/openldap.key.pem TLSVerifyClient never # Modules modulepath /usr/lib/ldap moduleload back_ldap.la moduleload back_hdb.la moduleload pcache.la # Back-LDAP database ldap uri "ldap://ldapmaster.example.com" suffix "dc=example,dc=com" rootdn "cn=admin,dc=example,dc=com" rootpw SECRET tls start # ACLs access to attrs=userPassword by tls_ssf=128 ssf=128 self write by tls_ssf=128 ssf=128 anonymous auth by tls_ssf=128 ssf=128 group/groupOfURLs/Member="cn=ops,ou=Groups,dc=example,dc=com" write by tls_ssf=128 ssf=128 * compare access to * by tls_ssf=128 ssf=128 self write by tls_ssf=128 ssf=128 group/groupOfURLs/Member="cn=ops,ou=Groups,dc=example,dc=com" write by tls_ssf=128 ssf=128 * read # ProxyCache overlay pcache proxycache hdb 500000 1 5000 86400 directory /var/lib/ldap/proxycache index cn eq index departmentName eq index entryCSN eq index entryUUID eq index gidNumber eq index mail eq index member eq index memberUid eq index objectClass eq index pcacheQueryid eq index uid eq index uidNumber eq index uniqueMember eq proxycachequeries 1000000 proxyattrset 0 apple-user-homeDirectory blogCategory cn dateCreated departmentName departmentNumber description displayColor employeeNumber gecos getsPages gidNumber givenName homeDirectory htaccessPasswd isAvailable isPhoneOperator lastAdminVisit loginShell mail manager member memberUid mobile mobileEmail numTickets objectClass ou phoneExtension sn sortOrder uid uidNumber uniqueMember userPassword proxytemplate (blogCategory=) 0 86400 proxytemplate (cn=) 0 86400 proxytemplate (dateCreated=) 0 86400 proxytemplate (departmentName=) 0 86400 proxytemplate (departmentNumber=) 0 86400 proxytemplate (description=) 0 86400 proxytemplate (displayColor=) 0 86400 proxytemplate (employeeNumber=) 0 86400 proxytemplate (gecos=) 0 86400 proxytemplate (getsPages=) 0 86400 proxytemplate (gidNumber=) 0 86400 proxytemplate (givenName=) 0 86400 proxytemplate (homeDirectory=) 0 86400 proxytemplate (apple-user-homeDirectory=) 0 86400 proxytemplate (htaccessPasswd=) 0 86400 proxytemplate (isAvailable=) 0 86400 proxytemplate (isPhoneOperator=) 0 86400 proxytemplate (lastAdminVisit=) 0 86400 proxytemplate (loginShell=) 0 86400 proxytemplate (mail=) 0 86400 proxytemplate (manager=) 0 86400 proxytemplate (member=) 0 86400 proxytemplate (memberUid=) 0 86400 proxytemplate (memberURL=) 0 86400 proxytemplate (mobile=) 0 86400 proxytemplate (mobileEmail=) 0 86400 proxytemplate (numTickets=) 0 86400 proxytemplate (objectClass=) 0 86400 proxytemplate (ou=) 0 86400 proxytemplate (phoneExtension=) 0 86400 proxytemplate (sn=) 0 86400 proxytemplate (sortOrder=) 0 86400 proxytemplate (uid=) 0 86400 proxytemplate (uidNumber=) 0 86400 proxytemplate (uniqueMember=) 0 86400 proxytemplate (|(memberUid=)(member=)) 0 86400 proxytemplate (|(memberUid=)(uniqueMember=)) 0 86400 proxytemplate (&(objectClass=)(uid=)) 0 86400 proxytemplate (&(objectClass=)(memberUid=)) 0 86400 proxytemplate (&(objectClass=)(uniqueMember=)) 0 86400 proxytemplate (&(objectClass=)(uidNumber=)) 0 86400 proxytemplate (&(objectClass=)(gidNumber=)) 0 86400 proxytemplate (&(objectClass=)(|(memberUid=)(member=))) 0 86400 proxytemplate (&(objectClass=)(|(memberUid=)(uniqueMember=))) 0 86400 proxytemplate (&(objectClass=)(member=)) 0 86400 proxytemplate (&(objectClass=)(cn=)) 0 86400 proxytemplate (&(|(objectClass=)(objectClass=))(uid=)) 0 86400 ## example of the failures using ldapsearch bash:~# for i in `seq 1 14`; do echo "PROCESSING jdoe$i"; ldapsearch -x -H ldaps://localhost -LLL -b ou=Users,dc=example,dc=com '(&(|(objectClass=examplecomEmployee)(objectClass=examplecomUtilityUser))(uid=jdoe'$i'))' uid; sleep 1; done PROCESSING jdoe1 dn: uid=jdoe1,ou=Users,dc=example,dc=com uid: jdoe1 PROCESSING jdoe2 dn: uid=jdoe2,ou=Users,dc=example,dc=com uid: jdoe2 PROCESSING jdoe3 dn: uid=jdoe3,ou=Users,dc=example,dc=com uid: jdoe3 PROCESSING jdoe4 dn: uid=jdoe4,ou=Users,dc=example,dc=com uid: jdoe4 PROCESSING jdoe5 PROCESSING jdoe6 PROCESSING jdoe7 PROCESSING jdoe8 PROCESSING jdoe9 PROCESSING jdoe10 PROCESSING jdoe11 PROCESSING jdoe12 PROCESSING jdoe13 PROCESSING jdoe14 bash:~# A log file (with log level set to 16383) showing what happens when the cache is queried and it responds with "QUERY ANSWERABLE", and then fails to read data from any of the indices referenced, can be found at ftp://ftp.openldap.org/incoming/ryan-steele-110215.proxycache-failure.log. It seems similar to ITS#6242, but my version of pcache.c, at least in the 2.4.21 and 2.4.23 versions of OpenLDAP, definitely contain that patch, as I can see it in the source (the manageDSAit control). Please let me know if you need any other information to debug this problem (e.g., specific variables from a debugger run, copy of a proxycache database experiencing the problem, etc.)
This may also be related to ITS#6217 (http://www.openldap.org/its/index.cgi?findid=6217). In fact, I'm pretty sure it is. In response to the last question in that ITS posed by Howard Chu (Followup 18), the schema on the proxy servers match the schema available on the primary servers exactly. If not, it wouldn't work for some users and not others simultaneously (that is, users who have the same objectClasses and attributes, and differ only in the user-specific attribute values like uid, gecos, phone extension, etc.). Hopefully the data and logs I was able to provide (and any debugger output/cache databases that might be requested of me as the ITS moves forward) will help to fill in some of the blanks.
I believe this may be the same issue as in #6217, given this behavior with two users which have the same objectclasses, same attributes, and differ only in the attribute values (e.g., their name, phone number, and so on): The initial query, where one entry doesn't get returned as it should: ===================================================================== bash:~# ldapsearch -x -H ldaps://localhost -LLL -b ou=Users,dc=example,dc=com '(&(|(objectClass=examplecomEmployee)(objectClass=examplecomUtilityUser))(uid=jdoe1))' uid dn: uid=jdoe1,ou=Users,dc=example,dc=com uid: jdoe1 bash:~# ldapsearch -x -H ldaps://localhost -LLL -b ou=Users,dc=example,dc=com '(&(|(objectClass=examplecomEmployee)(objectClass=examplecomUtilityUser))(uid=jdoe2))' uid If I reverse the filter, I get the expected results: ===================================================================== bash:~# ldapsearch -x -H ldaps://localhost -LLL -b ou=Users,dc=example,dc=com '(&(|(objectClass=examplecomUtilityUser)(objectClass=examplecomEmployee))(uid=jdoe1))' uid dn: uid=jdoe1,ou=Users,dc=example,dc=com uid: jdoe1 bash:~# ldapsearch -x -H ldaps://localhost -LLL -b ou=Users,dc=example,dc=com '(&(|(objectClass=examplecomUtilityUser)(objectClass=examplecomEmployee))(uid=jdoe2))' uid dn: uid=jdoe2,ou=Users,dc=example,dc=com uid: jdoe2 Furthermore, if I now try the first query again, it too works: ===================================================================== bash:~# ldapsearch -x -H ldaps://localhost -LLL -b ou=Users,dc=example,dc=com '(&(|(objectClass=examplecomEmployee)(objectClass=examplecomUtilityUser))(uid=jdoe1))' uid dn: uid=jdoe1,ou=Users,dc=example,dc=com uid: jdoe1 bash:~# ldapsearch -x -H ldaps://localhost -LLL -b ou=Users,dc=example,dc=com '(&(|(objectClass=examplecomEmployee)(objectClass=examplecomUtilityUser))(uid=jdoe2))' uid dn: uid=jdoe2,ou=Users,dc=example,dc=com uid: jdoe2 I already have the patch for ITS#5756 applied in all the versions I've tested (2.4.17, 2.4.21, and 2.4.23). I can provide a sample database which exhibits this behavior if needed, as I am working on providing a self-contained sanitized example already. If the developers agree that the issue described in these two ITS's are one and the same, I'm fine with marking this bug as a duplicate of ITS#6217 to avoid fragmentation of debugging efforts.
I just uploaded two log file snippets for consideration. One is a failed query to the proxycache database exhibiting this problem, and the other is a successful query to the proxycache database. A gzipp'ed tarball can be found here: ftp://ftp.openldap.org/incoming/ryan-steele-110221.proxycache-logs-for-comparison.tar.gz. Up until the point where the unsuccessful query fails to read the key for the UID equality candidate, both the failed query and the successful query are locking the same exact QC index (0x7f50429860f0), trying to fetch the exact same keys ([b49d1940], [807e234d], [8f951c4a]), and comparing the same equality/filter/list candidates ("id=9, first=2699, last=2731"). This is what I would expect, since the values for the two objectClasses in the search filter are the same. However, once it gets to the point where it compares the bdb_equality_candidates for the UID, the values of the bdb_filter_candidates, bdb_equality_candidates, and bdb_list_candidates in logs for the unsuccessful query are a little confusing. I gather that a value of "0" for id/first/last is normal when it cannot find the index in question (as it does for the UID value that it said was "ANSWERABLE" but was unable to find), but the "first=2" and "first=2699" values seem a little strange, especially since the "first=2699" value supposedly matches the bdb_equality_candidates value found at one of the indices for the objectClass searches. I have no idea where the value "2" comes from, but it too seems out of place if it couldn't find the index. Perhaps those values make more sense in the context where it thought it *could* find the values, but then wasn't able to (though I admit that sounds like a pretty big stretch). ## UID portion of unsuccessful query Feb 21 10:02:00 blog2 slapd[18797]: bdb_idl_fetch_key: [8f951c4a] Feb 21 10:02:00 blog2 slapd[18797]: <= bdb_index_read 9 candidates Feb 21 10:02:00 blog2 slapd[18797]: <= bdb_equality_candidates: id=9, first=2699, last=2731 Feb 21 10:02:00 blog2 slapd[18797]: <= bdb_filter_candidates: id=9 first=2699 last=2731 Feb 21 10:02:00 blog2 slapd[18797]: <= bdb_list_candidates: id=9 first=2699 last=2731 Feb 21 10:02:00 blog2 slapd[18797]: <= bdb_filter_candidates: id=9 first=2699 last=2731 Feb 21 10:02:00 blog2 slapd[18797]: => bdb_filter_candidates Feb 21 10:02:00 blog2 slapd[18797]: EQUALITY Feb 21 10:02:00 blog2 slapd[18797]: => bdb_equality_candidates (uid) Feb 21 10:02:00 blog2 slapd[18797]: => key_read Feb 21 10:02:00 blog2 slapd[18797]: bdb_idl_fetch_key: [efb8b856] Feb 21 10:02:00 blog2 slapd[18797]: <= bdb_equality_candidates: id=0, first=0, last=0 Feb 21 10:02:00 blog2 slapd[18797]: <= bdb_filter_candidates: id=0 first=0 last=0 Feb 21 10:02:00 blog2 slapd[18797]: <= bdb_list_candidates: id=0 first=2699 last=0 Feb 21 10:02:00 blog2 slapd[18797]: <= bdb_filter_candidates: id=0 first=2699 last=0 Feb 21 10:02:00 blog2 slapd[18797]: <= bdb_list_candidates: id=0 first=0 last=0 Feb 21 10:02:00 blog2 slapd[18797]: <= bdb_filter_candidates: id=0 first=0 last=0 Feb 21 10:02:00 blog2 slapd[18797]: <= bdb_list_candidates: id=0 first=2 last=0 Feb 21 10:02:00 blog2 slapd[18797]: <= bdb_filter_candidates: id=0 first=2 last=0 Feb 21 10:02:00 blog2 slapd[18797]: bdb_search_candidates: id=0 first=2 last=0 Feb 21 10:02:00 blog2 slapd[18797]: hdb_search: no candidates ## UID portion of successful query Feb 21 10:06:22 blog2 slapd[18797]: bdb_idl_fetch_key: [8f951c4a] Feb 21 10:06:22 blog2 slapd[18797]: <= bdb_index_read 9 candidates Feb 21 10:06:22 blog2 slapd[18797]: <= bdb_equality_candidates: id=9, first=2699, last=2731 Feb 21 10:06:22 blog2 slapd[18797]: <= bdb_filter_candidates: id=9 first=2699 last=2731 Feb 21 10:06:22 blog2 slapd[18797]: <= bdb_list_candidates: id=9 first=2699 last=2731 Feb 21 10:06:22 blog2 slapd[18797]: <= bdb_filter_candidates: id=9 first=2699 last=2731 Feb 21 10:06:22 blog2 slapd[18797]: => bdb_filter_candidates Feb 21 10:06:22 blog2 slapd[18797]: EQUALITY Feb 21 10:06:22 blog2 slapd[18797]: => bdb_equality_candidates (uid) Feb 21 10:06:22 blog2 slapd[18797]: => key_read Feb 21 10:06:22 blog2 slapd[18797]: bdb_idl_fetch_key: [ea38ed84] Feb 21 10:06:22 blog2 slapd[18797]: <= bdb_index_read 1 candidates Feb 21 10:06:22 blog2 slapd[18797]: <= bdb_equality_candidates: id=1, first=2701, last=2701 Feb 21 10:06:22 blog2 slapd[18797]: <= bdb_filter_candidates: id=1 first=2701 last=2701 Feb 21 10:06:22 blog2 slapd[18797]: <= bdb_list_candidates: id=1 first=2701 last=2701 Feb 21 10:06:22 blog2 slapd[18797]: <= bdb_filter_candidates: id=1 first=2701 last=2701 Feb 21 10:06:22 blog2 slapd[18797]: <= bdb_list_candidates: id=1 first=2701 last=2701 Feb 21 10:06:22 blog2 slapd[18797]: <= bdb_filter_candidates: id=1 first=2701 last=2701 Feb 21 10:06:22 blog2 slapd[18797]: <= bdb_list_candidates: id=1 first=2701 last=2701 Feb 21 10:06:22 blog2 slapd[18797]: <= bdb_filter_candidates: id=1 first=2701 last=2701 Feb 21 10:06:22 blog2 slapd[18797]: bdb_search_candidates: id=1 first=2701 last=2701 Feb 21 10:06:22 blog2 slapd[18797]: => test_filter I unfortunately don't have enough knowledge about what's going here, or where things go awry in the codebase, to abstract something more meaningful from these logs, but hopefully this provides a little bit of framing/contextual information to someone else that does.
Additionally, today I found myself in a situation where reversing the order of *just* the objectClasses in the filter - i.e., from (&(|(objectClass=foo)(objectClass=bar))(uid=jdoe1)) to (&(|(objectClass=bar)(objectClass=foo))(uid=jdoe1)) - didn't fix things, but changing the search base (from ou=Users,dc=example,dc=com to dc=example,dc=com) did. I was then able to issue the original search with searchbase ou=Users,dc=example,dc=com and filter (&(|(objectClass=foo)(objectClass=bar))(uid=jdoe1)) and it worked fine.
Some more information: once I changed the search base from ou=Users,dc=example,dc=com to dc=example,dc=com to "fix" the problem for that particular user (after which the original query with the ou=Users,dc=example,dc=com search base began working again), the following entries popped up in the log: ftp://ftp.openldap.org/incoming/ryan-steele-110221.proxycache-log-fixing-broken-entries.log.1 It gets stranger, though. If I start out by using the root entry as the search base (dc=example,dc=com) with a user who is not yet in the cache, not only will it say it is ANSWERABLE (which it clearly isn't) and then return nothing without even trying to reach out to the upstream host, but if I then change the search base to ou=Users,dc=example,dc=com, I get the following error on the CLI (and this is with a successful bind as the directory admin): ldapsearch -x -D cn=admin,dc=example,dc=com -y /etc/ldap.secret -H ldaps://localhost -LLL -b cn=admin,dc=example,dc=com '(&(|(objectClass=examplecomUtilityUser)(objectClass=examplecomEmployee))(uid=jdoe4))' uid No such object (32) Matched DN: dc=example,dc=com Debugging logs generated by that query can be found here: ftp://ftp.openldap.org/incoming/ryan-steele-110221.proxycache-log-attempt-to-fix-broken-entries-error.log.1
ryans@aweber.com wrote: > Some more information: once I changed the search base from ou=Users,dc=example,dc=com to dc=example,dc=com to "fix" the > problem for that particular user (after which the original query with the ou=Users,dc=example,dc=com search base began > working again), the following entries popped up in the log: > ftp://ftp.openldap.org/incoming/ryan-steele-110221.proxycache-log-fixing-broken-entries.log.1 > > It gets stranger, though. If I start out by using the root entry as the search base (dc=example,dc=com) with a user who > is not yet in the cache, not only will it say it is ANSWERABLE (which it clearly isn't) and then return nothing without > even trying to reach out to the upstream host, but if I then change the search base to ou=Users,dc=example,dc=com, I get > the following error on the CLI (and this is with a successful bind as the directory admin): > > > ldapsearch -x -D cn=admin,dc=example,dc=com -y /etc/ldap.secret -H ldaps://localhost -LLL -b cn=admin,dc=example,dc=com > '(&(|(objectClass=examplecomUtilityUser)(objectClass=examplecomEmployee))(uid=jdoe4))' uid > No such object (32) > Matched DN: dc=example,dc=com > > Debugging logs generated by that query can be found here: > ftp://ftp.openldap.org/incoming/ryan-steele-110221.proxycache-log-attempt-to-fix-broken-entries-error.log.1 I'm not sure I reproduced all the behaviors you were seeing, but I've reproduced at least one of them and fixed it in git. Please pull the latest pcache.c and test, thanks. -- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/
changed state Open to Test
changed notes moved from Incoming to Software Bugs
changed notes changed state Test to Release
changed notes changed state Release to Closed
fixed in HEAD fixed in RE24