Issue 6831 - Proxycache database corruption
Summary: Proxycache database corruption
Status: VERIFIED FIXED
Alias: None
Product: OpenLDAP
Classification: Unclassified
Component: slapd (show other issues)
Version: 2.4.23
Hardware: All All
: --- normal
Target Milestone: ---
Assignee: OpenLDAP project
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-02-15 16:58 UTC by ryans@aweber.com
Modified: 2014-08-01 21:04 UTC (History)
0 users

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this issue.
Description ryans@aweber.com 2011-02-15 16:58:21 UTC
Full_Name: Ryan Steele
Version: 2.4.23
OS: Ubuntu Server
URL: ftp://ftp.openldap.org/incoming/ryan-steele-110215.proxycache-failure.log
Submission from: (NULL) (207.106.239.81)


I use back-ldap + proxycache on many of my servers to reduce network traffic and
to alleviate load on the masters, as well as to maintain service continuity in
the event of a network failure.  However, we have recently been noticing an
issue where the proxycache database claims that it has the data and that the
query is answerable, but fails to read data from any of the indices it thinks
the data is at.  It happens randomly, and to random entries.  We do not cache
negative search results, so the cache should never return nentries=0
authoritatively.  I can temporarily fix it for some broken users by restarting
slapd and clearing the cache (i.e., pcachePersist is set to FALSE), but
inevitably others stop working (or sometimes, the same users stop working). 
When this happens, most entries are still served from the cache just fine, but
the entries that aren't never will unless slapd is restarted.  I have tested
this with 2.4.17, 2.4.21, and 2.4.23, using the amd64 architecture, and with
libdb4.6 and libdb4.7.

Included below is the slapd.conf I use on my back-ldap + proxycache nodes, an
example of the behavior using ldapsearch, and the log messages during a failed
search in which using log level 16383:

## Proxycache slapd configuration

# Schema
include               /etc/ldap/schema/core.schema
include               /etc/ldap/schema/collective.schema
include               /etc/ldap/schema/corba.schema
include               /etc/ldap/schema/cosine.schema
include               /etc/ldap/schema/duaconf.schema
include               /etc/ldap/schema/dyngroup.schema
include               /etc/ldap/schema/inetorgperson.schema
include               /etc/ldap/schema/java.schema
include               /etc/ldap/schema/misc.schema
include               /etc/ldap/schema/openldap.schema
include               /etc/ldap/schema/ppolicy.schema
include               /etc/ldap/schema/examplecom.schema
include               /etc/ldap/schema/rfc2307bis.schema
include               /etc/ldap/schema/samba.schema
include               /etc/ldap/schema/apple_auxiliary.schema
include               /etc/ldap/schema/apple.schema

# System
pidfile               /var/run/slapd/slapd.pid
argsfile              /var/run/slapd/slapd.args
loglevel              stats
TLSCACertificateFile  /etc/ldap/ssl/certs/cacert.pem
TLSCertificateFile    /etc/ldap/ssl/certs/openldap.cert.pem
TLSCertificateKeyFile /etc/ldap/ssl/keys/openldap.key.pem
TLSVerifyClient       never

# Modules
modulepath            /usr/lib/ldap
moduleload            back_ldap.la
moduleload            back_hdb.la
moduleload            pcache.la

# Back-LDAP
database              ldap
uri                   "ldap://ldapmaster.example.com"
suffix                "dc=example,dc=com"
rootdn                "cn=admin,dc=example,dc=com"
rootpw                SECRET
tls                   start

# ACLs
access to attrs=userPassword
   by tls_ssf=128 ssf=128 self write
   by tls_ssf=128 ssf=128 anonymous auth
   by tls_ssf=128 ssf=128
group/groupOfURLs/Member="cn=ops,ou=Groups,dc=example,dc=com" write
   by tls_ssf=128 ssf=128 * compare
access to *
   by tls_ssf=128 ssf=128 self write
   by tls_ssf=128 ssf=128
group/groupOfURLs/Member="cn=ops,ou=Groups,dc=example,dc=com" write
   by tls_ssf=128 ssf=128 * read

# ProxyCache
overlay               pcache
proxycache            hdb 500000 1 5000 86400
directory             /var/lib/ldap/proxycache

index                 cn                                                eq
index                 departmentName                                    eq
index                 entryCSN                                          eq
index                 entryUUID                                         eq
index                 gidNumber                                         eq
index                 mail                                              eq
index                 member                                            eq
index                 memberUid                                         eq
index                 objectClass                                       eq
index                 pcacheQueryid                                     eq
index                 uid                                               eq
index                 uidNumber                                         eq
index                 uniqueMember                                      eq

proxycachequeries     1000000
proxyattrset          0 apple-user-homeDirectory blogCategory cn dateCreated
departmentName departmentNumber description
displayColor employeeNumber gecos getsPages gidNumber givenName homeDirectory
htaccessPasswd isAvailable isPhoneOperator
lastAdminVisit loginShell mail manager member memberUid mobile mobileEmail
numTickets objectClass ou phoneExtension sn
sortOrder uid uidNumber uniqueMember userPassword

proxytemplate        (blogCategory=)                                    0 86400
proxytemplate        (cn=)                                              0 86400
proxytemplate        (dateCreated=)                                     0 86400
proxytemplate        (departmentName=)                                  0 86400
proxytemplate        (departmentNumber=)                                0 86400
proxytemplate        (description=)                                     0 86400
proxytemplate        (displayColor=)                                    0 86400
proxytemplate        (employeeNumber=)                                  0 86400
proxytemplate        (gecos=)                                           0 86400
proxytemplate        (getsPages=)                                       0 86400
proxytemplate        (gidNumber=)                                       0 86400
proxytemplate        (givenName=)                                       0 86400
proxytemplate        (homeDirectory=)                                   0 86400
proxytemplate        (apple-user-homeDirectory=)                        0 86400
proxytemplate        (htaccessPasswd=)                                  0 86400
proxytemplate        (isAvailable=)                                     0 86400
proxytemplate        (isPhoneOperator=)                                 0 86400
proxytemplate        (lastAdminVisit=)                                  0 86400
proxytemplate        (loginShell=)                                      0 86400
proxytemplate        (mail=)                                            0 86400
proxytemplate        (manager=)                                         0 86400
proxytemplate        (member=)                                          0 86400
proxytemplate        (memberUid=)                                       0 86400
proxytemplate        (memberURL=)                                       0 86400
proxytemplate        (mobile=)                                          0 86400
proxytemplate        (mobileEmail=)                                     0 86400
proxytemplate        (numTickets=)                                      0 86400
proxytemplate        (objectClass=)                                     0 86400
proxytemplate        (ou=)                                              0 86400
proxytemplate        (phoneExtension=)                                  0 86400
proxytemplate        (sn=)                                              0 86400
proxytemplate        (sortOrder=)                                       0 86400
proxytemplate        (uid=)                                             0 86400
proxytemplate        (uidNumber=)                                       0 86400
proxytemplate        (uniqueMember=)                                    0 86400
proxytemplate        (|(memberUid=)(member=))                           0 86400
proxytemplate        (|(memberUid=)(uniqueMember=))                     0 86400
proxytemplate        (&(objectClass=)(uid=))                            0 86400
proxytemplate        (&(objectClass=)(memberUid=))                      0 86400
proxytemplate        (&(objectClass=)(uniqueMember=))                   0 86400
proxytemplate        (&(objectClass=)(uidNumber=))                      0 86400
proxytemplate        (&(objectClass=)(gidNumber=))                      0 86400
proxytemplate        (&(objectClass=)(|(memberUid=)(member=)))          0 86400
proxytemplate        (&(objectClass=)(|(memberUid=)(uniqueMember=)))    0 86400
proxytemplate        (&(objectClass=)(member=))                         0 86400
proxytemplate        (&(objectClass=)(cn=))                             0 86400
proxytemplate        (&(|(objectClass=)(objectClass=))(uid=))           0 86400


## example of the failures using ldapsearch

bash:~# for i in `seq 1 14`; do echo "PROCESSING jdoe$i"; ldapsearch -x -H
ldaps://localhost -LLL -b ou=Users,dc=example,dc=com
'(&(|(objectClass=examplecomEmployee)(objectClass=examplecomUtilityUser))(uid=jdoe'$i'))'
uid; sleep 1; done
PROCESSING jdoe1
dn: uid=jdoe1,ou=Users,dc=example,dc=com
uid: jdoe1

PROCESSING jdoe2
dn: uid=jdoe2,ou=Users,dc=example,dc=com
uid: jdoe2

PROCESSING jdoe3
dn: uid=jdoe3,ou=Users,dc=example,dc=com
uid: jdoe3

PROCESSING jdoe4
dn: uid=jdoe4,ou=Users,dc=example,dc=com
uid: jdoe4

PROCESSING jdoe5
PROCESSING jdoe6
PROCESSING jdoe7
PROCESSING jdoe8
PROCESSING jdoe9
PROCESSING jdoe10
PROCESSING jdoe11
PROCESSING jdoe12
PROCESSING jdoe13
PROCESSING jdoe14
bash:~#


A log file (with log level set to 16383) showing what happens when the cache is
queried and it responds with "QUERY ANSWERABLE", and then fails to read data
from any of the indices referenced, can be found at
ftp://ftp.openldap.org/incoming/ryan-steele-110215.proxycache-failure.log.  It
seems similar to ITS#6242, but my version of pcache.c, at least in the 2.4.21
and 2.4.23 versions of OpenLDAP, definitely contain that patch, as I can see it
in the source (the manageDSAit control).  Please let me know if you need any
other information to debug this problem (e.g., specific variables from a
debugger run, copy of a proxycache database experiencing the problem, etc.)
Comment 1 ryans@aweber.com 2011-02-15 20:40:11 UTC
This may also be related to ITS#6217 (http://www.openldap.org/its/index.cgi?findid=6217).  In fact, I'm pretty sure it
is.  In response to the last question in that ITS posed by Howard Chu (Followup 18), the schema on the proxy servers
match the schema available on the primary servers exactly.  If not, it wouldn't work for some users and not others
simultaneously (that is, users who have the same objectClasses and attributes, and differ only in the user-specific
attribute values like uid, gecos, phone extension, etc.).  Hopefully the data and logs I was able to provide (and any
debugger output/cache databases that might be requested of me as the ITS moves forward) will help to fill in some of the
blanks.

Comment 2 ryans@aweber.com 2011-02-17 15:38:34 UTC
I believe this may be the same issue as in #6217, given this behavior with two users which have the same objectclasses,
same attributes, and differ only in the attribute values (e.g., their name, phone number, and so on):


The initial query, where one entry doesn't get returned as it should:
=====================================================================
bash:~# ldapsearch -x -H ldaps://localhost -LLL -b ou=Users,dc=example,dc=com
'(&(|(objectClass=examplecomEmployee)(objectClass=examplecomUtilityUser))(uid=jdoe1))'
uid
dn: uid=jdoe1,ou=Users,dc=example,dc=com
uid: jdoe1

bash:~# ldapsearch -x -H ldaps://localhost -LLL -b ou=Users,dc=example,dc=com
'(&(|(objectClass=examplecomEmployee)(objectClass=examplecomUtilityUser))(uid=jdoe2))'
uid


If I reverse the filter, I get the expected results:
=====================================================================
bash:~# ldapsearch -x -H ldaps://localhost -LLL -b ou=Users,dc=example,dc=com
'(&(|(objectClass=examplecomUtilityUser)(objectClass=examplecomEmployee))(uid=jdoe1))'
uid
dn: uid=jdoe1,ou=Users,dc=example,dc=com
uid: jdoe1

bash:~# ldapsearch -x -H ldaps://localhost -LLL -b ou=Users,dc=example,dc=com
'(&(|(objectClass=examplecomUtilityUser)(objectClass=examplecomEmployee))(uid=jdoe2))'
uid
dn: uid=jdoe2,ou=Users,dc=example,dc=com
uid: jdoe2


Furthermore, if I now try the first query again, it too works:
=====================================================================
bash:~# ldapsearch -x -H ldaps://localhost -LLL -b ou=Users,dc=example,dc=com
'(&(|(objectClass=examplecomEmployee)(objectClass=examplecomUtilityUser))(uid=jdoe1))'
uid
dn: uid=jdoe1,ou=Users,dc=example,dc=com
uid: jdoe1

bash:~# ldapsearch -x -H ldaps://localhost -LLL -b ou=Users,dc=example,dc=com
'(&(|(objectClass=examplecomEmployee)(objectClass=examplecomUtilityUser))(uid=jdoe2))'
uid
dn: uid=jdoe2,ou=Users,dc=example,dc=com
uid: jdoe2


I already have the patch for ITS#5756 applied in all the versions I've tested (2.4.17, 2.4.21,
and 2.4.23).  I can provide a sample database which exhibits this behavior if needed, as I am working on providing a
self-contained sanitized example already.  If the developers agree that the issue described in these two ITS's are one
and the same, I'm fine with marking this bug as a duplicate of ITS#6217 to avoid fragmentation of debugging efforts.

Comment 3 ryans@aweber.com 2011-02-21 16:31:37 UTC
I just uploaded two log file snippets for consideration.  One is a failed query to the proxycache database exhibiting
this problem, and the other is a successful query to the proxycache database.  A gzipp'ed tarball can be found here:
ftp://ftp.openldap.org/incoming/ryan-steele-110221.proxycache-logs-for-comparison.tar.gz.  Up until the point where the
unsuccessful query fails to read the key for the UID equality candidate, both the failed query and the successful query
are locking the same exact QC index (0x7f50429860f0), trying to fetch the exact same keys ([b49d1940], [807e234d],
[8f951c4a]), and comparing the same equality/filter/list candidates ("id=9, first=2699, last=2731").  This is what I
would expect, since the values for the two objectClasses in the search filter are the same.

However, once it gets to the point where it compares the bdb_equality_candidates for the UID, the values of the
bdb_filter_candidates, bdb_equality_candidates, and bdb_list_candidates in logs for the unsuccessful query are a little
confusing.   I gather that a value of "0" for id/first/last is normal when it cannot find the index in question (as it
does for the UID value that it said was "ANSWERABLE" but was unable to find), but the "first=2" and "first=2699" values
seem a little strange, especially since the "first=2699" value supposedly matches the bdb_equality_candidates value
found at one of the indices for the objectClass searches.  I have no idea where the value "2" comes from, but it too
seems out of place if it couldn't find the index.  Perhaps those values make more sense in the context where it thought
it *could* find the values, but then wasn't able to (though I admit that sounds like a pretty big stretch).

## UID portion of unsuccessful query

Feb 21 10:02:00 blog2 slapd[18797]: bdb_idl_fetch_key: [8f951c4a]
Feb 21 10:02:00 blog2 slapd[18797]: <= bdb_index_read 9 candidates
Feb 21 10:02:00 blog2 slapd[18797]: <= bdb_equality_candidates: id=9, first=2699, last=2731
Feb 21 10:02:00 blog2 slapd[18797]: <= bdb_filter_candidates: id=9 first=2699 last=2731
Feb 21 10:02:00 blog2 slapd[18797]: <= bdb_list_candidates: id=9 first=2699 last=2731
Feb 21 10:02:00 blog2 slapd[18797]: <= bdb_filter_candidates: id=9 first=2699 last=2731
Feb 21 10:02:00 blog2 slapd[18797]: => bdb_filter_candidates
Feb 21 10:02:00 blog2 slapd[18797]:     EQUALITY
Feb 21 10:02:00 blog2 slapd[18797]: => bdb_equality_candidates (uid)
Feb 21 10:02:00 blog2 slapd[18797]: => key_read
Feb 21 10:02:00 blog2 slapd[18797]: bdb_idl_fetch_key: [efb8b856]
Feb 21 10:02:00 blog2 slapd[18797]: <= bdb_equality_candidates: id=0, first=0, last=0

Feb 21 10:02:00 blog2 slapd[18797]: <= bdb_filter_candidates: id=0 first=0 last=0

Feb 21 10:02:00 blog2 slapd[18797]: <= bdb_list_candidates: id=0 first=2699 last=0

Feb 21 10:02:00 blog2 slapd[18797]: <= bdb_filter_candidates: id=0 first=2699 last=0

Feb 21 10:02:00 blog2 slapd[18797]: <= bdb_list_candidates: id=0 first=0 last=0

Feb 21 10:02:00 blog2 slapd[18797]: <= bdb_filter_candidates: id=0 first=0 last=0

Feb 21 10:02:00 blog2 slapd[18797]: <= bdb_list_candidates: id=0 first=2 last=0

Feb 21 10:02:00 blog2 slapd[18797]: <= bdb_filter_candidates: id=0 first=2 last=0

Feb 21 10:02:00 blog2 slapd[18797]: bdb_search_candidates: id=0 first=2 last=0

Feb 21 10:02:00 blog2 slapd[18797]: hdb_search: no candidates


## UID portion of successful query

Feb 21 10:06:22 blog2 slapd[18797]: bdb_idl_fetch_key: [8f951c4a]
Feb 21 10:06:22 blog2 slapd[18797]: <= bdb_index_read 9 candidates
Feb 21 10:06:22 blog2 slapd[18797]: <= bdb_equality_candidates: id=9, first=2699, last=2731
Feb 21 10:06:22 blog2 slapd[18797]: <= bdb_filter_candidates: id=9 first=2699 last=2731
Feb 21 10:06:22 blog2 slapd[18797]: <= bdb_list_candidates: id=9 first=2699 last=2731
Feb 21 10:06:22 blog2 slapd[18797]: <= bdb_filter_candidates: id=9 first=2699 last=2731
Feb 21 10:06:22 blog2 slapd[18797]: => bdb_filter_candidates
Feb 21 10:06:22 blog2 slapd[18797]:     EQUALITY
Feb 21 10:06:22 blog2 slapd[18797]: => bdb_equality_candidates (uid)
Feb 21 10:06:22 blog2 slapd[18797]: => key_read
Feb 21 10:06:22 blog2 slapd[18797]: bdb_idl_fetch_key: [ea38ed84]
Feb 21 10:06:22 blog2 slapd[18797]: <= bdb_index_read 1 candidates
Feb 21 10:06:22 blog2 slapd[18797]: <= bdb_equality_candidates: id=1, first=2701, last=2701

Feb 21 10:06:22 blog2 slapd[18797]: <= bdb_filter_candidates: id=1 first=2701 last=2701

Feb 21 10:06:22 blog2 slapd[18797]: <= bdb_list_candidates: id=1 first=2701 last=2701

Feb 21 10:06:22 blog2 slapd[18797]: <= bdb_filter_candidates: id=1 first=2701 last=2701

Feb 21 10:06:22 blog2 slapd[18797]: <= bdb_list_candidates: id=1 first=2701 last=2701

Feb 21 10:06:22 blog2 slapd[18797]: <= bdb_filter_candidates: id=1 first=2701 last=2701

Feb 21 10:06:22 blog2 slapd[18797]: <= bdb_list_candidates: id=1 first=2701 last=2701

Feb 21 10:06:22 blog2 slapd[18797]: <= bdb_filter_candidates: id=1 first=2701 last=2701

Feb 21 10:06:22 blog2 slapd[18797]: bdb_search_candidates: id=1 first=2701 last=2701

Feb 21 10:06:22 blog2 slapd[18797]: => test_filter

I unfortunately don't have enough knowledge about what's going here, or where things go awry in the codebase, to
abstract something more meaningful from these logs, but hopefully this provides a little bit of framing/contextual
information to someone else that does.

Comment 4 ryans@aweber.com 2011-02-21 16:53:58 UTC
Additionally, today I found myself in a situation where reversing the order of *just* the objectClasses in the filter -
i.e., from (&(|(objectClass=foo)(objectClass=bar))(uid=jdoe1)) to (&(|(objectClass=bar)(objectClass=foo))(uid=jdoe1)) -
didn't fix things, but changing the search base (from ou=Users,dc=example,dc=com to dc=example,dc=com) did.

I was then able to issue the original search with searchbase ou=Users,dc=example,dc=com and filter
(&(|(objectClass=foo)(objectClass=bar))(uid=jdoe1)) and it worked fine.

Comment 5 ryans@aweber.com 2011-02-21 20:12:46 UTC
Some more information: once I changed the search base from ou=Users,dc=example,dc=com to dc=example,dc=com to "fix" the
problem for that particular user (after which the original query with the ou=Users,dc=example,dc=com search base began
working again), the following entries popped up in the log:
ftp://ftp.openldap.org/incoming/ryan-steele-110221.proxycache-log-fixing-broken-entries.log.1

It gets stranger, though.  If I start out by using the root entry as the search base (dc=example,dc=com) with a user who
is not yet in the cache, not only will it say it is ANSWERABLE (which it clearly isn't) and then return nothing without
even trying to reach out to the upstream host, but if I then change the search base to ou=Users,dc=example,dc=com, I get
the following error on the CLI (and this is with a successful bind as the directory admin):


ldapsearch -x -D cn=admin,dc=example,dc=com -y /etc/ldap.secret -H ldaps://localhost -LLL -b cn=admin,dc=example,dc=com
'(&(|(objectClass=examplecomUtilityUser)(objectClass=examplecomEmployee))(uid=jdoe4))' uid
No such object (32)
Matched DN: dc=example,dc=com

Debugging logs generated by that query can be found here:
ftp://ftp.openldap.org/incoming/ryan-steele-110221.proxycache-log-attempt-to-fix-broken-entries-error.log.1

Comment 6 Howard Chu 2011-06-04 02:14:48 UTC
ryans@aweber.com wrote:
> Some more information: once I changed the search base from ou=Users,dc=example,dc=com to dc=example,dc=com to "fix" the
> problem for that particular user (after which the original query with the ou=Users,dc=example,dc=com search base began
> working again), the following entries popped up in the log:
> ftp://ftp.openldap.org/incoming/ryan-steele-110221.proxycache-log-fixing-broken-entries.log.1
>
> It gets stranger, though.  If I start out by using the root entry as the search base (dc=example,dc=com) with a user who
> is not yet in the cache, not only will it say it is ANSWERABLE (which it clearly isn't) and then return nothing without
> even trying to reach out to the upstream host, but if I then change the search base to ou=Users,dc=example,dc=com, I get
> the following error on the CLI (and this is with a successful bind as the directory admin):
>
>
> ldapsearch -x -D cn=admin,dc=example,dc=com -y /etc/ldap.secret -H ldaps://localhost -LLL -b cn=admin,dc=example,dc=com
> '(&(|(objectClass=examplecomUtilityUser)(objectClass=examplecomEmployee))(uid=jdoe4))' uid
> No such object (32)
> Matched DN: dc=example,dc=com
>
> Debugging logs generated by that query can be found here:
> ftp://ftp.openldap.org/incoming/ryan-steele-110221.proxycache-log-attempt-to-fix-broken-entries-error.log.1

I'm not sure I reproduced all the behaviors you were seeing, but I've 
reproduced at least one of them and fixed it in git. Please pull the latest 
pcache.c and test, thanks.

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Comment 7 Howard Chu 2011-06-04 02:15:37 UTC
changed state Open to Test
Comment 8 Howard Chu 2011-06-04 02:16:07 UTC
changed notes
moved from Incoming to Software Bugs
Comment 9 Quanah Gibson-Mount 2011-06-08 21:20:20 UTC
changed notes
changed state Test to Release
Comment 10 Quanah Gibson-Mount 2011-07-18 19:52:34 UTC
changed notes
changed state Release to Closed
Comment 11 OpenLDAP project 2014-08-01 21:04:36 UTC
fixed in HEAD
fixed in RE24