
(ITS#9098) assert fails in meta_back_search in some cases after reconnect



Full_Name: Maxime Besson
Version: 2.4.47
OS: Debian Jessie
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (2a01:cb00:802:8400:2cbe:3c60:fca6:e50b)


I am running a meta-directory with the following database configuration
(OpenLDAP 2.4.47, LTB build on Ubuntu 16.04):

dn: olcDatabase={1}meta,cn=config
objectClass: olcMetaConfig
objectClass: olcDatabaseConfig
objectClass: olcConfig
objectClass: top
olcDatabase: {1}meta
olcSuffix: dc=com
olcAccess: {0}to * by * read
olcRootDN: cn=admin,dc=com

dn: olcMetaSub={0}uri,olcDatabase={1}meta,cn=config
objectClass: olcMetaTargetConfig
objectClass: olcConfig
objectClass: top
olcMetaSub: {0}uri
olcDbURI: ldap://1.2.3.4/dc=example,dc=com
olcDbIDAssertBind: mode=legacy flags=non-prescriptive,proxy-authz-non-critical
  bindmethod=simple binddn="cn=admin,dc=example,dc=com" credentials="XXXXX"
olcDbTimeout: 5
olcDbNetworkTimeout: 3
olcDbNretries: never
olcDbRebindAsUser: true

... 

(There are 8 backends in total)


Timeouts were added to keep a single unavailable server from blocking OpenLDAP
completely. However, since I added them, the slapd process has been crashing
every now and then (at intervals ranging from a couple of hours to a couple of
days), usually during short network interruptions that affect all backends: I
see plenty of reconnect logs shortly before the crashes.

The crash is always immediately preceded by the following log message:

meta_search_dobind_init[{i}]: retrying URI="{url}" DN="{DN}"

{i} is never the same, and {url} and {DN} are the correct settings for backend
i.

The crash itself is an ABRT at the following assert in back-meta/search.c
(line 1957):

	assert( candidates[ i ].sr_msgid >= 0
		|| candidates[ i ].sr_msgid == META_MSGID_CONNECTING );


I have analyzed several core dumps, and found that every single time slapd
crashes, sr_msgid has a value of -1 (META_MSGID_IGNORE), which indeed causes the
assert to fail. 
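
To make the failure concrete, the condition inside the assert can be evaluated
on its own. This is only a minimal sketch, not the back-meta code: the value -1
for META_MSGID_IGNORE is the one seen in the core dumps, the value -3 for
META_MSGID_CONNECTING is an assumption that should be checked against
back-meta.h in the 2.4.47 tree, and the helper name msgid_passes_assert is made
up for illustration.

```c
#include <stdbool.h>

/* -1 is the value observed in the core dumps. */
#define META_MSGID_IGNORE      (-1)
/* ASSUMPTION: verify against back-meta.h in the 2.4.47 source tree. */
#define META_MSGID_CONNECTING  (-3)

/* Hypothetical helper that mirrors the condition inside the assert. */
bool msgid_passes_assert(int sr_msgid)
{
    return sr_msgid >= 0 || sr_msgid == META_MSGID_CONNECTING;
}
```

Under these assumptions, msgid_passes_assert(META_MSGID_IGNORE) is false, so
any candidate that reaches this point with sr_msgid == -1 aborts the process.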

I found that candidates[ i ].sr_flags has a value of 3 (META_CANDIDATE |
META_BINDING).

The msc_mscflags values in mc->mc_conns[ i ] are:

* 0x100081 for all connections before the one that triggers the crash
* 0x100010 for the candidate that crashes the server
* 0x100080 for all connections after it
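
The three flag words can be compared bit by bit without knowing the flag
names. The sketch below is generic bit arithmetic, not OpenLDAP code; the
function names set_bits and flag_diff are hypothetical, and mapping the bit
positions back to flag names requires the headers of the exact source tree.

```c
#include <stddef.h>

/*
 * Store the positions of the bits set in `flags` into `out`
 * (lowest bit first) and return how many were stored.
 * At most `max` positions are written.
 */
size_t set_bits(unsigned long flags, int out[], size_t max)
{
    size_t n = 0;

    for (int bit = 0; flags != 0 && n < max; bit++, flags >>= 1) {
        if (flags & 1UL)
            out[n++] = bit;
    }
    return n;
}

/* Return the bits that differ between two flag words. */
unsigned long flag_diff(unsigned long a, unsigned long b)
{
    return a ^ b;
}
```

Applied to the values above, 0x100081 has bits 0, 7 and 20 set, 0x100010 has
bits 4 and 20, and 0x100080 has bits 7 and 20. The connection that triggers the
crash therefore differs from the ones after it in bits 4 and 7
(flag_diff gives 0x90), and from the ones before it in bits 0, 4 and 7 (0x91).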


I am having trouble reproducing this in a test environment, but it happens
regularly in production. I have tried changing the timeouts, adding a
non-default bind timeout, and disabling retries (they were originally allowed),
but the crashes keep happening. Note that disabling retries (olcDbNretries:
never) still seems to lead to retries in meta_search_dobind_init, since the log
message is still there.

I cannot share the core dumps because of the sensitive information they contain.
However, I would gladly extract more information from them if it would help
solve this.