
Re: Ldap simple bind problems on slaves during network outage (chaining)

To: openldap-technical@openldap.org
Subject: Re: Ldap simple bind problems on slaves during network outage (chaining)
From: Christian Kratzer <ck-lists@cksoft.de>
Date: Tue, 3 Dec 2013 18:15:29 +0100 (CET)
In-reply-to: <alpine.BSF.2.00.1312031012590.43253@pohjola.cksoft.de>
References: <alpine.BSF.2.00.1312031012590.43253@pohjola.cksoft.de>
User-agent: Alpine 2.00 (BSF 1167 2008-08-23)

Hi,

On Tue, 3 Dec 2013, Christian Kratzer wrote:

Hi,
we are currently chasing a strange issue at a customer's site where the LDAP slaves become unresponsive when network connectivity to the master LDAP servers and the DNS servers is lost.
They have a setup of two masters and two slaves at separate sites. There is a load balancer sitting in front of the slaves that performs regular health checks consisting of binds followed by a search for their bind DN.



It seems that this is due to ldap chaining from slave to master running without a timeout and eventually blocking all of slapd.

We use referrals and chaining for slapo-ppolicy and slapo-lastbind (with replication patch from ITS#7721).

I tried to resolve this using olcDbKeepalive and olcDbNetworkTimeout but have not been successful yet.

This is how the chaining backend is configured on the slaves in our lab:

olcDatabase={1}ldap,olcOverlay={0}chain,olcDatabase={-1}frontend,cn=config
objectClass: olcLDAPConfig
objectClass: olcChainDatabase
olcDatabase: {1}ldap
olcDbURI: "ldap://ldaptest0.example.org"
olcDbStartTLS: start   starttls=no  tls_cert="/usr/local/etc/openldap/certs/server.cert"  tls_key="/usr/local/etc/openldap/certs/server.key"  tls_cacert="/usr/local/etc/openldap/certs/ca.cert"  tls_reqcert=demand  tls_crlcheck=none
olcDbIDAssertBind: mode=self  flags=prescriptive,proxy-authz-non-critical  bindmethod=simple  binddn="cn=chain,ou=system,dc=de,dc=telefonica,dc=com"  credentials="XXXXXXXXXXX"  keepalive=60:6:10  tls_cert="/usr/local/etc/openldap/certs/server.cert"  tls_key="/usr/local/etc/openldap/certs/server.key"  tls_cacert="/usr/local/etc/openldap/certs/ca.cert" tls_reqcert=demand  tls_crlcheck=none
olcDbRebindAsUser: FALSE
olcDbChaseReferrals: TRUE
olcDbTFSupport: no
olcDbProxyWhoAmI: FALSE
olcDbProtocolVersion: 3
olcDbSingleConn: FALSE
olcDbCancel: abandon
olcDbUseTemporaryConn: FALSE
olcDbConnectionPoolMax: 16
olcDbSessionTrackingRequest: FALSE
olcDbNoRefs: FALSE
olcDbNoUndefFilter: FALSE
olcDbOnErr: continue
olcDbKeepalive: 60:6:10
olcDbNetworkTimeout: 3
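
For reference, a rough sketch of what the keepalive value 60:6:10 above implies, assuming the idle:probes:interval meaning described in slapd-ldap(5). The function name is made up for illustration. It only estimates how long an idle, already established connection to the master can stay dead before the kernel gives up on it; it says nothing about new connection attempts, which is what olcDbNetworkTimeout covers.

```python
def keepalive_detection_seconds(spec: str) -> int:
    """Worst-case seconds before an idle dead peer is detected.

    Assumes spec is "<idle>:<probes>:<interval>" as in olcDbKeepalive.
    """
    idle, probes, interval = (int(part) for part in spec.split(":"))
    # The first probe is sent after `idle` seconds of silence; the
    # connection is dropped once `probes` probes, sent `interval`
    # seconds apart, have all gone unanswered.
    return idle + probes * interval


print(keepalive_detection_seconds("60:6:10"))  # 120
```

So with these values a dead idle connection may go unnoticed for up to about two minutes, which is much longer than a typical load balancer health check interval.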

Any pointers on what we should change to allow quick detection of unreachable olcDbURI ?
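
To measure from a slave, independently of slapd, how quickly an unreachable master is noticed, a plain TCP connect with a timeout could be used. This is only a sketch: the function name is hypothetical, port 389 is assumed, and it checks reachability of the port, not LDAP itself. The name lookup done inside create_connection is not bounded by the timeout argument, so with unreachable DNS servers the call can still block at that step.

```python
import socket


def tcp_reachable(host: str, port: int = 389, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, connect timeouts and failed lookups.
        return False


# Example call (host taken from olcDbURI above):
#   tcp_reachable("ldaptest0.example.org")
```

Timing the call once with a host name and once with the IP address should show whether the delay comes from the connect or from DNS resolution.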


Greetings
Christian

During regular operations the load balancer's health checks look as follows [1]:

 Dec  2 14:38:05 ldap slapd[57585]: conn=3924716 fd=36 ACCEPT from IP=192.0.2.189:33852 (IP=192.0.2.129:389)
 Dec  2 14:38:05 ldap slapd[57585]: conn=3924716 op=0 BIND dn="cn=keepalive-check-lb,ou=system,dc=example,dc=org" method=128
 Dec  2 14:38:05 ldap slapd[57585]: conn=3924716 op=0 BIND dn="cn=keepalive-check-lb,ou=system,dc=example,dc=org" mech=SIMPLE ssf=0
 Dec  2 14:38:05 ldap slapd[57585]: conn=3924716 op=0 RESULT tag=97 err=0 text=
 Dec  2 14:38:05 ldap slapd[57585]: conn=3924716 op=1 SRCH base="ou=system,dc=example,dc=org" scope=1 deref=0 filter="(cn=keepalive-check-lb)"
 Dec  2 14:38:05 ldap slapd[57585]: conn=3924716 op=1 ENTRY dn="cn=keepalive-check-lb,ou=system,dc=example,dc=org"
 Dec  2 14:38:05 ldap slapd[57585]: conn=3924716 op=1 SEARCH RESULT tag=101 err=0 nentries=1 text=
 Dec  2 14:38:05 ldap slapd[57585]: conn=3924716 op=2 UNBIND
 Dec  2 14:38:05 ldap slapd[57585]: connection_closing: readying conn=3924716 sd=36 for close
 Dec  2 14:38:05 ldap slapd[57585]: connection_resched: attempting closing conn=3924716 sd=36
 Dec  2 14:38:05 ldap slapd[57585]: conn=3924716 fd=36 closed

When they experience a network outage separating the slaves from the masters and the DNS servers, the load balancers are not able to bind to the slaves:

 Dec  2 14:38:50 ldap slapd[57585]: conn=3924725 fd=44 ACCEPT from IP=192.0.2.188:35761 (IP=192.0.2.129:389)
 Dec  2 14:38:50 ldap slapd[57585]: connection_closing: readying conn=3924725 sd=44 for close
 Dec  2 14:38:50 ldap slapd[57585]: connection_close: deferring conn=3924725 sd=44
 Dec  2 14:38:50 ldap slapd[57585]: conn=3924725 op=0 BIND dn="cn=keepalive-check-lb,ou=system,dc=example,dc=org" method=128
 Dec  2 14:38:50 ldap slapd[57585]: conn=3924725 op=0 BIND dn="cn=keepalive-check-lb,ou=system,dc=example,dc=org" mech=SIMPLE ssf=0
 Dec  2 14:38:50 ldap slapd[57585]: connection_resched: attempting closing conn=3924725 sd=44
 Dec  2 14:38:50 ldap slapd[57585]: conn=3924725 fd=44 closed (connection lost)

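
To find all affected connections in a larger log, something along these lines might help. It is a minimal sketch that assumes one syslog entry per line in the usual slapd Stats format; the function name and the sample list are made up for illustration. It reports connections that logged a BIND but never a RESULT for the same operation.

```python
import re
from collections import defaultdict

# Matches "conn=N op=M BIND ..." and "conn=N op=M RESULT ...",
# but not "SEARCH RESULT" or "UNBIND".
OP_RE = re.compile(r"conn=(\d+) op=(\d+) (BIND|RESULT)\b")


def unanswered_binds(lines):
    """Return sorted conn ids with a BIND that has no matching RESULT."""
    seen = defaultdict(set)  # (conn, op) -> set of event names
    for line in lines:
        m = OP_RE.search(line)
        if m:
            seen[(int(m.group(1)), int(m.group(2)))].add(m.group(3))
    return sorted({conn for (conn, _op), events in seen.items()
                   if "BIND" in events and "RESULT" not in events})


sample = [
    'Dec  2 14:38:05 ldap slapd[57585]: conn=3924716 op=0 BIND dn="cn=keepalive-check-lb,ou=system,dc=example,dc=org" method=128',
    'Dec  2 14:38:05 ldap slapd[57585]: conn=3924716 op=0 RESULT tag=97 err=0 text=',
    'Dec  2 14:38:50 ldap slapd[57585]: conn=3924725 op=0 BIND dn="cn=keepalive-check-lb,ou=system,dc=example,dc=org" method=128',
    'Dec  2 14:38:50 ldap slapd[57585]: conn=3924725 fd=44 closed (connection lost)',
]
print(unanswered_binds(sample))  # [3924725]
```

Comparing the timestamps of those connections with the outage window would show whether every bind during the outage fails or only some of them.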
We have not been able to reproduce this problem in a lab setup which is supposed to be identical to the production setup. It does not seem to be related to the servers not being able to perform reverse mapping on the client IPs. We run a mixture of 2.4.35 and 2.4.38 on CentOS 6.4. In the lab the slaves are able to perform queries just fine without connectivity to the masters or to their DNS servers.
The servers are currently running with the following loglevel:

 dn: cn=config
 olcLogLevel: Conns
 olcLogLevel: Stats
 olcLogLevel: Stats2
 olcLogLevel: Sync
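
The four levels above correspond to a single numeric mask. A small sketch of the arithmetic, using the level values documented in slapd-config(5); the dictionary lists only the levels used here and the function name is made up.

```python
# Numeric values of the log levels configured above (slapd-config(5)).
LEVELS = {"conns": 8, "stats": 256, "stats2": 512, "sync": 16384}


def loglevel_mask(names):
    """OR together the numeric values of the given level names."""
    mask = 0
    for name in names:
        mask |= LEVELS[name.lower()]
    return mask


print(loglevel_mask(["Conns", "Stats", "Stats2", "Sync"]))  # 17160
```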
It seems we only get to the point where the bind credentials are parsed, after which the connection is closed.
This could of course be a problem with the load balancer prematurely closing the connection.
I am trying to eliminate any causes in the LDAP servers.

Any ideas on how to debug this or where we could look?

Greetings
Christian

[1] DNS names and IPs obfuscated to protect the customer


--
Christian Kratzer                      CK Software GmbH
Email:   ck@cksoft.de                  Wildberger Weg 24/2
Phone:   +49 7032 893 997 - 0          D-71126 Gaeufelden
Fax:     +49 7032 893 997 - 9          HRB 245288, Amtsgericht Stuttgart
Web:     http://www.cksoft.de/         Geschaeftsfuehrer: Christian Kratzer

Follow-Ups:
- Re: Ldap simple bind problems on slaves during network outage (chaining)
  - From: Michael Ströder <michael@stroeder.com>

References:
- Ldap simple bind problems on slaves during network outage
  - From: Christian Kratzer <ck-lists@cksoft.de>
