N-Way Multimaster with syncrepl and delta-repl -- slapd stops responding



Hi,

I am running a 4-way multi-master configuration with a number of slaves in remote locations. I am currently running OpenLDAP 2.4.33 on top of CentOS 6.3 (I built 2.4.33 from a modified base CentOS 6 spec file). I was originally running the CentOS base OpenLDAP 2.4.23 in N-way multimaster mode with the plain syncrepl configuration, but I was having problems keeping the masters and slaves in perfect sync. Other than that, 2.4.23 had been running stably since last spring. I'll try to be brief about what has happened since Feb 1.

* I upgraded the 4 masters to 2.4.33 and kept the syncrepl configuration. The syncrepl masters were using RefreshAndPersist while the slave consumers were using RefreshOnly.
* After the upgrade the 2.4.33 masters began locking up: they did not refuse connections, but they never returned query results. This would happen 3-4 times per day. When one master locked up, all the masters would lock up. The slaves did not appear to be affected by this.
* I downgraded back to 2.4.23 on all of the masters, only to have the lock-ups continue.
* I slapcat'ed the database on one master and blew away the databases on all the other masters and slaves and rebuilt everything. I rebuilt one master and one slave and rsync'ed the slapd.d directory where needed. Then I started each master one-by-one to validate that they mirrored the databases correctly. Then I repeated this on the slaves. Unfortunately the masters would continue to lock up as above.
* So, seeing that the lock-ups were occurring regardless of the openldap version I decided to go back to 2.4.33 and make the move to delta-replication.
* This past weekend I finally got delta-replication working. I did the slapcat / rebuild slapd.d / slapadd cycle on one master and rsync'ed slapd.d to each master one at a time. All was well and all databases were in perfect sync.
* Unfortunately the masters continued to lock up, accepting connections but never servicing requests, so all queries would hang.

Looking at this again today I noticed that my masters were all running at near 100% CPU but continuing to service queries. Depending on the number of CPUs, only one or two threads would be running this high. Using strace -tt -p <pid-of-thread>, this is what would be spewing out:

18:52:05.713266 sched_yield()           = 0
18:52:05.713323 sched_yield()           = 0
18:52:05.713380 sched_yield()           = 0
18:52:05.713438 sched_yield()           = 0
18:52:05.713495 sched_yield()           = 0
18:52:05.713553 sched_yield()           = 0
18:52:05.713611 sched_yield()           = 0
18:52:05.713668 sched_yield()           = 0
18:52:05.713726 sched_yield()           = 0
18:52:05.713783 sched_yield()           = 0
18:52:05.713840 sched_yield()           = 0
18:52:05.713898 sched_yield()           = 0

I haven't correlated this to the slapd daemons hanging, yet.
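
To put a number on the spin, the timestamps in the trace above can be differenced. The following is a minimal Python sketch, not something that was run on the affected servers; the names yield_intervals_us and STRACE_SAMPLE are made up for illustration, and the sample is copied from the strace output above.

```python
from datetime import datetime, timedelta

# Sample copied from the strace -tt output above.
STRACE_SAMPLE = """\
18:52:05.713266 sched_yield()           = 0
18:52:05.713323 sched_yield()           = 0
18:52:05.713380 sched_yield()           = 0
18:52:05.713438 sched_yield()           = 0
18:52:05.713495 sched_yield()           = 0
18:52:05.713553 sched_yield()           = 0
18:52:05.713611 sched_yield()           = 0
18:52:05.713668 sched_yield()           = 0
18:52:05.713726 sched_yield()           = 0
18:52:05.713783 sched_yield()           = 0
18:52:05.713840 sched_yield()           = 0
18:52:05.713898 sched_yield()           = 0
"""


def yield_intervals_us(strace_text):
    """Return the gaps, in whole microseconds, between successive sched_yield() calls."""
    stamps = []
    for line in strace_text.splitlines():
        parts = line.split(None, 1)  # "<HH:MM:SS.ffffff> <syscall ...>"
        if len(parts) == 2 and parts[1].startswith("sched_yield()"):
            stamps.append(datetime.strptime(parts[0], "%H:%M:%S.%f"))
    one_us = timedelta(microseconds=1)
    return [(later - earlier) // one_us for earlier, later in zip(stamps, stamps[1:])]


intervals = yield_intervals_us(STRACE_SAMPLE)
mean_us = sum(intervals) / len(intervals)
calls_per_second = 1_000_000 / mean_us
print(f"{len(intervals)} intervals, mean {mean_us:.1f} us, ~{calls_per_second:.0f} calls/s")
```

On this sample every gap is 57 or 58 microseconds, which works out to roughly 17,000 sched_yield() calls per second from a single thread. That looks more like a busy-wait loop than a thread that is blocked on something.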

There is nothing interesting in the logs when the slapd daemons hang. Again, when one master hangs they all hang. I would restart each master one by one; on some occasions restarting one master was enough for the others to start servicing requests again, while other times it would take two or three restarts to get all of the masters servicing again. The only gain with delta-replication is that they now hang only once a day, and usually after I have gone home.

For now I have implemented a small script, run from cron every two minutes, that tests whether the slapd daemon is hung by doing a simple ldapsearch and restarts the daemon if it is. This is done on all four masters. My database is not large at all, with only ~100 users, but it is critical as it is the backend authentication for everything, including remote access.
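
For anyone wanting to do something similar, a watchdog along these lines could look like the sketch below. This is not the actual script; it is a minimal Python illustration under stated assumptions: the function names slapd_responds and check_and_restart are hypothetical, the init service is assumed to be called slapd (as on stock CentOS 6), the probe is an anonymous base search of the root DSE, and -ZZ is used because the configuration below sets olcSecurity: tls=1.

```python
import subprocess


def slapd_responds(uri="ldap://localhost", timeout_s=10, run=subprocess.run):
    """Return True if a base search of the root DSE completes within timeout_s seconds."""
    cmd = [
        "ldapsearch", "-x", "-ZZ", "-LLL",
        "-H", uri,
        "-s", "base", "-b", "",
        "-l", str(timeout_s),        # server-side search time limit
        "(objectClass=*)", "namingContexts",
    ]
    try:
        result = run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,       # a hung slapd accepts the connection but never answers
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def check_and_restart(probe=slapd_responds,
                      restart_cmd=("service", "slapd", "restart"),  # assumed service name
                      run=subprocess.run):
    """Restart slapd if the probe fails. Return True when a restart was issued."""
    if probe():
        return False
    run(list(restart_cmd), check=False)
    return True
```

Saved as, say, a hypothetical /usr/local/sbin/slapd_watchdog.py that ends with a call to check_and_restart(), it could be run from a crontab entry such as `*/2 * * * * /usr/bin/python /usr/local/sbin/slapd_watchdog.py`. The client-side timeout is the important part, since the hang described here shows up as a search that never returns rather than as a refused connection.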

Here is the slapcat of my cn=config database (minus the schemas and operational attributes). It is a fairly typical delta-replication configuration. The accesslog database uses hdb, as that is what most (all) of the accesslog examples show. The main database is bdb.

Any suggestions would be greatly appreciated.

Regards,
Bob
--bs

dn: cn=config
objectClass: olcGlobal
cn: config
olcConfigFile: slapd.conf
olcConfigDir: slapd.d
olcArgsFile: /var/run/openldap/slapd.args
olcAttributeOptions: lang-
olcAuthzPolicy: none
olcConcurrency: 0
olcConnMaxPendingAuth: 1000
olcGentleHUP: FALSE
olcIdleTimeout: 0
olcIndexSubstrIfMaxLen: 4
olcIndexSubstrIfMinLen: 2
olcIndexSubstrAnyLen: 4
olcIndexSubstrAnyStep: 2
olcIndexIntLen: 4
olcLocalSSF: 71
olcPidFile: /var/run/openldap/slapd.pid
olcReadOnly: FALSE
olcSaslSecProps: noplain,noanonymous
olcSecurity: tls=1
olcServerID: 1 ldap://auth1noc.man.o3b.local
olcServerID: 2 ldap://auth2noc.man.o3b.local
olcServerID: 3 ldap://auth1noc.btz.o3b.local
olcServerID: 4 ldap://auth2noc.btz.o3b.local
olcServerID: 5 ldap://auth1gw.nma.o3b.local
olcServerID: 6 ldap://auth2gw.nma.o3b.local
olcServerID: 7 ldap://auth1gw.sun.o3b.local
olcServerID: 8 ldap://auth2gw.sun.o3b.local
olcServerID: 9 ldap://auth1gw.per.o3b.local
olcServerID: 10 ldap://auth2gw.per.o3b.local
olcSockbufMaxIncoming: 262143
olcSockbufMaxIncomingAuth: 16777215
olcThreads: 16
olcTLSCipherSuite: HIGH:MEDIUM:SSLv2
olcTLSCertificateFile: /etc/openldap/cacerts/auth-o3b.crt
olcTLSCertificateKeyFile: /etc/openldap/cacerts/auth-o3b.key
olcTLSCRLCheck: none
olcToolThreads: 1
olcWriteTimeout: 0
olcTLSCACertificateFile: /etc/pki/tls/certs/o3b-master-ca.crt
olcTLSVerifyClient: never
olcLogLevel: sync
olcConnMaxPending: 101

dn: cn=module{0},cn=config
objectClass: olcModuleList
cn: module{0}
olcModulePath: /usr/lib64/openldap
olcModuleLoad: {0}syncprov.la
olcModuleLoad: {1}memberof.la
olcModuleLoad: {2}ppolicy.la
olcModuleLoad: {3}accesslog.la

dn: olcDatabase={-1}frontend,cn=config
objectClass: olcDatabaseConfig
objectClass: olcFrontendConfig
olcDatabase: {-1}frontend
olcAccess: {0}to dn.base=""  by * read
olcAccess: {1}to dn.subtree="cn=monitor"  by dn.base="cn=rootdn,dc=o3bnetworks
.net" read
olcAccess: {2}to dn.base="cn=subschema"  by * read
olcAddContentAcl: FALSE
olcLastMod: TRUE
olcMaxDerefDepth: 0
olcReadOnly: FALSE
olcSchemaDN: cn=Subschema
olcSecurity: tls=1
olcMonitoring: FALSE
olcPasswordHash: {SSHA}

dn: olcDatabase={0}config,cn=config
objectClass: olcDatabaseConfig
olcDatabase: {0}config
olcAccess: {0}to *  by dn.base="cn=rootdn,dc=o3bnetworks.net" write  by dn.bas
e="cn=syncdn,dc=o3bnetworks.net" read  by * none
olcAddContentAcl: TRUE
olcLastMod: TRUE
olcLimits: {0}dn.base="cn=rootdn,dc=o3bnetworks.net" size=unlimited  time=unli
mited
olcLimits: {1}dn.base="cn=syncdn,dc=o3bnetworks.net" size=unlimited  time=unli
mited
olcMaxDerefDepth: 15
olcReadOnly: FALSE
olcRootDN: cn=config
olcMirrorMode: TRUE
olcMonitoring: FALSE
olcRootPW:: ***
olcSyncrepl: {0}rid=001 provider=ldap://auth1noc.man.o3b.local bindmethod=simp
le binddn="cn=syncdn,dc=o3bnetworks.net" credentials="33jJ9nSkSD" keepalive=0
:5:0 starttls=yes tls_reqcert=allow tls_cipher_suite=HIGH:MEDIUM:SSLv2 search
base="cn=config" scope=sub schemachecking=off type=refreshAndPersist retry="5
  5 300 +" logbase="cn=accesslog" logfilter="(&(objectClass=auditWriteObject)(
reqResult=0))" syncdata=accesslog
olcSyncrepl: {1}rid=002 provider=ldap://auth2noc.man.o3b.local bindmethod=simp
le binddn="cn=syncdn,dc=o3bnetworks.net" credentials="33jJ9nSkSD" keepalive=0
:5:0 starttls=yes tls_reqcert=allow tls_cipher_suite=HIGH:MEDIUM:SSLv2 search
base="cn=config" scope=sub schemachecking=off type=refreshAndPersist retry="5
  5 300 +" logbase="cn=accesslog" logfilter="(&(objectClass=auditWriteObject)(
reqResult=0))" syncdata=accesslog
olcSyncrepl: {2}rid=003 provider=ldap://auth1noc.btz.o3b.local bindmethod=simp
le binddn="cn=syncdn,dc=o3bnetworks.net" credentials="33jJ9nSkSD" keepalive=0
:5:0 starttls=yes tls_reqcert=allow tls_cipher_suite=HIGH:MEDIUM:SSLv2 search
base="cn=config" scope=sub schemachecking=off type=refreshAndPersist retry="5
  5 300 +" logbase="cn=accesslog" logfilter="(&(objectClass=auditWriteObject)(
reqResult=0))" syncdata=accesslog
olcSyncrepl: {3}rid=004 provider=ldap://auth2noc.btz.o3b.local bindmethod=simp
le binddn="cn=syncdn,dc=o3bnetworks.net" credentials="33jJ9nSkSD" keepalive=0
:5:0 starttls=yes tls_reqcert=allow tls_cipher_suite=HIGH:MEDIUM:SSLv2 search
base="cn=config" scope=sub schemachecking=off type=refreshAndPersist retry="5
  5 300 +" logbase="cn=accesslog" logfilter="(&(objectClass=auditWriteObject)(
reqResult=0))" syncdata=accesslog

dn: olcOverlay={0}syncprov,olcDatabase={0}config,cn=config
objectClass: olcOverlayConfig
objectClass: olcSyncProvConfig
olcOverlay: {1}syncprov
olcSpCheckpoint: 1000 60

dn: olcOverlay={1}accesslog,olcDatabase={0}config,cn=config
objectClass: olcOverlayConfig
objectClass: olcAccessLogConfig
olcOverlay: {1}accesslog
olcAccessLogDB: cn=accesslog
olcAccessLogOps: writes
olcAccessLogSuccess: TRUE
olcAccessLogPurge: 2+00:00 1+00:00

dn: olcDatabase={1}hdb,cn=config
objectClass: olcDatabaseConfig
objectClass: olcConfig
objectClass: top
objectClass: olcHdbConfig
olcDbDirectory: /var/lib/ldap/accesslog
olcSuffix: cn=accesslog
olcDbConfig: [Deleted]
olcAddContentAcl: FALSE
olcDbCacheFree: 1
olcDbCacheSize: 1000
olcAccess: {0}to *  by self write  by dn.base="cn=rootdn,dc=o3bnetworks.net" r
ead by dn.base="cn=authdn,dc=o3bnetworks.net" read  by dn.base="cn=syncdn,dc=
o3bnetworks.net" read
olcDbDirtyRead: FALSE
olcDbIDLcacheSize: 0
olcDbDNcacheSize: 0
olcDbIndex: default eq
olcMaxDerefDepth: 15
olcLimits: {0}dn.base="cn=syncdn,dc=o3bnetworks.net" size=unlimited  time=unli
mited
olcDbSearchStack: 16
olcLastMod: TRUE
olcDbLinearIndex: FALSE
olcDbMode: 0600
olcDbNoSync: FALSE
olcDbShmKey: 0
olcReadOnly: FALSE
olcSecurity: tls=1
olcRootDN: cn=accesslogdn
olcDatabase: {1}hdb

dn: olcOverlay={0}syncprov,olcDatabase={1}hdb,cn=config
objectClass: olcOverlayConfig
objectClass: olcSyncProvConfig
olcOverlay: {0}syncprov
olcSpNoPresent: TRUE
olcSpReloadHint: TRUE

dn: olcDatabase={3}monitor,cn=config
objectClass: olcDatabaseConfig
olcAddContentAcl: FALSE
olcLastMod: TRUE
olcMaxDerefDepth: 15
olcReadOnly: FALSE
olcRootDN: cn=monitor,cn=Monitor
olcRootPW:: bW9uaXRvcg==
olcSecurity: tls=1
olcMonitoring: FALSE
olcDatabase: {3}monitor

dn: olcDatabase={3}bdb,cn=config
objectClass: olcDatabaseConfig
objectClass: olcBdbConfig
olcSuffix: dc=o3bnetworks.net
olcAddContentAcl: FALSE
olcLastMod: TRUE
olcLimits: {0}dn.base="cn=syncdn,dc=o3bnetworks.net" size=unlimited  time=unli
mited
olcMaxDerefDepth: 15
olcReadOnly: FALSE
olcRootDN: cn=rootdn,dc=o3bnetworks.net
olcRootPW:: ***
olcSecurity: tls=1
olcMirrorMode: TRUE
olcMonitoring: TRUE
olcDbDirectory: /var/lib/ldap
olcDbConfig: [Deleted]
olcDbNoSync: FALSE
olcDbDirtyRead: FALSE
olcDbIDLcacheSize: 0
olcDbIndex: objectClass pres,eq
olcDbIndex: cn pres,eq,sub
olcDbIndex: uid pres,eq,sub
olcDbIndex: uidNumber pres,eq
olcDbIndex: gidNumber pres,eq
olcDbIndex: memberUid pres,eq,sub
olcDbIndex: displayName pres,eq,sub
olcDbIndex: sambaSID pres,eq,sub
olcDbIndex: sambaDomainName pres,eq
olcDbIndex: sambaGroupType pres,eq
olcDbIndex: ou pres,eq,sub
olcDbIndex: sambaSIDList pres,eq
olcDbLinearIndex: FALSE
olcDbMode: 0600
olcDbSearchStack: 16
olcDbShmKey: 0
olcDbCacheFree: 1
olcDbDNcacheSize: 0
olcAccess: {0}to *  by self write  by group/groupOfNames/member.exact="cn=ldap
admins,dc=o3bnetworks.net" write  by dn.base="cn=authdn,dc=o3bnetworks.net" r
ead  by dn.base="cn=syncdn,dc=o3bnetworks.net" read  by users read  by anonym
ous read
olcDbCacheSize: 1000
olcDatabase: {3}bdb
olcSyncrepl: {0}rid=011 provider=ldap://auth1noc.man.o3b.local bindmethod=simp
le binddn="cn=syncdn,dc=o3bnetworks.net" credentials="33jJ9nSkSD" keepalive=0
:5:0 starttls=yes tls_reqcert=allow tls_cipher_suite=HIGH:MEDIUM:SSLv2 search
base="dc=o3bnetworks.net" scope=sub schemachecking=off type=refreshAndPersist
  retry="5 5 300 +" logbase="cn=accesslog" logfilter="(&(objectClass=auditWrit
eObject)(reqResult=0))" syncdata=accesslog
olcSyncrepl: {1}rid=012 provider=ldap://auth2noc.man.o3b.local bindmethod=simp
le binddn="cn=syncdn,dc=o3bnetworks.net" credentials="33jJ9nSkSD" keepalive=0
:5:0 starttls=yes tls_reqcert=allow tls_cipher_suite=HIGH:MEDIUM:SSLv2 search
base="dc=o3bnetworks.net" scope=sub schemachecking=off type=refreshAndPersist
  retry="5 5 300 +" logbase="cn=accesslog" logfilter="(&(objectClass=auditWrit
eObject)(reqResult=0))" syncdata=accesslog
olcSyncrepl: {2}rid=013 provider=ldap://auth1noc.btz.o3b.local bindmethod=simp
le binddn="cn=syncdn,dc=o3bnetworks.net" credentials="33jJ9nSkSD" keepalive=0
:5:0 starttls=yes tls_reqcert=allow tls_cipher_suite=HIGH:MEDIUM:SSLv2 filter
="(objectclass=*)" searchbase="dc=o3bnetworks.net" scope=sub schemachecking=o
ff type=refreshAndPersist retry="5 5 300 +" logbase="cn=accesslog" logfilter=
"(&(objectClass=auditWriteObject)(reqResult=0))" syncdata=accesslog
olcSyncrepl: {3}rid=014 provider=ldap://auth2noc.btz.o3b.local bindmethod=simp
le binddn="cn=syncdn,dc=o3bnetworks.net" credentials="33jJ9nSkSD" keepalive=0
:5:0 starttls=yes tls_reqcert=allow tls_cipher_suite=HIGH:MEDIUM:SSLv2 filter
="(objectclass=*)" searchbase="dc=o3bnetworks.net" scope=sub schemachecking=o
ff type=refreshAndPersist retry="5 5 300 +" logbase="cn=accesslog" logfilter=
"(&(objectClass=auditWriteObject)(reqResult=0))" syncdata=accesslog

dn: olcOverlay={0}memberof,olcDatabase={3}bdb,cn=config
objectClass: olcOverlayConfig
objectClass: olcMemberOf
olcOverlay: {0}memberof
olcMemberOfDangling: ignore
olcMemberOfRefInt: FALSE

dn: olcOverlay={1}syncprov,olcDatabase={3}bdb,cn=config
objectClass: olcOverlayConfig
objectClass: olcSyncProvConfig
olcOverlay: {1}syncprov
olcSpCheckpoint: 1000 60

dn: olcOverlay={2}ppolicy,olcDatabase={3}bdb,cn=config
objectClass: olcOverlayConfig
objectClass: olcConfig
objectClass: top
objectClass: olcPPolicyConfig
olcOverlay: {2}ppolicy
olcPPolicyDefault: cn=O3b,ou=Password,ou=Policy,dc=o3bnetworks.net

dn: olcOverlay={3}accesslog,olcDatabase={3}bdb,cn=config
objectClass: olcOverlayConfig
objectClass: olcAccessLogConfig
olcOverlay: {3}accesslog
olcAccessLogOps: writes
olcAccessLogSuccess: TRUE
olcAccessLogDB: cn=accesslog
olcAccessLogPurge: 2+00:00 1+00:00
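
As a reading aid for the olcSyncrepl values above, the sketch below splits one of them into key/value pairs and spells out the retry setting. It is only an illustration: parse_syncrepl and describe_retry are hypothetical helper names, the value is assumed to have been unfolded onto one line first, the tls_cipher_suite parameter is left out for brevity, and the credentials are replaced with a placeholder.

```python
import shlex

# One consumer stanza from the main database, unfolded; credentials replaced.
SYNCREPL = (
    '{0}rid=011 provider=ldap://auth1noc.man.o3b.local bindmethod=simple '
    'binddn="cn=syncdn,dc=o3bnetworks.net" credentials="REDACTED" '
    'keepalive=0:5:0 starttls=yes tls_reqcert=allow '
    'searchbase="dc=o3bnetworks.net" scope=sub schemachecking=off '
    'type=refreshAndPersist retry="5 5 300 +" logbase="cn=accesslog" '
    'logfilter="(&(objectClass=auditWriteObject)(reqResult=0))" '
    'syncdata=accesslog'
)


def parse_syncrepl(value):
    """Split an unfolded olcSyncrepl value into a dict of its key=value parameters."""
    if value.startswith("{"):
        # Drop the "{N}" ordering prefix that cn=config puts on each value.
        value = value.split("}", 1)[1]
    # shlex keeps quoted strings such as retry="5 5 300 +" together as one token.
    return dict(token.partition("=")[::2] for token in shlex.split(value))


def describe_retry(retry):
    """Turn 'interval count' pairs into (seconds, tries) tuples; '+' becomes None (forever)."""
    fields = retry.split()
    return [
        (int(interval), None if count == "+" else int(count))
        for interval, count in zip(fields[0::2], fields[1::2])
    ]


params = parse_syncrepl(SYNCREPL)
print(params["type"], params["syncdata"], describe_retry(params["retry"]))
```

Read this way, each master retries a lost provider every 5 seconds for 5 attempts and then every 300 seconds indefinitely, and it replicates from the provider's cn=accesslog database (syncdata=accesslog), which is what makes this delta-syncrepl rather than plain syncrepl.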