
(ITS#4562) slapd crash (syncrepl related?)



Full_Name: Kevin Spicer
Version: 2.3.23
OS: Solaris 9 sparc
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (198.178.236.140)


I started moving to syncrepl replication last night, using syncrepl between two
of my servers.  At around 5am this morning slapd crashed on the provider for the
superior database (this server is also a consumer for one of the subordinate
databases).  The backtrace seems to suggest that syncprov went into a recursive
loop.  I have a core available should you require additional debugging
output.  I've only seen this in my production environment; my test box has
(so far) been stable with two slapd instances replicating copies of my
production database.

OpenLDAP 2.3.23 compiled as follows...
./configure --prefix=/usr/local --enable-bdb --enable-crypt --with-threads \
    --with-tls --without-kerberos --enable-wrappers --enable-modules \
    --enable-ppolicy=mod --enable-syncprov=mod
CFLAGS=-g
BDB 4.2.52 + patches

slapd.conf looks like this (confidential information changed; ACLs, indexes and
schemas omitted for brevity)...

#### START SLAPD.CONF
# Various schemas [DELETED]

allow bind_v2
sizelimit 10000
loglevel 256
pidfile         /var/run/slapd/slapd.pid
argsfile        /usr/local/var/slapd.args
replica-pidfile         /var/run/slapd/slurpd.pid
replica-argsfile        /usr/local/var/slurpd.args
replicationinterval 60

defaultsearchbase dc=mydomain,dc=com
threads 8
password-hash {MD5}

modulepath      /usr/local/libexec/openldap
moduleload      ppolicy.la
moduleload      syncprov.la

TLSCipherSuite HIGH:+TLSv1:+SSLv2:+SSLv3
TLSCACertificateFile /usr/local/etc/openldap/certs/cacert.pem
TLSCertificateFile /usr/local/etc/openldap/certs/slapd-cert.pem
TLSCertificateKeyFile /usr/local/etc/openldap/certs/slapd-key.pem
security ssf=0 tls=0 update_ssf=128 simple_bind=128 update_tls=128

# Various ACLs [DELETED]

database        bdb
suffix          "ou=server2,ou=machines,dc=mydomain,dc=com"
rootdn          "cn=Manager,dc=mydomain,dc=com"
syncrepl rid=101
        provider=ldaps://server2.mydomain.com
        type=refreshAndPersist
        retry=30,10 120,30 300,+
        binddn=cn=syncuser,dc=mydomain,dc=com
        bindmethod=simple
        credentials=asecret
        searchbase="ou=server2,ou=machines,dc=mydomain,dc=com"
updateref       ldaps://server2.mydomain.com
directory       /var/db/ldap/server2
mode            0600
subordinate
# Indexes [DELETED]
cachesize 5000
checkpoint 512 720

####
database        bdb
suffix          "ou=server3,ou=machines,dc=mydomain,dc=com"
rootdn          "cn=Manager,dc=mydomain,dc=com"
updatedn        cn=syncuser,dc=mydomain,dc=com
updateref       ldaps://server3.mydomain.com
directory       /var/db/ldap/server3
mode            0600
subordinate
# Indexes [DELETED]
cachesize 5000
checkpoint 512 720

#####
database        bdb
suffix          "ou=server4,ou=machines,dc=mydomain,dc=com"
rootdn          "cn=Manager,dc=mydomain,dc=com"
updatedn        cn=syncuser,dc=mydomain,dc=com
updateref       ldaps://server4.mydomain.com
directory       /var/db/ldap/server4
mode            0600
subordinate
# Indexes [DELETED]
cachesize 5000
checkpoint 512 720

###

database        bdb
suffix          "dc=mydomain,dc=com"
rootdn          "cn=Manager,dc=mydomain,dc=com"
rootpw          asecret
directory       /var/db/ldap/central
mode            0600
overlay         glue
overlay         syncprov
overlay         ppolicy
ppolicy_default "cn=users,ou=policy,dc=mydomain,dc=com"
ppolicy_use_lockout
syncprov-checkpoint 100 10
syncprov-sessionlog 100
replica uri=ldaps://server3.mydomain.com:636
        binddn="cn=syncuser,dc=mydomain,dc=com"
        bindmethod=simple credentials=asecret
replica uri=ldaps://server4.mydomain.com:636
        binddn="cn=syncuser,dc=mydomain,dc=com"
        bindmethod=simple credentials=asecret
replogfile /var/db/ldap/replogfile
# Indexes [DELETED]
# Database specific ACLs [DELETED]

###### END SLAPD.CONF


There is nothing particularly unusual in the logs.  The last entry is a
connection from slurpd on one of the servers that hasn't yet been converted to
syncrepl, changing an entry on subordinate database server4.

gdb on the core file gives this...

gdb -c core.slapd.21717 /usr/local/libexec/slapd

(gdb) bt
#0  0x000cfecc in syncrepl_config ()
#1  0x000cff7c in syncrepl_config ()
#2  0x000cff7c in syncrepl_config ()
#3  0x000cff7c in syncrepl_config ()
#4  0x000cff7c in syncrepl_config ()
#5  0x000cff7c in syncrepl_config ()
#6  0x000cff7c in syncrepl_config ()
#7  0x000cff7c in syncrepl_config ()
#8  0x000cff7c in syncrepl_config ()
#9  0x000cff7c in syncrepl_config ()
#10 0x000cff7c in syncrepl_config ()
#11 0x000cff7c in syncrepl_config ()
############## ...then another 10900 or so similar lines, then...
#################
#10909 0x000cff7c in syncrepl_config ()
#10910 0x00057f8c in be_entry_release_rw ()

#10911 0xfece4e84 in syncprov_matchops (op=0xc04080, opc=0x649b2c, saveit=0) at syncprov.c:1075
#10912 0xfece69dc in syncprov_op_response (op=0xc04080, rs=0xf73ffd30) at syncprov.c:1521
#10913 0x0005b7bc in slap_req2res ()
#10914 0x0005cabc in slap_send_ldap_result ()
#10915 0x000dffa8 in bdb_modify ()
#10916 0x000cec24 in syncrepl_config ()
#10917 0x000d1bd4 in overlay_op_walk ()
#10918 0x000d1e98 in overlay_op_walk ()
#10919 0x000d1fa8 in overlay_op_walk ()
#10920 0x00068a98 in fe_op_modify ()
#10921 0x00067960 in do_modify ()
#10922 0x000440ec in connection_done ()
#10923 0x00164524 in ldap_pvt_thread_pool_destroy ()
#10924 0xfed957bc in _lwp_start () from /usr/lib/libthread.so.1
#10925 0xfed957bc in _lwp_start () from /usr/lib/libthread.so.1
Previous frame identical to this frame (corrupt stack?)
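
For anyone who wants to condense a backtrace like the one above instead of
trimming it by hand, here is a rough sketch in Python.  It is not part of
OpenLDAP or gdb; the function name collapse_frames and the regular expression
are my own assumptions about what "#N 0xADDR in SYMBOL" frame lines look like.
It merges consecutive frames that share the same address and symbol into one
run, and skips any line that is not a frame (such as the "then another 10900
or so similar lines" marker), so an elided stretch still ends up as a single
run.

```python
import re

# Assumed shape of a gdb frame line: "#3  0x000cff7c in syncrepl_config ()".
FRAME_RE = re.compile(r"^#(\d+)\s+(0x[0-9a-fA-F]+) in (\S+)")


def collapse_frames(lines):
    """Collapse consecutive frames with the same address and symbol.

    Returns a list of (first_frame, last_frame, address, symbol) tuples.
    Lines that do not look like frame lines are skipped.
    """
    runs = []
    for line in lines:
        m = FRAME_RE.match(line.strip())
        if not m:
            continue  # not a frame line (comment, wrapped text, etc.)
        num, addr, sym = int(m.group(1)), m.group(2), m.group(3)
        if runs and runs[-1][2] == addr and runs[-1][3] == sym:
            # Same frame repeated: extend the current run.
            runs[-1] = (runs[-1][0], num, addr, sym)
        else:
            runs.append((num, num, addr, sym))
    return runs


# A few frames in the same format as the backtrace in this report.
sample = [
    "#0  0x000cfecc in syncrepl_config ()",
    "#1  0x000cff7c in syncrepl_config ()",
    "#2  0x000cff7c in syncrepl_config ()",
    "#3  0x000cff7c in syncrepl_config ()",
    "#4  0x00057f8c in be_entry_release_rw ()",
]

for first, last, addr, sym in collapse_frames(sample):
    span = f"#{first}" if first == last else f"#{first}-#{last}"
    print(span, addr, sym)
```

Feeding it the saved output of "bt" (one frame per line) gives a short summary
in which the repeated 0x000cff7c frames appear as a single range.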