(ITS#4618) Syncrepl/glue interaction (perhaps related to ITS 4323)



Full_Name: Kevin Spicer
Version: 2.3.24
OS: Solaris 9
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (198.178.236.140)


[Resubmitting this as I submitted on Monday, but it hasn't appeared in ITS]

I was pretty sure that all the issues related to ITS 4323 were resolved; however,
I had an incident on Monday which leads me to suspect there is still a problem
there.

I have four servers serving a distributed, glued, replicated directory.  The
reason for this design is that I need to ensure service over geographically
dispersed sites, and allow local modification of certain parts of the tree even
in the event of a WAN outage at any site.  One server is the syncrepl provider
for the superior database and a consumer for the three subordinate databases.
The other three servers are each a provider for one subordinate database and a
consumer for the superior and the other two subordinates.
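
To make the provider/consumer roles concrete, here is a rough sketch of the
layout in Python.  The suffixes and the names server1 to server4 are taken from
the configuration further down; the PROVIDERS table and the roles() helper are
purely illustrative and are not part of any OpenLDAP tooling.

```python
# Illustrative model of the topology described above (not OpenLDAP code).
# Each suffix maps to the single server that acts as its syncrepl provider.
SUPERIOR = "dc=domain,dc=com"
PROVIDERS = {
    SUPERIOR: "server1",
    "ou=server2,ou=machines,dc=domain,dc=com": "server2",
    "ou=server3,ou=machines,dc=domain,dc=com": "server3",
    "ou=server4,ou=machines,dc=domain,dc=com": "server4",
}


def roles(server):
    """Return (suffixes this server provides, suffixes it consumes)."""
    provides = sorted(s for s, p in PROVIDERS.items() if p == server)
    consumes = sorted(s for s, p in PROVIDERS.items() if p != server)
    return provides, consumes


for server in sorted(set(PROVIDERS.values())):
    provides, consumes = roles(server)
    print(server, "provides", provides, "consumes", consumes)
```

Every server holds all four databases, masters exactly one of them, and
replicates the other three from their respective providers.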

At some point on Monday one of the subordinate masters (let's call it server2)
fell off the network (actually someone doing some work dislodged the patch
cable) and was unavailable for about an hour.  During this time the server
remained running normally, but without network service.  Once the server came
back on-line I received a complaint that some entries added to the superior
during the outage had not been replicated to server2.  I rectified this by
restarting slapd, which prompted syncrepl to resync immediately.  However, later
in the afternoon I received complaints of missing entries in the subordinates on
server2.  It turned out that the subordinate master on server2 only contained
those entries added during the outage, and one of the subordinate slaves was
completely empty.  (The other, I suspect, may have been emptied but then
replicated back.)  Worse still, the subordinate master had replicated its
near-empty status to all the slaves, requiring it to be restored from backup
(rather than slapcat/slapadd from a slave copy).  I was able to recover the
subordinate slave databases on server2 by deleting their database files and
restarting slapd, allowing syncrepl to do its stuff.

This is OpenLDAP 2.3.24, compiled as follows:
./configure --prefix=/usr/local --enable-bdb --enable-crypt --with-threads \
    --with-tls --without-kerberos --enable-wrappers --enable-modules \
    --enable-ppolicy=mod --enable-syncprov=mod

My slapd.conf on server2 looks like the following...
# [ VARIOUS SCHEMA DIRECTIVES ]
allow bind_v2
sizelimit 10000
loglevel 256
pidfile         /var/run/slapd/slapd.pid
argsfile        /usr/local/var/slapd.args
defaultsearchbase dc=domain,dc=com
threads 8
password-hash {MD5}
modulepath      /usr/local/libexec/openldap
moduleload      ppolicy.la
moduleload      syncprov.la
TLSCipherSuite HIGH:+TLSv1:+SSLv2:+SSLv3
TLSCACertificateFile /usr/local/etc/openldap/certs/cacert.pem
TLSCertificateFile /usr/local/etc/openldap/certs/server2.slapd-cert.pem
TLSCertificateKeyFile /usr/local/etc/openldap/certs/server2.slapd-key.pem
security ssf=0 tls=0 update_ssf=128 simple_bind=128 update_tls=128
# [ VARIOUS ACL ENTRIES ]
# This is the database definition for the portion of the DIT which
# relates directly to this machine
database        bdb
suffix          "ou=server2,ou=machines,dc=domain,dc=com"
rootdn          "cn=Manager,dc=domain,dc=com"
directory       /var/db/ldap/server2
mode            0600
subordinate
overlay         syncprov
syncprov-checkpoint     100 10
syncprov-sessionlog     100
# [ VARIOUS INDICES ]
cachesize 5000
checkpoint 512 720
## This is a database definition for a replica of one of the subsidiary DITS
database        bdb
suffix          "ou=server3,ou=machines,dc=domain,dc=com"
rootdn          "cn=Manager,dc=domain,dc=com"
syncrepl rid=122
        provider=ldaps://server3.domain.com
        type=refreshAndPersist
        retry=30,10 120,30 300,+
        binddn=cn=syncuser,dc=domain,dc=com
        bindmethod=simple
        credentials=secret
        searchbase="ou=server3,ou=machines,dc=domain,dc=com"
updateref       ldaps://server3.domain.com
directory       /var/db/ldap/server3
mode            0600
subordinate
# [ VARIOUS INDICES ]
cachesize 5000
checkpoint 512 720
## This is a database definition for a replica of one of the subsidiary DITS
database        bdb
suffix          "ou=server4,ou=machines,dc=domain,dc=com"
rootdn          "cn=Manager,dc=domain,dc=com"
syncrepl rid=123
        provider=ldaps://server4.domain.com
        type=refreshAndPersist
        retry=30,10 120,30 300,+
        binddn=cn=syncuser,dc=domain,dc=com
        bindmethod=simple
        credentials=secret
        searchbase="ou=server4,ou=machines,dc=domain,dc=com"
updateref       ldaps://server4.domain.com
directory       /var/db/ldap/server4
mode            0600
subordinate
# [ VARIOUS INDICES ]
cachesize 5000
checkpoint 512 720
# This portion deals with the main database on REPLICA servers
database        bdb
suffix          "dc=domain,dc=com"
rootdn          "cn=Manager,dc=domain,dc=com"
rootpw          {MD5}XXXXXXXXXXX
syncrepl rid=121
        provider=ldaps://server1.group.local
        type=refreshAndPersist
        retry=30,10 120,30 300,+
        binddn=cn=syncuser,dc=domain,dc=com
        bindmethod=simple
        credentials=secret
        searchbase="dc=domain,dc=com"
updateref       ldaps://server1.group.local
directory       /var/db/ldap/server1
mode            0600
overlay         glue
overlay         ppolicy
ppolicy_default "cn=systemusers,ou=policy,dc=domain,dc=com"
ppolicy_use_lockout
# [ VARIOUS INDICES ]
cachesize 5000
checkpoint 512 720
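
For anyone wanting to check a configuration like the one above against the
intended layout, a small Python sketch along these lines could summarise each
database stanza.  summarise_databases() is a hypothetical helper, not an
OpenLDAP tool, and it assumes the formatting used in this file: one directive
per line, with provider= on its own continuation line under syncrepl.  The
SAMPLE text is an abbreviated excerpt of the listing above.

```python
def summarise_databases(conf_text):
    """Collect suffix, subordinate flag, overlays and syncrepl provider
    for each 'database' stanza in slapd.conf-style text."""
    databases = []
    current = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        keyword = parts[0].lower()
        value = parts[1].strip() if len(parts) > 1 else ""
        if keyword == "database":
            current = {"suffix": None, "subordinate": False,
                       "overlays": [], "provider": None}
            databases.append(current)
        elif current is None:
            continue  # global directive before the first database
        elif keyword == "suffix":
            current["suffix"] = value.strip('"')
        elif keyword == "subordinate":
            current["subordinate"] = True
        elif keyword == "overlay":
            current["overlays"].append(value)
        elif keyword.startswith("provider="):
            # syncrepl continuation line, e.g. provider=ldaps://host
            current["provider"] = line.split("=", 1)[1]
    return databases


SAMPLE = """
database        bdb
suffix          "ou=server2,ou=machines,dc=domain,dc=com"
subordinate
overlay         syncprov

database        bdb
suffix          "dc=domain,dc=com"
syncrepl rid=121
        provider=ldaps://server1.group.local
        type=refreshAndPersist
overlay         glue
overlay         ppolicy
"""

for db in summarise_databases(SAMPLE):
    role = "consumer of " + db["provider"] if db["provider"] else "no syncrepl"
    print(db["suffix"], "| subordinate:", db["subordinate"],
          "|", role, "| overlays:", db["overlays"])
```

On the full server2 configuration this should list three subordinate databases
(one with syncprov and no syncrepl, two with syncrepl) followed by the superior
database carrying the glue and ppolicy overlays.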