Problems with replicas



I have one master server, and two slave servers.

Master config file:
----- s n i p -----
replogfile      /var/lib/openldap/replication.log
replica         host=_SLAVE1_:389 binddn="_REPLICA-DN_"
                bindmethod=simple credentials=_SECRET_
replica         host=_SLAVE2_:389 binddn="_REPLICA-DN_"
                bindmethod=simple credentials=_SECRET_

Slave config file (EXACTLY the same on both slaves!):
----- s n i p -----
loglevel 256	# Good longterm logging/debugging...
include		/etc/openldap/slapd.includes
schemacheck	on
pidfile		/var/run/slapd.pid
database	ldbm
directory	"/var/lib/openldap"
cachesize	10000
dbcachesize	50000
updatedn	"_REPLICA-DN_"
lastmod		on
sizelimit	1500
index		uid,mail,mailalternateaddress,mailforwardingaddress eq
suffix		"c=SE"
include		/etc/openldap/slapd.access
----- s n i p -----

Censoring explanation:
        _SLAVE[12]_ are the IP addresses of the slave servers.
        _REPLICA-DN_ is the same (exactly, double-checked) on both slaves.
        _SECRET_ is the same (exactly, double-checked) on both slaves, in cleartext.

Doing a write/modify/delete on the master propagates the changes to
_SLAVE2_ as it should, but NOT to _SLAVE1_!

Sometimes, if I wait long enough (about 10-15 minutes), it finally goes
through, but not always...


To try to track this problem down, I shut down all of the slurpd
processes on the master, added an object, and then ran a little (well,
not so little :) one-liner to search all three hosts.

----- s n i p -----
_MASTERIP_: _ADDED-OBJECT_ 
_SLAVE1_: 
_SLAVE2_: 
----- s n i p -----

Censoring explanation:
        _MASTERIP_ is the IP address of the master server
        _ADDED-OBJECT_ is the DN of the object that was added
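
For reference, the search across the three hosts amounts to roughly the
following. This is only a sketch of the idea, not the actual one-liner: the
helper names are made up for illustration, the host and DN values are the
censored placeholders, and the ldapsearch options (-h, -p, -b, -s) are the
ones I understand this generation of the client tools to accept (newer
releases differ).

```python
import subprocess

# Placeholders, censored the same way as in the rest of this post.
HOSTS = ["_MASTERIP_", "_SLAVE1_", "_SLAVE2_"]
DN = "_ADDED-OBJECT_"


def build_search(host, dn, port=389):
    """Return an ldapsearch command line that reads exactly one entry."""
    return ["ldapsearch", "-h", host, "-p", str(port),
            "-b", dn, "-s", "base", "(objectclass=*)"]


def found(host, dn, run=subprocess.run):
    """True if the search on `host` succeeds and returns any output."""
    result = run(build_search(host, dn), capture_output=True, text=True)
    return result.returncode == 0 and result.stdout.strip() != ""


def report(hosts, dn, run=subprocess.run):
    """One line per host: the DN if the entry exists there, else blank."""
    return ["%s: %s" % (host, dn if found(host, dn, run) else "")
            for host in hosts]
```

Printing the lines returned by report(HOSTS, DN) gives output in the same
"host: DN" shape as the snippet above.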

Executing 

time slurpd -d 255 -o -r /var/lib/openldap/replication.log 2>&1 | tee /tmp/out

will give this after about 15 seconds (not before):

----- s n i p -----
_MASTERIP_: _ADDED-OBJECT_ 
_SLAVE1_: 
_SLAVE2_: _ADDED-OBJECT_ 
----- s n i p -----


After about 2 minutes, I shut slurpd down and ran it again, so that it
would only propagate the changes to _SLAVE1_...

It seems to hang at ldap_send_server_request(). This is what ends up
in '/tmp/out' while waiting for the propagation (censored, of course :):

----- s n i p -----
Config: opening config file "/etc/openldap/slapd.conf"
Config: (loglevel 2048)
Config: (include		/etc/openldap/slapd.includes)
Config: (include		/etc/openldap/slapd.access)
Config: (schemacheck	on)
Config: (pidfile		/var/run/slapd.pid)
Config: (database	ldbm)
Config: (suffix		"c=SE")
Config: (directory	"/var/lib/openldap")
Config: (cachesize	10000)
Config: (dbcachesize	1000000)
Config: (dbcachenowsync )
Config: (lastmod		on)
Config: (sizelimit	1500)
Config: (index		uid,mail,mailalternateaddress,mailforwardingaddress eq)
Config: (replogfile      /var/lib/openldap/replication.log)
Config: (replica         host=_SLAVE1_:389 binddn="cn=admin,ou=Users,o=Air2Net,c=se"                bindmethod=simple credentials=_SECRET_)
Config: ** successfully added replica "_SLAVE1_:389"
Config: (replica         host=_SLAVE2_:389 binddn="cn=admin,ou=Users,o=Air2Net,c=se"                bindmethod=simple credentials=_SECRET_)
Config: ** successfully added replica "_SLAVE2_:389"
Config: ** configuration file successfully read and parsed
Retrieved state information for _SLAVE1_:389 (timestamp 974384051.0)
Retrieved state information for _SLAVE2_:389 (timestamp 974384367.0)
begin replication thread for _SLAVE1_:389
Replica _SLAVE1_:389, skip repl record for _ADDED-OBJECT_ (old)
Open connection to _SLAVE1_:389
ldap_open
begin replication thread for _SLAVE2_:389
Replica _SLAVE2_:389, skip repl record for _ADDED-OBJECT_ (old)
Replica _SLAVE2_:389, skip repl record for _ADDED-OBJECT_ (old)
end replication thread for _SLAVE2_:389
ldap_init
ldap_delayed_open
open_ldap_connection
ldap_connect_to_host: _SLAVE1_:389
sd 6 connected to: _SLAVE1_
ldap_open successful, ld_host is (null)
bind to _SLAVE1_:389 as _REPLICA-DN_ (simple)
ldap_simple_bind_s
ldap_simple_bind
ldap_send_initial_request
ldap_delayed_open
ldap_send_server_request
ldap_result
wait4msg (infinite timeout)
** Connections:
* host: _SLAVE1_  port: 389  (default)
  refcnt: 2  status: Connected
  last used: Thu Nov 16 15:23:25 2000

** Outstanding Requests:
 * msgid 1,  origid 1, status InProgress
   outstanding referrals 0, parent count 0
** Response Queue:
   Empty
do_ldap_select
read1msg
got result msgid 1, original id 1
read1msg:  0 new referrals
request 1 done
res_errno: 0, res_error: <>, res_matched: <>
ldap_free_request (origid 1, msgid 1)
ldap_free_connection
ldap_free_connection: refcnt 1
ldap_result2error
ldap_msgfree
replica _SLAVE1_:389 - add dn "_ADDED-OBJECT_"
ldap_add
ldap_send_initial_request
ldap_delayed_open
ldap_send_server_request
----- s n i p -----

After 13m10.520s, slurpd succeeded. This is the follow-up in the debug
output; this part 'only' takes about 20-30 seconds:

----- s n i p -----
ldap_result
wait4msg (infinite timeout)
** Connections:
* host: _SLAVE1_  port: 389  (default)
  refcnt: 2  status: Connected
  last used: Thu Nov 16 15:23:25 2000

** Outstanding Requests:
 * msgid 2,  origid 2, status InProgress
   outstanding referrals 0, parent count 0
** Response Queue:
   Empty
do_ldap_select

read1msg
got result msgid 2, original id 2
read1msg:  0 new referrals
request 2 done
res_errno: 0, res_error: <>, res_matched: <>
ldap_free_request (origid 2, msgid 2)
ldap_free_connection
ldap_free_connection: refcnt 1
ldap_result2error
ldap_msgfree
end replication thread for _SLAVE1_:389
slurpd: terminating normally
Processing in one-shot mode:
2 total replication records in file,
2 replication records to process.

real    13m10.520s
user    0m0.540s
sys     0m0.160s
----- s n i p -----
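
The two "Retrieved state information" lines in the first part of the debug
output carry timestamps in seconds since the Unix epoch. As far as I
understand slurpd, they mark the last replication record each replica has
accepted, which is why older records are skipped as "(old)". A small sketch
to make them readable (the function names are my own, and the two input
lines are copied from the log above):

```python
import re
from datetime import datetime, timezone

STATE_LINE = re.compile(
    r"Retrieved state information for (\S+) \(timestamp (\d+)\.(\d+)\)")


def parse_state(lines):
    """Map each replica to the epoch seconds slurpd read for it."""
    state = {}
    for line in lines:
        m = STATE_LINE.search(line)
        if m:
            state[m.group(1)] = int(m.group(2))
    return state


def as_utc(epoch_seconds):
    """Render epoch seconds as a UTC date and time string."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


log = [
    "Retrieved state information for _SLAVE1_:389 (timestamp 974384051.0)",
    "Retrieved state information for _SLAVE2_:389 (timestamp 974384367.0)",
]
state = parse_state(log)
for replica, ts in state.items():
    print(replica, as_utc(ts), "UTC")
print("difference in seconds:",
      state["_SLAVE2_:389"] - state["_SLAVE1_:389"])
```

For these two lines it reports 2000-11-16 14:14:11 UTC for _SLAVE1_ and
14:19:27 UTC for _SLAVE2_, so the state for _SLAVE1_ is 316 seconds older.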


Since the master and both slaves are using the same home-built Debian
package of OpenLDAP 1.2.11 (built to use Sleepycat DB instead of the
default), I'm more inclined to believe that this somehow has to do with
the OS (same version of Debian GNU/Linux on all three machines) or the
network (the master and _SLAVE1_ are on the same local network, while
_SLAVE2_ is some distance away).
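
One cheap way to start separating the network from slapd itself would be to
time plain TCP connections from the master to port 389 on each slave. The
sketch below is only that: the function names are hypothetical, and since
the bind in the log above completes normally, a clean result here rules out
connection setup only and says nothing about how slapd handles the add.

```python
import socket
import time


def connect_time(host, port=389, timeout=10.0):
    """Seconds needed to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


def compare(hosts, port=389, attempts=5):
    """Worst connect time per host; None if any attempt failed."""
    worst = {}
    for host in hosts:
        samples = [connect_time(host, port) for _ in range(attempts)]
        worst[host] = None if None in samples else max(samples)
    return worst
```

Running compare(["_SLAVE1_", "_SLAVE2_"]) on the master would show whether
the slave on the local network is noticeably slower to reach than the
distant one.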


Does anyone have any idea where to look further for this problem?