[Date Prev][Date Next] [Chronological] [Thread] [Top]

(ITS#6639) syncrepl failing using SASL/GSSAPI



Full_Name: Bill MacAllister
Version: 2.4.23-1
OS: Debian Lenny
URL: http://www.stanford.edu/~whm/files/ldap-replication-error.txt
Submission from: (NULL) (99.55.172.208)


I just replaced a 2.3.43 OpenLDAP master with a 2.4.23-1 master.  There are 6
slaves all running 2.4.23-1.  After reloading the master database, I confirmed
that replication was working my making some changes and then examining
contextCSNs.  Sometime later replication stopped on one of the servers.  Looking
at the log on the failing slave I saw:

Sep  2 22:12:53 directory3 slapd[7005]: do_syncrep2: rid=000
LDAP_RES_INTERMEDIATE - SYNC_ID_SET
Sep  2 22:12:53 directory3 slapd[7005]: do_syncrep2: rid=000 (-1) Can't contact
LDAP server
Sep  2 22:12:53 directory3 slapd[7005]: do_syncrepl: rid=000 rc -1 retrying

On the master I saw:

Sep  2 22:12:53 directory-master0 slapd[15145]: conn=3764 fd=38 ACCEPT from
IP=171.67.218.156:44519 (IP=0.0.0.0:389)
Sep  2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=0 BIND dn=""
method=163
Sep  2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=0 RESULT tag=97
err=14 text=SASL(0): successful result: 
Sep  2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=1 BIND dn=""
method=163
Sep  2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=1 RESULT tag=97
err=14 text=SASL(0): successful result: 
Sep  2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=2 BIND dn=""
method=163
Sep  2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=2 BIND
authcid="ldap/directory3.stanford.edu@stanford.edu"
authzid="ldap/directory3.stanford.edu@stanford.edu"
Sep  2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=2 BIND
dn="cn=directory3,cn=ldap,cn=operational,dc=stanford,dc=edu" mech=GSSAPI
sasl_ssf=56 ssf=56
Sep  2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=2 RESULT tag=97
err=0 text=
Sep  2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=3 SRCH
base="cn=accesslog" scope=2 deref=0
filter="(&(objectClass=auditWriteObject)(reqResult=0))"
Sep  2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=3 SRCH attr=reqDN
reqType reqMod reqNewRDN reqDeleteOldRDN reqNewSuperior entryCSN
Sep  2 22:12:53 directory-master0 slapd[15145]: conn=3764 fd=38 closed
(connection lost on write)

Running slapd on the failing slave with -d -1 I saw:

=>do_syncrep2 rid=000
ldap_result ld 0x1c07000 msgid 4
wait4msg ld 0x1c07000 msgid 4 (timeout 0 usec)
wait4msg continue ld 0x1c07000 msgid 4 all 0
** ld 0x1c07000 Connections:
* host: directory-master.stanford.edu  port: 389  (default)
  refcnt: 2  status: Connected
  last used: Thu Sep  2 22:12:53 2010


** ld 0x1c07000 Outstanding Requests:
 * msgid 4,  origid 4, status InProgress
   outstanding referrals 0, parent count 0
  ld 0x1c07000 request count 1 (abandoned 0)
** ld 0x1c07000 Response Queue:
   Empty
  ld 0x1c07000 response count 0
ldap_chkResponseList ld 0x1c07000 msgid 4 all 0
ldap_chkResponseList returns ld 0x1c07000 NULL
ldap_int_select
read1msg: ld 0x1c07000 msgid 4 all 0
ber_get_next
sasl_generic_read: want=4, got=4
  0000:  00 00 00 78                                        ...x              
sasl_generic_read: want=120, got=120

... lots of output ...

sasl_generic_read: want=45833, got=0

ldap_read: want=8, got=0

ber_get_next failed.
ldap_err2string
do_syncrep2: rid=000 (-1) Can't contact LDAP server

I am including a pointer to the complete debug log.

Looks like something in the sasl support is unhappy.  We are using
libsasl2-modules-gssapi-heimdal 2.1.23.dfsg1-5~sbp50+2.

Strange to me that this one server is having a problem.  There are actually 5
slaves that are working just fine.  Here are current contextCSNs for the
systems.

directory-master contextCSN: 20100903060040.713677Z#000000#000#000000
directory1       contextCSN: 20100903060040.713677Z#000000#000#000000
directory2       contextCSN: 20100903060040.713677Z#000000#000#000000
directory3       contextCSN: 20100903015531.064407Z#000000#000#000000
directory4       contextCSN: 20100903060040.713677Z#000000#000#000000
directory11      contextCSN: 20100903060040.713677Z#000000#000#000000
directory12      contextCSN: 20100903060040.713677Z#000000#000#000000

What else should I look at?

Bill