Issue 6639 - syncrepl failing using SASL/GSSAPI
Summary: syncrepl failing using SASL/GSSAPI
Status: VERIFIED FIXED
Alias: None
Product: OpenLDAP
Classification: Unclassified
Component: slapd
Version: unspecified
Hardware: All
OS: All
Importance: --- normal
Target Milestone: ---
Assignee: OpenLDAP project
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-09-03 06:07 UTC by bill@ca-zephyr.org
Modified: 2014-08-01 21:04 UTC

See Also:


Attachments

Description bill@ca-zephyr.org 2010-09-03 06:07:54 UTC
Full_Name: Bill MacAllister
Version: 2.4.23-1
OS: Debian Lenny
URL: http://www.stanford.edu/~whm/files/ldap-replication-error.txt
Submission from: (NULL) (99.55.172.208)


I just replaced a 2.3.43 OpenLDAP master with a 2.4.23-1 master.  There are 6
slaves, all running 2.4.23-1.  After reloading the master database, I confirmed
that replication was working by making some changes and then examining
contextCSNs.  Sometime later, replication stopped on one of the servers.  Looking
at the log on the failing slave I saw:

Sep  2 22:12:53 directory3 slapd[7005]: do_syncrep2: rid=000
LDAP_RES_INTERMEDIATE - SYNC_ID_SET
Sep  2 22:12:53 directory3 slapd[7005]: do_syncrep2: rid=000 (-1) Can't contact
LDAP server
Sep  2 22:12:53 directory3 slapd[7005]: do_syncrepl: rid=000 rc -1 retrying

On the master I saw:

Sep  2 22:12:53 directory-master0 slapd[15145]: conn=3764 fd=38 ACCEPT from
IP=171.67.218.156:44519 (IP=0.0.0.0:389)
Sep  2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=0 BIND dn=""
method=163
Sep  2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=0 RESULT tag=97
err=14 text=SASL(0): successful result: 
Sep  2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=1 BIND dn=""
method=163
Sep  2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=1 RESULT tag=97
err=14 text=SASL(0): successful result: 
Sep  2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=2 BIND dn=""
method=163
Sep  2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=2 BIND
authcid="ldap/directory3.stanford.edu@stanford.edu"
authzid="ldap/directory3.stanford.edu@stanford.edu"
Sep  2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=2 BIND
dn="cn=directory3,cn=ldap,cn=operational,dc=stanford,dc=edu" mech=GSSAPI
sasl_ssf=56 ssf=56
Sep  2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=2 RESULT tag=97
err=0 text=
Sep  2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=3 SRCH
base="cn=accesslog" scope=2 deref=0
filter="(&(objectClass=auditWriteObject)(reqResult=0))"
Sep  2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=3 SRCH attr=reqDN
reqType reqMod reqNewRDN reqDeleteOldRDN reqNewSuperior entryCSN
Sep  2 22:12:53 directory-master0 slapd[15145]: conn=3764 fd=38 closed
(connection lost on write)

Running slapd on the failing slave with -d -1 I saw:

=>do_syncrep2 rid=000
ldap_result ld 0x1c07000 msgid 4
wait4msg ld 0x1c07000 msgid 4 (timeout 0 usec)
wait4msg continue ld 0x1c07000 msgid 4 all 0
** ld 0x1c07000 Connections:
* host: directory-master.stanford.edu  port: 389  (default)
  refcnt: 2  status: Connected
  last used: Thu Sep  2 22:12:53 2010


** ld 0x1c07000 Outstanding Requests:
 * msgid 4,  origid 4, status InProgress
   outstanding referrals 0, parent count 0
  ld 0x1c07000 request count 1 (abandoned 0)
** ld 0x1c07000 Response Queue:
   Empty
  ld 0x1c07000 response count 0
ldap_chkResponseList ld 0x1c07000 msgid 4 all 0
ldap_chkResponseList returns ld 0x1c07000 NULL
ldap_int_select
read1msg: ld 0x1c07000 msgid 4 all 0
ber_get_next
sasl_generic_read: want=4, got=4
  0000:  00 00 00 78                                        ...x              
sasl_generic_read: want=120, got=120

... lots of output ...

sasl_generic_read: want=45833, got=0

ldap_read: want=8, got=0

ber_get_next failed.
ldap_err2string
do_syncrep2: rid=000 (-1) Can't contact LDAP server

I am including a pointer to the complete debug log.

Looks like something in the sasl support is unhappy.  We are using
libsasl2-modules-gssapi-heimdal 2.1.23.dfsg1-5~sbp50+2.

It is strange to me that just this one server is having a problem; the other 5
slaves are working just fine.  Here are the current contextCSNs for the
systems:

directory-master contextCSN: 20100903060040.713677Z#000000#000#000000
directory1       contextCSN: 20100903060040.713677Z#000000#000#000000
directory2       contextCSN: 20100903060040.713677Z#000000#000#000000
directory3       contextCSN: 20100903015531.064407Z#000000#000#000000
directory4       contextCSN: 20100903060040.713677Z#000000#000#000000
directory11      contextCSN: 20100903060040.713677Z#000000#000#000000
directory12      contextCSN: 20100903060040.713677Z#000000#000#000000

What else should I look at?

Bill
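
As a side note, here is a minimal sketch (not part of the original report) of
how the contextCSN comparison above can be done programmatically with libldap.
The URI and suffix are placeholders, and the search is anonymous for brevity;
the servers described in this report would need a SASL/GSSAPI bind and suitable
ACLs to read the attribute.

/* Sketch: read the contextCSN values from a suffix entry with libldap.
 * The URI and suffix below are placeholders, not the reporter's servers. */
#include <stdio.h>
#include <ldap.h>

int main(void)
{
    LDAP *ld;
    LDAPMessage *res = NULL, *e;
    struct berval **vals;
    char *attrs[] = { "contextCSN", NULL };
    int version = LDAP_VERSION3;

    if (ldap_initialize(&ld, "ldap://directory-master.example.edu") != LDAP_SUCCESS)
        return 1;
    ldap_set_option(ld, LDAP_OPT_PROTOCOL_VERSION, &version);

    /* contextCSN is an operational attribute on the suffix entry, so it
     * must be requested explicitly in a base-scope search. */
    if (ldap_search_ext_s(ld, "dc=example,dc=edu", LDAP_SCOPE_BASE,
                          "(objectClass=*)", attrs, 0,
                          NULL, NULL, NULL, 0, &res) == LDAP_SUCCESS &&
        (e = ldap_first_entry(ld, res)) != NULL &&
        (vals = ldap_get_values_len(ld, e, "contextCSN")) != NULL) {
        for (int i = 0; vals[i] != NULL; i++)
            printf("contextCSN: %.*s\n", (int)vals[i]->bv_len, vals[i]->bv_val);
        ldap_value_free_len(vals);
    }
    if (res != NULL)
        ldap_msgfree(res);
    ldap_unbind_ext_s(ld, NULL, NULL);
    return 0;
}
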
Comment 1 bill@ca-zephyr.org 2010-09-03 08:23:17 UTC
Worked on this a bit more and in the end rebooted the server just to
be sure I had a clean environment.  When the system came back up
slapd refused to start; the bdb database was corrupt.  I reloaded
from the master and replication is again working.  Not sure why a
reboot was required to uncover the error.  I had stopped and started
slapd multiple times without any errors being reported.

Sorry for the inbox pollution.

Bill


-- 

Bill MacAllister
Infrastructure Delivery Group, Stanford University

Comment 2 bill@ca-zephyr.org 2010-09-04 00:10:22 UTC

--On Friday, September 03, 2010 01:23:17 AM -0700 Bill MacAllister <whm@stanford.edu> wrote:

The problem with the database was only coincidental.  Restoring the database
got the failing replica past the problem replication event.

In the replica pool of 6 servers we have seen the problem on three of the
servers.  Thinking about this more, it is unlikely to be a slave
problem, since the slaves had been in use for about 6 weeks and we did
not see the problem.  Only when we changed the master to 2.4.23 did we
see the problem.  I have captured a master debug log of the problem
event.  It is at http://www.stanford.edu/~whm/files/master-debug.txt.

Bill

-- 

Bill MacAllister
Infrastructure Delivery Group, Stanford University

Comment 3 Howard Chu 2010-09-10 05:43:06 UTC
whm@stanford.edu wrote:
Please try with this patch:

Index: sasl.c
===================================================================
RCS file: /repo/OpenLDAP/pkg/ldap/libraries/libldap/sasl.c,v
retrieving revision 1.79
diff -u -r1.79 sasl.c
--- sasl.c	13 Apr 2010 20:17:56 -0000	1.79
+++ sasl.c	10 Sep 2010 05:42:22 -0000
@@ -733,8 +733,9 @@
  		return ret;
  	} else if ( p->buf_out.buf_ptr != p->buf_out.buf_end ) {
  		/* partial write? pretend nothing got written */
-		len2 = 0;
  		p->flags |= LDAP_PVT_SASL_PARTIAL_WRITE;
+		sock_errset(EAGAIN);
+		len2 = -1;
  	}

  	/* return number of bytes encoded, not written, to ensure


-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/
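
For readers of the patch above, a hedged illustration (hypothetical names, not
the libldap code) of why the change matters.  The old branch reported a partial
SASL-layer write by setting len2 = 0, which a caller's flush loop cannot tell
apart from "nothing happened and nothing is pending"; the patch sets the socket
error to EAGAIN and returns -1 instead, so the caller retries exactly as it
would after a short write on an ordinary non-blocking socket.

/* Sketch only: a generic flush loop over a layered, non-blocking writer.
 * write_fn and flush_all are hypothetical names, not libldap APIs. */
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

typedef ssize_t (*write_fn)(void *ctx, const void *buf, size_t len);

int flush_all(write_fn wr, void *ctx, const char *buf, size_t len)
{
    size_t off = 0;

    while (off < len) {
        ssize_t n = wr(ctx, buf + off, len - off);

        if (n > 0) {
            off += (size_t)n;   /* that much was accepted, keep going */
        } else if (n < 0 && errno == EAGAIN) {
            return 1;           /* retry the rest when the socket is writable */
        } else if (n == 0) {
            /* Ambiguous: a 0 return from a write-style call usually means
             * "nothing was accepted", so the caller cannot tell a pending
             * partial write from a dead peer; this appears to be how the
             * pre-patch behaviour could derail the connection. */
            return -1;
        } else {
            return -1;          /* hard error */
        }
    }
    return 0;                   /* whole buffer accepted */
}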

Comment 4 bill@ca-zephyr.org 2010-09-10 10:10:20 UTC

--On Thursday, September 09, 2010 10:43:06 PM -0700 Howard Chu <hyc@symas.com> wrote:


I have applied the patch and the Debian packages built fine.  I have
installed the new packages on the master servers in our dev and test
environments.  Initial tests show that basic replication is working.
Of course, this problem did not exhibit itself in our test
environments.  I will install it in our production environment Friday
evening and let you know what happens.

Am I correct in assuming that the root of the problem is on the providers?
I am asking because I was not planning on installing the new version on the
consumers right away, since 'emergency' changes to production software
require special approval.  I would like to upgrade the providers in our
next maintenance window, which is next Thursday.  If you think I should
do it with the update of the master, let me know.

Bill

-- 

Bill MacAllister
Infrastructure Delivery Group, Stanford University

Comment 5 Howard Chu 2010-09-10 10:20:00 UTC
whm@stanford.edu wrote:
> I have applied the patch and the debian packages built fine.  I have
> installed the new packages on the master servers in our dev and test
> environments.  Initial tests show that basic replication is working.
> Of course, this problem did not exhibit itself in our test
> environments.  I will install it in our production environment Friday
> evening and let you know what happens.
>
> Am I correct in assuming that the root of the problem is on the providers?

Yes. Really, the problem will occur when either a client or a server writes a
large buffer through a SASL-encrypted connection, but this is more likely to
happen when a server is sending responses.

> I am asking because I was not planning installing the new version on the
> consumers right away since 'emergency' changes to production software
> require special approval.  I would like to upgrade the providers in our
> next maintenance window which is next Thursday.  If you think I should
> do it with the update of the master let me know.

It's probably not urgent for the consumers; just bear in mind that any clients
querying them through SASL could also run into similar issues.

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/
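
To illustrate the point about large buffers (a sketch under assumptions, not
the slapd/libldap code): with a GSSAPI security layer in place, plaintext is
wrapped into SASL packets no larger than the negotiated maximum buffer size,
so a large LDAP response becomes a series of encoded packets and socket
writes, any one of which can end up partial under load.  encode_and_send()
and send_packet() below are hypothetical helpers around the real Cyrus SASL
calls sasl_getprop() and sasl_encode().

/* Sketch: push a large message through a negotiated SASL security layer
 * one packet at a time.  encode_and_send/send_packet are hypothetical. */
#include <stddef.h>
#include <sasl/sasl.h>

int encode_and_send(sasl_conn_t *conn, const char *msg, size_t msglen,
                    int (*send_packet)(const char *pkt, unsigned pktlen))
{
    const unsigned *maxbuf;   /* treated here as the per-chunk plaintext limit */
    const char *pkt;
    unsigned pktlen;
    size_t off = 0;

    if (sasl_getprop(conn, SASL_MAXOUTBUF, (const void **)&maxbuf) != SASL_OK
        || *maxbuf == 0)
        return -1;            /* no usable buffer limit available */

    while (off < msglen) {
        unsigned chunk = (msglen - off > *maxbuf) ? *maxbuf
                                                  : (unsigned)(msglen - off);

        /* Wrap (encrypt/integrity-protect) one chunk of the message. */
        if (sasl_encode(conn, msg + off, chunk, &pkt, &pktlen) != SASL_OK)
            return -1;

        /* Each packet is a separate write; on a non-blocking socket a
         * partial write here is exactly the case the patch addresses. */
        if (send_packet(pkt, pktlen) != 0)
            return -1;
        off += chunk;
    }
    return 0;
}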

Comment 6 Howard Chu 2010-09-12 01:10:12 UTC
changed notes
changed state Open to Test
moved from Incoming to Software Bugs
Comment 7 bill@ca-zephyr.org 2010-09-12 06:38:38 UTC

--On Thursday, September 09, 2010 10:43:06 PM -0700 Howard Chu <hyc@symas.com> wrote:


Howard,

The patched packages were installed last night on the production OpenLDAP
master, with two of the replicas in the failing state.  Once the patched
slapd was started, the two problem replicas quickly caught up, and everything
looks good now.

Thanks again for your help.

Bill

-- 

Bill MacAllister
Infrastructure Delivery Group, Stanford University

Comment 8 Howard Chu 2010-09-12 08:05:33 UTC
Bill MacAllister wrote:
> Howard,
>
> The patched packages where installed last night on the production OpenLDAP
> master with two of the replicas in the failing state.  Once the patched
> slapd was started the two problem replicas quickly caught up and everything
> looks good now.

Thanks for the confirmation. I'll commit a final fix to HEAD shortly.


-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Comment 9 Quanah Gibson-Mount 2010-12-11 17:47:42 UTC
changed notes
changed state Test to Release
Comment 10 Quanah Gibson-Mount 2011-02-14 12:29:59 UTC
changed notes
changed state Release to Closed
Comment 11 OpenLDAP project 2014-08-01 21:04:30 UTC
fixed in HEAD
fixed in RE24