Full_Name: Bill MacAllister
Version: 2.4.23-1
OS: Debian Lenny
URL: http://www.stanford.edu/~whm/files/ldap-replication-error.txt
Submission from: (NULL) (99.55.172.208)


I just replaced a 2.3.43 OpenLDAP master with a 2.4.23-1 master. There are
6 slaves, all running 2.4.23-1. After reloading the master database, I
confirmed that replication was working by making some changes and then
examining contextCSNs. Sometime later replication stopped on one of the
servers.

Looking at the log on the failing slave I saw:

Sep 2 22:12:53 directory3 slapd[7005]: do_syncrep2: rid=000 LDAP_RES_INTERMEDIATE - SYNC_ID_SET
Sep 2 22:12:53 directory3 slapd[7005]: do_syncrep2: rid=000 (-1) Can't contact LDAP server
Sep 2 22:12:53 directory3 slapd[7005]: do_syncrepl: rid=000 rc -1 retrying

On the master I saw:

Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 fd=38 ACCEPT from IP=171.67.218.156:44519 (IP=0.0.0.0:389)
Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=0 BIND dn="" method=163
Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=0 RESULT tag=97 err=14 text=SASL(0): successful result:
Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=1 BIND dn="" method=163
Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=1 RESULT tag=97 err=14 text=SASL(0): successful result:
Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=2 BIND dn="" method=163
Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=2 BIND authcid="ldap/directory3.stanford.edu@stanford.edu" authzid="ldap/directory3.stanford.edu@stanford.edu"
Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=2 BIND dn="cn=directory3,cn=ldap,cn=operational,dc=stanford,dc=edu" mech=GSSAPI sasl_ssf=56 ssf=56
Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=2 RESULT tag=97 err=0 text=
Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=3 SRCH base="cn=accesslog" scope=2 deref=0 filter="(&(objectClass=auditWriteObject)(reqResult=0))"
Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=3 SRCH attr=reqDN reqType reqMod reqNewRDN reqDeleteOldRDN reqNewSuperior entryCSN
Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 fd=38 closed (connection lost on write)

Running slapd on the failing slave with -d -1 I saw:

=>do_syncrep2 rid=000
ldap_result ld 0x1c07000 msgid 4
wait4msg ld 0x1c07000 msgid 4 (timeout 0 usec)
wait4msg continue ld 0x1c07000 msgid 4 all 0
** ld 0x1c07000 Connections:
* host: directory-master.stanford.edu  port: 389 (default)
  refcnt: 2  status: Connected
  last used: Thu Sep 2 22:12:53 2010

** ld 0x1c07000 Outstanding Requests:
 * msgid 4,  origid 4, status InProgress
   outstanding referrals 0, parent count 0
  ld 0x1c07000 request count 1 (abandoned 0)
** ld 0x1c07000 Response Queue:
   Empty
  ld 0x1c07000 response count 0
ldap_chkResponseList ld 0x1c07000 msgid 4 all 0
ldap_chkResponseList returns ld 0x1c07000 NULL
ldap_int_select
read1msg: ld 0x1c07000 msgid 4 all 0
ber_get_next
sasl_generic_read: want=4, got=4
  0000:  00 00 00 78    ...x
sasl_generic_read: want=120, got=120
... lots of output ...
sasl_generic_read: want=45833, got=0
ldap_read: want=8, got=0
ber_get_next failed.
ldap_err2string
do_syncrep2: rid=000 (-1) Can't contact LDAP server

I am including a pointer to the complete debug log (URL above). It looks
like something in the SASL support is unhappy. We are using
libsasl2-modules-gssapi-heimdal 2.1.23.dfsg1-5~sbp50+2.

It is strange to me that only this one server is having a problem; the
other 5 slaves are working just fine. Here are the current contextCSNs
for the systems.
directory-master  contextCSN: 20100903060040.713677Z#000000#000#000000
directory1        contextCSN: 20100903060040.713677Z#000000#000#000000
directory2        contextCSN: 20100903060040.713677Z#000000#000#000000
directory3        contextCSN: 20100903015531.064407Z#000000#000#000000
directory4        contextCSN: 20100903060040.713677Z#000000#000#000000
directory11       contextCSN: 20100903060040.713677Z#000000#000#000000
directory12       contextCSN: 20100903060040.713677Z#000000#000#000000

What else should I look at?

Bill
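For anyone who wants to script the comparison above, here is a minimal
sketch of reading a server's contextCSN with libldap, the same check used
to spot the lagging directory3 value. The URI, suffix, and anonymous bind
are placeholders, not Stanford's actual configuration, and the sketch
assumes the ACLs allow the operational attribute to be read.

/* contextcsn.c: print the contextCSN of a suffix on one server.
 * Build (roughly): cc contextcsn.c -lldap -llber
 */
#include <stdio.h>
#include <ldap.h>

int main(void)
{
    LDAP *ld;
    LDAPMessage *res, *entry;
    struct berval **vals;
    char *attrs[] = { "contextCSN", NULL };
    int rc, version = LDAP_VERSION3;

    rc = ldap_initialize(&ld, "ldap://directory-master.example.edu");
    if (rc != LDAP_SUCCESS) {
        fprintf(stderr, "ldap_initialize: %s\n", ldap_err2string(rc));
        return 1;
    }
    ldap_set_option(ld, LDAP_OPT_PROTOCOL_VERSION, &version);

    /* contextCSN lives in the suffix entry, so a base-scope search is enough */
    rc = ldap_search_ext_s(ld, "dc=example,dc=edu", LDAP_SCOPE_BASE,
                           "(objectClass=*)", attrs, 0,
                           NULL, NULL, NULL, LDAP_NO_LIMIT, &res);
    if (rc != LDAP_SUCCESS) {
        fprintf(stderr, "search: %s\n", ldap_err2string(rc));
        ldap_unbind_ext_s(ld, NULL, NULL);
        return 1;
    }

    entry = ldap_first_entry(ld, res);
    if (entry != NULL &&
        (vals = ldap_get_values_len(ld, entry, "contextCSN")) != NULL) {
        for (int i = 0; vals[i] != NULL; i++)
            printf("contextCSN: %.*s\n", (int)vals[i]->bv_len, vals[i]->bv_val);
        ldap_value_free_len(vals);
    }

    ldap_msgfree(res);
    ldap_unbind_ext_s(ld, NULL, NULL);
    return 0;
}

Run against each provider and consumer in turn, a contextCSN that stops
advancing (as directory3's did above) identifies the replica whose syncrepl
session has stalled.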
Worked on this a bit more and in the end rebooted the server just to be sure
I had a clean environment. When the system came back up slapd refused to
start; the bdb database was corrupt. I reloaded from the master and
replication is again working. I am not sure why a reboot was required to
uncover the error; I had stopped and started slapd multiple times without
any errors being reported.

Sorry for the inbox pollution.

Bill

--
Bill MacAllister
Infrastructure Delivery Group, Stanford University
--On Friday, September 03, 2010 01:23:17 AM -0700 Bill MacAllister
<whm@stanford.edu> wrote:

The problem with the database was only coincidental. Restoring the database
got the failing replica past the problem replication event.

In the replica pool of 6 servers we have seen the problem on three of the
servers. Thinking about this more, it is unlikely to be a slave problem,
since the slaves have been in use for about 6 weeks and we did not see the
problem until we changed the master to 2.4.23. I have captured a master
debug log of the problem event. It is at
http://www.stanford.edu/~whm/files/master-debug.txt.

Bill

--
Bill MacAllister
Infrastructure Delivery Group, Stanford University
whm@stanford.edu wrote:
> --On Friday, September 03, 2010 01:23:17 AM -0700 Bill MacAllister
> <whm@stanford.edu> wrote:
>
> The problem with the database was only coincidental. Restoring the database
> got the failing replica past the problem replication event.
>
> In the replica pool of 6 servers we have seen the problem on three of the
> servers. Thinking about this more, it is unlikely to be a slave problem,
> since the slaves have been in use for about 6 weeks and we did not see the
> problem until we changed the master to 2.4.23. I have captured a master
> debug log of the problem event. It is at
> http://www.stanford.edu/~whm/files/master-debug.txt.
>
> Bill
>
Please try with this patch:

Index: sasl.c
===================================================================
RCS file: /repo/OpenLDAP/pkg/ldap/libraries/libldap/sasl.c,v
retrieving revision 1.79
diff -u -r1.79 sasl.c
--- sasl.c	13 Apr 2010 20:17:56 -0000	1.79
+++ sasl.c	10 Sep 2010 05:42:22 -0000
@@ -733,8 +733,9 @@
 		return ret;
 	} else if ( p->buf_out.buf_ptr != p->buf_out.buf_end ) {
 		/* partial write? pretend nothing got written */
-		len2 = 0;
 		p->flags |= LDAP_PVT_SASL_PARTIAL_WRITE;
+		sock_errset(EAGAIN);
+		len2 = -1;
 	}
 
 	/* return number of bytes encoded, not written, to ensure

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
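For context on what the one-line change is doing, the following is a
self-contained sketch of the write-path contract the patch enforces. The
struct, field, and function names are illustrative, not the actual libldap
internals (encode_packet stands in for the SASL encode step); the point is
only the return-value semantics when a flush of already-encoded data
completes partially.

#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

struct enc_stream {
    int    fd;              /* non-blocking socket                        */
    char   outbuf[65536];   /* one encoded (e.g. SASL-wrapped) packet     */
    size_t out_len;         /* encoded bytes sitting in outbuf            */
    size_t out_sent;        /* how many of them have reached the socket   */
    size_t pending_len;     /* plaintext length behind a partial write    */
    int    partial;         /* outbuf still holds a half-sent packet      */
};

/* Hypothetical encode step: fills s->outbuf and returns the number of
 * encoded bytes produced for the given plaintext. */
extern size_t encode_packet(struct enc_stream *s, const char *buf, size_t len);

ssize_t enc_write(struct enc_stream *s, const char *buf, size_t len)
{
    ssize_t n;

    if (s->partial) {
        /* A previous call already encoded this plaintext but only part of
         * the encoded packet went out.  Flush the remainder; never encode
         * the same plaintext twice. */
        n = write(s->fd, s->outbuf + s->out_sent, s->out_len - s->out_sent);
        if (n < 0 && errno != EAGAIN && errno != EINTR)
            return -1;                       /* hard error           */
        if (n > 0)
            s->out_sent += (size_t)n;
        if (s->out_sent < s->out_len) {
            errno = EAGAIN;                  /* still not drained    */
            return -1;
        }
        s->partial = 0;
        s->out_len = s->out_sent = 0;
        return (ssize_t)s->pending_len;      /* original data is now consumed */
    }

    s->out_len  = encode_packet(s, buf, len);
    s->out_sent = 0;
    n = write(s->fd, s->outbuf, s->out_len);
    if (n < 0 && errno != EAGAIN && errno != EINTR)
        return -1;
    if (n > 0)
        s->out_sent = (size_t)n;

    if (s->out_sent < s->out_len) {
        /* Partial write of an encoded packet.  The pre-patch code reported
         * this case as 0, which callers could mistake for a failed or
         * closed connection; the fix reports EAGAIN so the caller simply
         * retries later and the leftover packet is flushed first. */
        s->partial = 1;
        s->pending_len = len;
        errno = EAGAIN;
        return -1;
    }

    s->out_len = s->out_sent = 0;
    return (ssize_t)len;                     /* all plaintext consumed */
}

Under the pre-patch behaviour, the ambiguous 0 return on a partial flush is
consistent with the symptoms in the logs: the master records "connection
lost on write" and the consumer sees "Can't contact LDAP server".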
--On Thursday, September 09, 2010 10:43:06 PM -0700 Howard Chu <hyc@symas.com> wrote:

> whm@stanford.edu wrote:
>> --On Friday, September 03, 2010 01:23:17 AM -0700 Bill MacAllister
>> <whm@stanford.edu> wrote:
>>
>> The problem with the database was only coincidental. Restoring the database
>> got the failing replica past the problem replication event.
>>
>> In the replica pool of 6 servers we have seen the problem on three of the
>> servers. Thinking about this more, it is unlikely to be a slave problem,
>> since the slaves have been in use for about 6 weeks and we did not see the
>> problem until we changed the master to 2.4.23. I have captured a master
>> debug log of the problem event. It is at
>> http://www.stanford.edu/~whm/files/master-debug.txt.
>>
>> Bill
>>
> Please try with this patch:
>
> Index: sasl.c
> ===================================================================
> RCS file: /repo/OpenLDAP/pkg/ldap/libraries/libldap/sasl.c,v
> retrieving revision 1.79
> diff -u -r1.79 sasl.c
> --- sasl.c	13 Apr 2010 20:17:56 -0000	1.79
> +++ sasl.c	10 Sep 2010 05:42:22 -0000
> @@ -733,8 +733,9 @@
>  		return ret;
>  	} else if ( p->buf_out.buf_ptr != p->buf_out.buf_end ) {
>  		/* partial write? pretend nothing got written */
> -		len2 = 0;
>  		p->flags |= LDAP_PVT_SASL_PARTIAL_WRITE;
> +		sock_errset(EAGAIN);
> +		len2 = -1;
>  	}
> 
>  	/* return number of bytes encoded, not written, to ensure

I have applied the patch and the Debian packages built fine. I have
installed the new packages on the master servers in our dev and test
environments. Initial tests show that basic replication is working. Of
course, this problem did not exhibit itself in our test environments. I
will install it in our production environment Friday evening and let you
know what happens.

Am I correct in assuming that the root of the problem is on the providers?
I am asking because I was not planning on installing the new version on the
consumers right away, since 'emergency' changes to production software
require special approval. I would like to upgrade the consumers in our next
maintenance window, which is next Thursday. If you think I should do it
with the update of the master, let me know.

Bill

--
Bill MacAllister
Infrastructure Delivery Group, Stanford University
whm@stanford.edu wrote:
> --On Thursday, September 09, 2010 10:43:06 PM -0700 Howard Chu <hyc@symas.com> wrote:
>
>> whm@stanford.edu wrote:
>>> --On Friday, September 03, 2010 01:23:17 AM -0700 Bill MacAllister
>>> <whm@stanford.edu> wrote:
>>>
>>> The problem with the database was only coincidental. Restoring the database
>>> got the failing replica past the problem replication event.
>>>
>>> In the replica pool of 6 servers we have seen the problem on three of the
>>> servers. Thinking about this more, it is unlikely to be a slave problem,
>>> since the slaves have been in use for about 6 weeks and we did not see the
>>> problem until we changed the master to 2.4.23. I have captured a master
>>> debug log of the problem event. It is at
>>> http://www.stanford.edu/~whm/files/master-debug.txt.
>>>
>>> Bill
>>>
>> Please try with this patch:
>>
>> Index: sasl.c
>> ===================================================================
>> RCS file: /repo/OpenLDAP/pkg/ldap/libraries/libldap/sasl.c,v
>> retrieving revision 1.79
>> diff -u -r1.79 sasl.c
>> --- sasl.c	13 Apr 2010 20:17:56 -0000	1.79
>> +++ sasl.c	10 Sep 2010 05:42:22 -0000
>> @@ -733,8 +733,9 @@
>>  		return ret;
>>  	} else if ( p->buf_out.buf_ptr != p->buf_out.buf_end ) {
>>  		/* partial write? pretend nothing got written */
>> -		len2 = 0;
>>  		p->flags |= LDAP_PVT_SASL_PARTIAL_WRITE;
>> +		sock_errset(EAGAIN);
>> +		len2 = -1;
>>  	}
>> 
>>  	/* return number of bytes encoded, not written, to ensure
>
> I have applied the patch and the Debian packages built fine. I have
> installed the new packages on the master servers in our dev and test
> environments. Initial tests show that basic replication is working. Of
> course, this problem did not exhibit itself in our test environments. I
> will install it in our production environment Friday evening and let you
> know what happens.
>
> Am I correct in assuming that the root of the problem is on the providers?

Yes. Really, the problem will occur whenever a client or a server writes a
large buffer through a SASL-encrypted connection, but this is more likely
to happen when a server is sending responses.

> I am asking because I was not planning on installing the new version on the
> consumers right away, since 'emergency' changes to production software
> require special approval. I would like to upgrade the consumers in our next
> maintenance window, which is next Thursday. If you think I should do it
> with the update of the master, let me know.

It's probably not urgent for the consumers; just bear in mind that any
clients querying them through SASL could also run into similar issues.

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
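The point about large buffers can be made concrete with a small caller-side
sketch: when a big response (for example a syncrepl refresh) is pushed
through a non-blocking, SASL-wrapped socket, the kernel buffer fills and
partial writes become routine, so the caller must treat -1 with EAGAIN as
"wait and retry", not as a dead connection. This continues the hypothetical
enc_stream/enc_write sketch from earlier in the thread and is not the actual
slapd connection-handling code.

#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>

struct enc_stream;                           /* defined in the earlier sketch */
ssize_t enc_write(struct enc_stream *s, const char *buf, size_t len);

/* Push an arbitrarily large buffer through the encrypting stream, waiting
 * for the socket to drain whenever a partial write leaves encoded data
 * queued.  `fd` is the same socket the stream writes to. */
int send_all(int fd, struct enc_stream *s, const char *buf, size_t len)
{
    size_t off = 0;

    while (off < len) {
        ssize_t n = enc_write(s, buf + off, len - off);

        if (n > 0) {
            off += (size_t)n;                /* that much plaintext is done */
            continue;
        }
        if (n < 0 && errno == EAGAIN) {
            /* Encoded data is still queued inside the stream.  Wait until
             * the socket is writable, then call enc_write() again with the
             * same arguments so the leftover packet is flushed first. */
            struct pollfd pfd;
            pfd.fd = fd;
            pfd.events = POLLOUT;
            pfd.revents = 0;
            if (poll(&pfd, 1, -1) < 0 && errno != EINTR)
                return -1;
            continue;
        }
        return -1;                           /* real error: give up */
    }
    return 0;
}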
changed notes
changed state Open to Test
moved from Incoming to Software Bugs
--On Thursday, September 09, 2010 10:43:06 PM -0700 Howard Chu <hyc@symas.com> wrote:

> whm@stanford.edu wrote:
>> --On Friday, September 03, 2010 01:23:17 AM -0700 Bill MacAllister
>> <whm@stanford.edu> wrote:
>>
>> The problem with the database was only coincidental. Restoring the database
>> got the failing replica past the problem replication event.
>>
>> In the replica pool of 6 servers we have seen the problem on three of the
>> servers. Thinking about this more, it is unlikely to be a slave problem,
>> since the slaves have been in use for about 6 weeks and we did not see the
>> problem until we changed the master to 2.4.23. I have captured a master
>> debug log of the problem event. It is at
>> http://www.stanford.edu/~whm/files/master-debug.txt.
>>
>> Bill
>>
> Please try with this patch:
>
> Index: sasl.c
> ===================================================================
> RCS file: /repo/OpenLDAP/pkg/ldap/libraries/libldap/sasl.c,v
> retrieving revision 1.79
> diff -u -r1.79 sasl.c
> --- sasl.c	13 Apr 2010 20:17:56 -0000	1.79
> +++ sasl.c	10 Sep 2010 05:42:22 -0000
> @@ -733,8 +733,9 @@
>  		return ret;
>  	} else if ( p->buf_out.buf_ptr != p->buf_out.buf_end ) {
>  		/* partial write? pretend nothing got written */
> -		len2 = 0;
>  		p->flags |= LDAP_PVT_SASL_PARTIAL_WRITE;
> +		sock_errset(EAGAIN);
> +		len2 = -1;
>  	}
> 
>  	/* return number of bytes encoded, not written, to ensure

Howard,

The patched packages were installed last night on the production OpenLDAP
master with two of the replicas in the failing state. Once the patched
slapd was started, the two problem replicas quickly caught up and everything
looks good now.

Thanks again for your help.

Bill

--
Bill MacAllister
Infrastructure Delivery Group, Stanford University
Bill MacAllister wrote:
> Howard,
>
> The patched packages were installed last night on the production OpenLDAP
> master with two of the replicas in the failing state. Once the patched
> slapd was started, the two problem replicas quickly caught up and everything
> looks good now.

Thanks for the confirmation. I'll commit a final fix to HEAD shortly.

> Thanks again for your help.
>
> Bill

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
changed notes
changed state Test to Release
changed notes
changed state Release to Closed
fixed in HEAD
fixed in RE24