Full_Name: Toby Blake Version: 2.3.38 amd 2.3.40 OS: Fedora Core 5 and Fedora Core 6 URL: Submission from: (NULL) (129.215.218.33) Hi there, We're seeing a (seemingly) intermittent problem when using back-ldap (with or without using the pcache overlay for caching), where 1 in x queries fail with "result: 80 Internal (implementation specific) error", where x is anywhere between 2 and 10. We're seeing this with both openldap-2.3.38 and openldap-2.3.40 - running on Fedora Core 5 and 6. The remote server is running FC5, openldap 2.3.38. I currently have a machine exhibiting this problem - it's a machine in a student lab. The database part of slapd.conf on this machine is: database ldap suffix dc=inf,dc=ed,dc=ac,dc=uk rootdn uid=ldaprep/inganoust.inf.ed.ac.uk,cn=inf.ed.ac.uk,cn=gssapi,cn=auth uri ldap://testdir.inf.ed.ac.uk/ idassert-bind mode=none bindmethod=sasl saslmech=GSSAPI idassert-authzFrom "dn:*" I'm afraid I can't currently reproduce this error, but hopefully some of the information below will help... I've done some stepping through the code and what seems to be happening when the error occurs is that the call to getpeername in ldap_host_connected_to in libraries/libldap/os-ip.c:590 fails with errno=107 "transport endpoint is not connected" and thereafter ldap_int_sasl_open (cyrus.c:518) is called with NULL as the 'host' argument (in our case it should be 'hadrian.inf.ed.ac.uk') - this results in LDAP_LOCAL_ERROR being returned. The backtrace (in ldap_host_connected_to) is: (gdb) bt #0 ldap_host_connected_to (sb=0xb4124950, host=0x81b0efb "localhost") at os-ip.c:586 #1 0x081394b5 in ldap_int_sasl_bind (ld=0xb4131058, dn=0x0, mechs=<value optimized out>, sctrls=0x0, cctrls=0x0, flags=2, interact=0x81293b0 <lutil_sasl_interact>, defaults=0x892df98) at cyrus.c:643 #2 0x0813b45d in ldap_sasl_interactive_bind_s (ld=0xb4131058, dn=0x0, mechs=0x88bef98 "GSSAPI", serverControls=0x0, clientControls=0x0, flags=2, interact=0x81293b0 <lutil_sasl_interact>, defaults=0x892df98) at sasl.c:479 #3 0x08109625 in ldap_back_dobind_int (lcp=0xb370e0a8, op=0x8952900, rs=0xb370f1c4, sendok=<value optimized out>, retries=0, dolock=1) at bind.c:1997 #4 0x080dab4b in ldap_back_search (op=0x8952900, rs=0xb370f1c4) at search.c:166 #5 0x0806317f in fe_op_search (op=0x8952900, rs=0xb370f1c4) at search.c:355 #6 0x08063b31 in do_search (op=0x8952900, rs=0xb370f1c4) at search.c:217 #7 0x08061209 in connection_operation (ctx=0xb370f238, arg_v=0x8952900) at connection.c:1133 #8 0x081322f3 in ldap_int_thread_pool_wrapper (xpool=0x88a1d18) at tpool.c:478 #9 0x00db145b in start_thread () from /lib/libpthread.so.0 #10 0x001cf23e in clone () from /lib/libc.so.6 (gdb) When getpeername returns the error, it is always for the same 'sd' value - on the machine I'm looking at, it's 13, e.g. 590 if ( getpeername( sd, sa, &len ) == -1 ) { (gdb) p sd $3 = 13 (gdb) p errno $4 = 107 (gdb) What's interesting here is that if I use lsof to see the filehandles that slapd is holding open, I see this for FD 13: COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME slapd 2551 ldap 13u sock 0,5 3603708 can't identify protocol Interestingly, this is the exact same output I see from lsof if a socket() has been created, but not connect()ed. It appears then, that a socket of this type gets into the pool of connections to the remote server being used by the local slapd, thus explaining the 1 in x failure rate - as it cycles through the pool. I currently have a machine in this state, so if there's more information I could usefully provide, then let me know. I can of course change the loglevel, but that would involve restarting slapd on this machine - which fixes the problem. Anyway, let me know if there's more I can do - I can experiment on other machines. Cheers Toby Blake School of Informatics University of Edinburgh
toby@inf.ed.ac.uk wrote: > (gdb) bt > #0 ldap_host_connected_to (sb=0xb4124950, host=0x81b0efb "localhost") > at os-ip.c:586 > #1 0x081394b5 in ldap_int_sasl_bind (ld=0xb4131058, dn=0x0, > mechs=<value optimized out>, sctrls=0x0, cctrls=0x0, flags=2, > interact=0x81293b0 <lutil_sasl_interact>, defaults=0x892df98) > at cyrus.c:643 > #2 0x0813b45d in ldap_sasl_interactive_bind_s (ld=0xb4131058, dn=0x0, > mechs=0x88bef98 "GSSAPI", serverControls=0x0, clientControls=0x0, flags=2, > interact=0x81293b0 <lutil_sasl_interact>, defaults=0x892df98) at sasl.c:479 > #3 0x08109625 in ldap_back_dobind_int (lcp=0xb370e0a8, op=0x8952900, > rs=0xb370f1c4, sendok=<value optimized out>, retries=0, dolock=1) > at bind.c:1997 > #4 0x080dab4b in ldap_back_search (op=0x8952900, rs=0xb370f1c4) > at search.c:166 I think I see the issue: either ldap_back_search() or ldap_back_dobind_int() should invalidate that handle. I'll see if I can fix it. p. Ing. Pierangelo Masarati OpenLDAP Core Team SysNet s.r.l. via Dossi, 8 - 27100 Pavia - ITALIA http://www.sys-net.it --------------------------------------- Office: +39 02 23998309 Mobile: +39 333 4963172 Email: pierangelo.masarati@sys-net.it ---------------------------------------
toby@inf.ed.ac.uk wrote: > The backtrace (in ldap_host_connected_to) is: > > (gdb) bt > #0 ldap_host_connected_to (sb=0xb4124950, host=0x81b0efb "localhost") > at os-ip.c:586 > #1 0x081394b5 in ldap_int_sasl_bind (ld=0xb4131058, dn=0x0, > mechs=<value optimized out>, sctrls=0x0, cctrls=0x0, flags=2, > interact=0x81293b0 <lutil_sasl_interact>, defaults=0x892df98) > at cyrus.c:643 > #2 0x0813b45d in ldap_sasl_interactive_bind_s (ld=0xb4131058, dn=0x0, > mechs=0x88bef98 "GSSAPI", serverControls=0x0, clientControls=0x0, flags=2, > interact=0x81293b0 <lutil_sasl_interact>, defaults=0x892df98) at sasl.c:479 > #3 0x08109625 in ldap_back_dobind_int (lcp=0xb370e0a8, op=0x8952900, > rs=0xb370f1c4, sendok=<value optimized out>, retries=0, dolock=1) > at bind.c:1997 > #4 0x080dab4b in ldap_back_search (op=0x8952900, rs=0xb370f1c4) > at search.c:166 Actually, I was too quick: the above stack backtrace looks inconsistent: there is a call to ldap_sasl_interactive_bind_s at bind.c:1997, but it's not inside ldap_back_dobind_int(); it's rather in ldap_back_proxy_authz_bind(), which might be called by ldap_back_dobind_int(). It might be useful if you try to see if the issue can be reproduced with a non-optimized build, so that we can see the actual behavior of the code from the backtrace, and see some arg values that are now missing. What could actually be happening is that although the authentication fails, the fact that LDAP_OTHER is returned is not considered a condition that requires to invalidate the connection handler, while probably it should. I'll provide a fix in this sense. p. Ing. Pierangelo Masarati OpenLDAP Core Team SysNet s.r.l. via Dossi, 8 - 27100 Pavia - ITALIA http://www.sys-net.it --------------------------------------- Office: +39 02 23998309 Mobile: +39 333 4963172 Email: pierangelo.masarati@sys-net.it ---------------------------------------
changed notes changed state Open to Test moved from Incoming to Software Bugs
Please test and report; servers/slapd/back-ldap/bind.c new revision: 1.244; previous revision: 1.243 The patch applies straightforwardly to re23 only if the first chunk is omitted. p. Ing. Pierangelo Masarati OpenLDAP Core Team SysNet s.r.l. via Dossi, 8 - 27100 Pavia - ITALIA http://www.sys-net.it --------------------------------------- Office: +39 02 23998309 Mobile: +39 333 4963172 Email: pierangelo.masarati@sys-net.it ---------------------------------------
On 8 Mar 2008, at 14:45, Pierangelo Masarati wrote: > Please test and report; > > servers/slapd/back-ldap/bind.c > new revision: 1.244; previous revision: 1.243 > > The patch applies straightforwardly to re23 only if the first chunk is > omitted. OK, thanks. I'll patch and put into testing. Cheers Toby
> Please test and report; > > servers/slapd/back-ldap/bind.c > new revision: 1.244; previous revision: 1.243 > > The patch applies straightforwardly to re23 only if the first chunk is > omitted. Been running for 3+ weeks now on various test machines with no recurrence of error, so I'd consider this resolved. Many thanks Toby
toby@inf.ed.ac.uk wrote: >> Please test and report; >> >> servers/slapd/back-ldap/bind.c >> new revision: 1.244; previous revision: 1.243 >> >> The patch applies straightforwardly to re23 only if the first chunk is >> omitted. > > Been running for 3+ weeks now on various test machines with no > recurrence of error, so I'd consider this resolved. > > Many thanks > Toby > > > Thanks for the followup. Gavin. -- Kind Regards, Gavin Henry. OpenLDAP Engineering Team. E ghenry@OpenLDAP.org Community developed LDAP software. http://www.openldap.org/project/
changed notes changed state Test to Release
changed notes
changed notes changed state Release to Closed
fixed in HEAD fixed in RE24 fixed in RE23