Issue 5404 - back-ldap fails intermittently with 80 Internal (implementation specific) error
Summary: back-ldap fails intermittently with 80 Internal (implementation specific) error
Status: VERIFIED FIXED
Alias: None
Product: OpenLDAP
Classification: Unclassified
Component: slapd (show other issues)
Version: unspecified
Hardware: All All
: --- normal
Target Milestone: ---
Assignee: OpenLDAP project
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-03-07 15:56 UTC by toby@inf.ed.ac.uk
Modified: 2014-08-01 21:04 UTC (History)
0 users

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this issue.
Description toby@inf.ed.ac.uk 2008-03-07 15:56:43 UTC
Full_Name: Toby Blake
Version: 2.3.38 amd 2.3.40
OS: Fedora Core 5 and Fedora Core 6
URL: 
Submission from: (NULL) (129.215.218.33)


Hi there,

We're seeing a (seemingly) intermittent problem when using back-ldap
(with or without using the pcache overlay for caching), where 1 in x
queries fail with "result: 80 Internal (implementation specific)
error", where x is anywhere between 2 and 10.  We're seeing this with
both openldap-2.3.38 and openldap-2.3.40 - running on Fedora Core 5
and 6.  The remote server is running FC5, openldap 2.3.38.  I
currently have a machine exhibiting this problem - it's a machine in a
student lab.

The database part of slapd.conf on this machine is:

database                ldap
suffix                  dc=inf,dc=ed,dc=ac,dc=uk
rootdn                 
uid=ldaprep/inganoust.inf.ed.ac.uk,cn=inf.ed.ac.uk,cn=gssapi,cn=auth
uri                     ldap://testdir.inf.ed.ac.uk/
idassert-bind           mode=none
                        bindmethod=sasl
                        saslmech=GSSAPI
idassert-authzFrom      "dn:*"

I'm afraid I can't currently reproduce this error, but hopefully some
of the information below will help...

I've done some stepping through the code and what seems to be
happening when the error occurs is that the call to getpeername in
ldap_host_connected_to in libraries/libldap/os-ip.c:590 fails with
errno=107 "transport endpoint is not connected" and thereafter
ldap_int_sasl_open (cyrus.c:518) is called with NULL as the 'host'
argument (in our case it should be 'hadrian.inf.ed.ac.uk') - this
results in LDAP_LOCAL_ERROR being returned.

The backtrace (in ldap_host_connected_to) is:

(gdb) bt
#0  ldap_host_connected_to (sb=0xb4124950, host=0x81b0efb "localhost")
    at os-ip.c:586
#1  0x081394b5 in ldap_int_sasl_bind (ld=0xb4131058, dn=0x0, 
    mechs=<value optimized out>, sctrls=0x0, cctrls=0x0, flags=2, 
    interact=0x81293b0 <lutil_sasl_interact>, defaults=0x892df98)
    at cyrus.c:643
#2  0x0813b45d in ldap_sasl_interactive_bind_s (ld=0xb4131058, dn=0x0, 
    mechs=0x88bef98 "GSSAPI", serverControls=0x0, clientControls=0x0, flags=2, 
    interact=0x81293b0 <lutil_sasl_interact>, defaults=0x892df98) at sasl.c:479
#3  0x08109625 in ldap_back_dobind_int (lcp=0xb370e0a8, op=0x8952900, 
    rs=0xb370f1c4, sendok=<value optimized out>, retries=0, dolock=1)
    at bind.c:1997
#4  0x080dab4b in ldap_back_search (op=0x8952900, rs=0xb370f1c4)
    at search.c:166
#5  0x0806317f in fe_op_search (op=0x8952900, rs=0xb370f1c4) at search.c:355
#6  0x08063b31 in do_search (op=0x8952900, rs=0xb370f1c4) at search.c:217
#7  0x08061209 in connection_operation (ctx=0xb370f238, arg_v=0x8952900)
    at connection.c:1133
#8  0x081322f3 in ldap_int_thread_pool_wrapper (xpool=0x88a1d18) at tpool.c:478
#9  0x00db145b in start_thread () from /lib/libpthread.so.0
#10 0x001cf23e in clone () from /lib/libc.so.6
(gdb)

When getpeername returns the error, it is always for the same 'sd'
value - on the machine I'm looking at, it's 13, e.g.

590             if ( getpeername( sd, sa, &len ) == -1 ) {
(gdb) p sd
$3 = 13
(gdb) p errno
$4 = 107
(gdb) 

What's interesting here is that if I use lsof to see the filehandles
that slapd is holding open, I see this for FD 13:

COMMAND  PID USER   FD   TYPE     DEVICE    SIZE    NODE    NAME
slapd   2551 ldap   13u  sock     0,5		  3603708   can't identify protocol

Interestingly, this is the exact same output I see from lsof if a
socket() has been created, but not connect()ed.

It appears then, that a socket of this type gets into the pool of
connections to the remote server being used by the local slapd, thus
explaining the 1 in x failure rate - as it cycles through the pool.

I currently have a machine in this state, so if there's more
information I could usefully provide, then let me know.  I can of
course change the loglevel, but that would involve restarting slapd on
this machine - which fixes the problem.  Anyway, let me know if
there's more I can do - I can experiment on other machines.


Cheers
Toby Blake
School of Informatics
University of Edinburgh

Comment 1 ando@openldap.org 2008-03-08 13:48:30 UTC
toby@inf.ed.ac.uk wrote:

> (gdb) bt
> #0  ldap_host_connected_to (sb=0xb4124950, host=0x81b0efb "localhost")
>     at os-ip.c:586
> #1  0x081394b5 in ldap_int_sasl_bind (ld=0xb4131058, dn=0x0, 
>     mechs=<value optimized out>, sctrls=0x0, cctrls=0x0, flags=2, 
>     interact=0x81293b0 <lutil_sasl_interact>, defaults=0x892df98)
>     at cyrus.c:643
> #2  0x0813b45d in ldap_sasl_interactive_bind_s (ld=0xb4131058, dn=0x0, 
>     mechs=0x88bef98 "GSSAPI", serverControls=0x0, clientControls=0x0, flags=2, 
>     interact=0x81293b0 <lutil_sasl_interact>, defaults=0x892df98) at sasl.c:479
> #3  0x08109625 in ldap_back_dobind_int (lcp=0xb370e0a8, op=0x8952900, 
>     rs=0xb370f1c4, sendok=<value optimized out>, retries=0, dolock=1)
>     at bind.c:1997
> #4  0x080dab4b in ldap_back_search (op=0x8952900, rs=0xb370f1c4)
>     at search.c:166

I think I see the issue: either ldap_back_search() or
ldap_back_dobind_int() should invalidate that handle.  I'll see if I can
fix it.

p.



Ing. Pierangelo Masarati
OpenLDAP Core Team

SysNet s.r.l.
via Dossi, 8 - 27100 Pavia - ITALIA
http://www.sys-net.it
---------------------------------------
Office:  +39 02 23998309
Mobile:  +39 333 4963172
Email:   pierangelo.masarati@sys-net.it
---------------------------------------


Comment 2 ando@openldap.org 2008-03-08 14:12:45 UTC
toby@inf.ed.ac.uk wrote:

> The backtrace (in ldap_host_connected_to) is:
> 
> (gdb) bt
> #0  ldap_host_connected_to (sb=0xb4124950, host=0x81b0efb "localhost")
>     at os-ip.c:586
> #1  0x081394b5 in ldap_int_sasl_bind (ld=0xb4131058, dn=0x0, 
>     mechs=<value optimized out>, sctrls=0x0, cctrls=0x0, flags=2, 
>     interact=0x81293b0 <lutil_sasl_interact>, defaults=0x892df98)
>     at cyrus.c:643
> #2  0x0813b45d in ldap_sasl_interactive_bind_s (ld=0xb4131058, dn=0x0, 
>     mechs=0x88bef98 "GSSAPI", serverControls=0x0, clientControls=0x0, flags=2, 
>     interact=0x81293b0 <lutil_sasl_interact>, defaults=0x892df98) at sasl.c:479
> #3  0x08109625 in ldap_back_dobind_int (lcp=0xb370e0a8, op=0x8952900, 
>     rs=0xb370f1c4, sendok=<value optimized out>, retries=0, dolock=1)
>     at bind.c:1997
> #4  0x080dab4b in ldap_back_search (op=0x8952900, rs=0xb370f1c4)
>     at search.c:166

Actually, I was too quick: the above stack backtrace looks inconsistent:
there is a call to ldap_sasl_interactive_bind_s at bind.c:1997, but it's
not inside ldap_back_dobind_int(); it's rather in
ldap_back_proxy_authz_bind(), which might be called by
ldap_back_dobind_int().  It might be useful if you try to see if the
issue can be reproduced with a non-optimized build, so that we can see
the actual behavior of the code from the backtrace, and see some arg
values that are now missing.

What could actually be happening is that although the authentication
fails, the fact that LDAP_OTHER is returned is not considered a
condition that requires to invalidate the connection handler, while
probably it should.  I'll provide a fix in this sense.

p.



Ing. Pierangelo Masarati
OpenLDAP Core Team

SysNet s.r.l.
via Dossi, 8 - 27100 Pavia - ITALIA
http://www.sys-net.it
---------------------------------------
Office:  +39 02 23998309
Mobile:  +39 333 4963172
Email:   pierangelo.masarati@sys-net.it
---------------------------------------


Comment 3 ando@openldap.org 2008-03-08 14:44:19 UTC
changed notes
changed state Open to Test
moved from Incoming to Software Bugs
Comment 4 ando@openldap.org 2008-03-08 14:45:03 UTC
Please test and report;

servers/slapd/back-ldap/bind.c
new revision: 1.244; previous revision: 1.243

The patch applies straightforwardly to re23 only if the first chunk is
omitted.

p.



Ing. Pierangelo Masarati
OpenLDAP Core Team

SysNet s.r.l.
via Dossi, 8 - 27100 Pavia - ITALIA
http://www.sys-net.it
---------------------------------------
Office:  +39 02 23998309
Mobile:  +39 333 4963172
Email:   pierangelo.masarati@sys-net.it
---------------------------------------


Comment 5 toby@inf.ed.ac.uk 2008-03-10 09:29:04 UTC
On 8 Mar 2008, at 14:45, Pierangelo Masarati wrote:

> Please test and report;
>
> servers/slapd/back-ldap/bind.c
> new revision: 1.244; previous revision: 1.243
>
> The patch applies straightforwardly to re23 only if the first chunk is
> omitted.

OK, thanks.  I'll patch and put into testing.

Cheers
Toby

Comment 6 toby@inf.ed.ac.uk 2008-03-31 12:58:31 UTC
> Please test and report;
>
> servers/slapd/back-ldap/bind.c
> new revision: 1.244; previous revision: 1.243
>
> The patch applies straightforwardly to re23 only if the first chunk is
> omitted.

Been running for 3+ weeks now on various test machines with no
recurrence of error, so I'd consider this resolved.

Many thanks
Toby


Comment 7 Gavin Henry 2008-04-06 20:11:22 UTC
toby@inf.ed.ac.uk wrote:
>> Please test and report;
>>
>> servers/slapd/back-ldap/bind.c
>> new revision: 1.244; previous revision: 1.243
>>
>> The patch applies straightforwardly to re23 only if the first chunk is
>> omitted.
> 
> Been running for 3+ weeks now on various test machines with no
> recurrence of error, so I'd consider this resolved.
> 
> Many thanks
> Toby
> 
> 
> 

Thanks for the followup.

Gavin.

-- 
Kind Regards,

Gavin Henry.
OpenLDAP Engineering Team.

E ghenry@OpenLDAP.org

Community developed LDAP software.

http://www.openldap.org/project/

Comment 8 Quanah Gibson-Mount 2008-05-05 23:57:08 UTC
changed notes
changed state Test to Release
Comment 9 Quanah Gibson-Mount 2008-05-13 03:17:01 UTC
changed notes
Comment 10 Quanah Gibson-Mount 2008-05-28 16:39:54 UTC
changed notes
changed state Release to Closed
Comment 11 OpenLDAP project 2014-08-01 21:04:13 UTC
fixed in HEAD
fixed in RE24
fixed in RE23