send_search_entry aborts b/c ber_flush fails with errno=0 (ITS#1891)



Full_Name: Gareth Bestor
Version: 2.0.14 (also 2.0.23?)
OS: Linux
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (129.42.208.144)


Using openldap 2.0.14 as part of Globus Grid Toolkit (www.globus.org).
When running slapd queries that span more than a couple of machines, we observe
an intermittent "Can't contact LDAP server" failure halfway through receiving
the data, and the query is aborted. I traced the problem to sb_write failing
with errno=0 (I still need to find out why). This error scenario causes
ber_flush to fail, e.g.

	Jun 14 16:30:16 pygar slapd[18648]: ber_flush failed errno=0 reason="Success"

which in turn causes send_search_entry to fail w/

	May 23 13:09:03 c279lx01 slapd[20424]: send_ldap_response: ber write failed

which aborts the LDAP query.

A fix/workaround that I tried is to change in ber_flush(),

                if ( err != EWOULDBLOCK && err != EAGAIN ) {

to

                if ( err != EWOULDBLOCK && err != EAGAIN && err != LDAP_SUCCESS ) {

that is, if ber_int_sb_write() fails but leaves errno=0, then retry sending
rather than abort. I tested the fix and it seems to work. I looked at the 2.0.23
source, which also has the former condition in ber_flush, so the fix may be
generally applicable.
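
To make the retry decision concrete, here is a minimal standalone sketch of the
idea. It is not OpenLDAP code: write_fn, fake_sink, fake_write, is_retryable,
flush_all and max_retries are all hypothetical names invented for illustration.
LDAP_SUCCESS is 0, so the sketch compares against 0 directly. The retry cap is
an assumption added here so that a write which keeps failing with errno=0
cannot spin forever; the workaround above has no such cap.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the low-level write (NOT ber_int_sb_write). */
typedef long (*write_fn)(void *ctx, const char *buf, size_t len);

/* Fake sink used to exercise the loop: fails a few times, then accepts data. */
struct fake_sink {
    int    failures_left; /* number of initial calls that fail */
    int    fail_errno;    /* errno value left behind by those failures */
    size_t chunk;         /* max bytes accepted per successful call */
    size_t total;         /* bytes accepted so far */
};

static long fake_write(void *ctx, const char *buf, size_t len)
{
    struct fake_sink *s = ctx;
    (void)buf;
    if (s->failures_left > 0) {
        s->failures_left--;
        errno = s->fail_errno;
        return -1;
    }
    size_t n = len < s->chunk ? len : s->chunk;
    s->total += n;
    return (long)n;
}

/* Mirrors the modified condition: errno 0 is treated like EAGAIN. */
static int is_retryable(int err)
{
    return err == EWOULDBLOCK || err == EAGAIN || err == 0;
}

/* Write all len bytes. Returns 0 on success, -1 on a hard error or when
 * max_retries consecutive retryable failures have been seen. */
static int flush_all(write_fn w, void *ctx, const char *buf, size_t len,
                     int max_retries)
{
    int retries = 0;

    assert(w != NULL && (buf != NULL || len == 0));
    while (len > 0) {
        errno = 0;
        long rc = w(ctx, buf, len);
        if (rc <= 0) {
            int err = errno;
            if (!is_retryable(err) || retries++ >= max_retries)
                return -1;      /* hard error, or gave up retrying */
            continue;           /* soft error: try the same bytes again */
        }
        buf += rc;
        len -= (size_t)rc;
        retries = 0;            /* progress was made, reset the counter */
    }
    return 0;
}
```

A real server would normally wait for the socket to become writable before
retrying rather than looping immediately; that part is left out of the sketch.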

A few other minor things you might want to consider:

In slapd/result.c/send_search_entry()
        Debug( LDAP_DEBUG_ANY, "send_ldap_response: ber write failed\n",0,0,0)
should be
        Debug( LDAP_DEBUG_ANY, "send_search_entry: ber write failed\n",0,0,0)

This error is misleading because there is an identical error message reported
in the *real* send_ldap_response().



In ber_flush, the following loop attempts to send data even when there is no
data to send, e.g. if towrite=0 (such as when ber_rwptr=NULL)

        do {
                rc = ber_int_sb_write( sb, ber->ber_rwptr, towrite );
                if (rc<=0) {
                        return -1;
                }
                towrite -= rc;
                nwritten += rc;
                ber->ber_rwptr += rc;
        } while ( towrite > 0 );

This might result in a misleading error condition being reported b/c 
ber_int_sb_write will return <= 0. If there is no data to send then perhaps the
following would be better instead

        while ( towrite > 0 ) {
                rc = ber_int_sb_write( sb, ber->ber_rwptr, towrite );
                if (rc<=0) {
                        return -1;
                }
                towrite -= rc;
                nwritten += rc;
                ber->ber_rwptr += rc;
        }

i.e. test for data to send BEFORE sending it rather than after. If ber_flush
somehow gets called with no data left to send, the ber_int_sb_write()<=0 'hard
error' could get propagated up, causing a successfully
completed query to abort right at the end.
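
The difference between the two loop shapes can be checked in isolation with a
small sketch. Again this is not OpenLDAP code: write_fn, echo_write,
flush_post_test and flush_pre_test are hypothetical names, and the stand-in
writer simply assumes that a request to write 0 bytes returns 0.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the low-level write (NOT ber_int_sb_write). */
typedef long (*write_fn)(const char *buf, size_t len);

/* Accepts everything it is given; a zero-length request returns 0. */
static long echo_write(const char *buf, size_t len)
{
    (void)buf;
    return (long)len;
}

/* do/while shape: the write is attempted before towrite is checked. */
static int flush_post_test(write_fn w, const char *p, size_t towrite)
{
    assert(w != NULL);
    do {
        long rc = w(p, towrite);
        if (rc <= 0)
            return -1;          /* also hit when towrite was already 0 */
        towrite -= (size_t)rc;
        p += rc;
    } while (towrite > 0);
    return 0;
}

/* while shape: nothing is written when there is nothing left to send. */
static int flush_pre_test(write_fn w, const char *p, size_t towrite)
{
    assert(w != NULL);
    while (towrite > 0) {
        long rc = w(p, towrite);
        if (rc <= 0)
            return -1;
        towrite -= (size_t)rc;
        p += rc;
    }
    return 0;
}
```

With an empty buffer, flush_post_test reports a failure while flush_pre_test
reports success; with a non-empty buffer both behave the same.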



I'm currently running 2.0.14 but noticed the same issues in 2.0.23 source.