3855 – recovery from EBADF error for slapd

Issue 3855 - recovery from EBADF error for slapd

Summary: recovery from EBADF error for slapd

Status:	VERIFIED FIXED

Alias:	None

Product:	OpenLDAP
Classification:	Unclassified
Component:	slapd (show other issues)
Version:	unspecified
Hardware:	All All

Importance:	--- normal
Target Milestone:	---
Assignee:	OpenLDAP project

URL:
Keywords:

Depends on:
Blocks:

Reported:	2005-07-11 20:53 UTC by jtownsend@opendarwin.org
Modified:	2014-08-01 21:05 UTC (History)
CC List:	0 users

See Also:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this issue.

Description jtownsend@opendarwin.org 2005-07-11 20:53:08 UTC

Full_Name: Jason Townsend
Version: 2.2.19, HEAD
OS: Mac OS X 10.4.1
URL: http://www.opendarwin.org/~jtownsend/patches/ebadf/servers-slapd.patch
Submission from: (NULL) (17.221.43.142)


If slapd encounters an EBADF error in its main select loop in
servers/slapd/daemon.c, it simply ignores it and attempts the select with the
same set of file descriptors as before. It appears the intent was to quit slapd
after 16 of these errors occur, but in my experience it would cause slapd to go
into a 100% CPU usage spin since select returns immediately in the EBADF case
rather than blocking, and somehow 16 error limit was not triggered.

I've made a change which allows slapd to find which file descriptor is bad and
remove it from the list so it can recover from this case without needing to
quit. The patch was originally developed against 2.2.19, but I've merged it into
HEAD. It probably also could apply to 2.3.x. Note that I did not create a
version for the HAVE_WINSOCK case since I don't have a Windows development
system to test that on but hopefully a similar strategy could be used there.

http://www.opendarwin.org/~jtownsend/patches/ebadf/servers-slapd.patch

Comment 1 Howard Chu 2005-07-11 21:31:31 UTC

ITS#3400 was supposed to correct the counter, so that the limit of 16 
tries gets detected and then slapd shuts down. Do you have a test case 
to reproduce the EBADF situation in the first place, that demonstrates 
that the patch for ITS#3400 is still broken? (Note that the fix of 
ITS#3400 is part of 2.2.19.)

The patch looks interesting; probably it should break out of the loop 
after it detects a single bad descriptor. (It is already pretty rare to 
have one bad descriptor, what's the likelihood of more than one?)

I haven't looked closely at the code yet, does it work with outbound 
connections too (e.g. syncrepl consumer)?

jtownsend@opendarwin.org wrote:
> Full_Name: Jason Townsend
> Version: 2.2.19, HEAD
> OS: Mac OS X 10.4.1
> URL: http://www.opendarwin.org/~jtownsend/patches/ebadf/servers-slapd.patch
> Submission from: (NULL) (17.221.43.142)
>
>
> If slapd encounters an EBADF error in its main select loop in
> servers/slapd/daemon.c, it simply ignores it and attempts the select with the
> same set of file descriptors as before. It appears the intent was to quit slapd
> after 16 of these errors occur, but in my experience it would cause slapd to go
> into a 100% CPU usage spin since select returns immediately in the EBADF case
> rather than blocking, and somehow 16 error limit was not triggered.
>
> I've made a change which allows slapd to find which file descriptor is bad and
> remove it from the list so it can recover from this case without needing to
> quit. The patch was originally developed against 2.2.19, but I've merged it into
> HEAD. It probably also could apply to 2.3.x. Note that I did not create a
> version for the HAVE_WINSOCK case since I don't have a Windows development
> system to test that on but hopefully a similar strategy could be used there.
>
> http://www.opendarwin.org/~jtownsend/patches/ebadf/servers-slapd.patch
>
>
>
> .
>
>   

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.       Director, Highland Sun
  http://www.symas.com               http://highlandsun.com/hyc
  Symas: Premier OpenSource Development and Support

Comment 2 jtownsend@opendarwin.org 2005-07-12 01:38:37 UTC

On Jul 11, 2005, at 2:31 PM, Howard Chu wrote:
> ITS#3400 was supposed to correct the counter, so that the limit of  
> 16 tries gets detected and then slapd shuts down. Do you have a  
> test case to reproduce the EBADF situation in the first place, that  
> demonstrates that the patch for ITS#3400 is still broken? (Note  
> that the fix of ITS#3400 is part of 2.2.19.)

It looks like this fix was done in before I merged in 2.2.19 actually  
(I know, bad Jason for only just submitting this patch now), so at  
the time I was running into the 100% CPU rather than an abnormal exit  
of slapd which is what ITS 3400 provided. Either way I think having  
slapd be able to recover from this on its own is better.

> The patch looks interesting; probably it should break out of the  
> loop after it detects a single bad descriptor. (It is already  
> pretty rare to have one bad descriptor, what's the likelihood of  
> more than one?)

That would be easy enough to do. If there was more than one bad  
descriptor though you'd have to iterate over the descriptors once for  
each of them in that case.

> I haven't looked closely at the code yet, does it work with  
> outbound connections too (e.g. syncrepl consumer)?

We don't currently use syncrepl so it's possible this patch needs to  
be modified to take that into account... are those connections in the  
same connection pool as incoming connections?

-Jason

Comment 3 Howard Chu 2005-07-12 02:17:00 UTC

Jason Townsend wrote:
>  It looks like this fix was done in before I merged in 2.2.19 actually
>  (I know, bad Jason for only just submitting this patch now), so at
>  the time I was running into the 100% CPU rather than an abnormal exit
>  of slapd which is what ITS 3400 provided. Either way I think having
>  slapd be able to recover from this on its own is better.

Well yes, ordinarily I would agree. But this is one of those "should 
never happen" situations and since we haven't got a reliable means of 
reproducing it, we really have no idea what it means when the situation 
does occur. The only time I can recall seeing it was with someone's 
custom back-perl script that was closing and dup'ing descriptors without 
mutexes. Their code was doing something like
    x = open(some file);
    close(fd);
    dup2( x, fd );
(I have no idea why.) There is a race condition where the descriptor got 
closed and was immediately re-used by a call to accept(). When their 
dup2 executed, they were stomping on the socket descriptor because they 
thought it still pointed at their flat file. (The fix is not to do the 
explicit close, since dup/dup2 will do that automatically and 
atomically.) Anyway, in cases like this of severe programmer error the 
connection table consistency is totally shot, and you cannot rely on it, 
so the safest thing is to actually bail out.

> > The patch looks interesting; probably it should break out of the
> > loop after it detects a single bad descriptor. (It is already
> > pretty rare to have one bad descriptor, what's the likelihood of
> > more than one?)
>
>  That would be easy enough to do. If there was more than one bad
>  descriptor though you'd have to iterate over the descriptors once for
>  each of them in that case.
>
> > I haven't looked closely at the code yet, does it work with
> > outbound connections too (e.g. syncrepl consumer)?
>
>  We don't currently use syncrepl so it's possible this patch needs to
>  be modified to take that into account... are those connections in the
>  same connection pool as incoming connections?

Yes, the consumer connections use the same connection pool, though the 
structures are filled a little bit dfferently.

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.       Director, Highland Sun
  http://www.symas.com               http://highlandsun.com/hyc
  Symas: Premier OpenSource Development and Support

Comment 4 jtownsend@opendarwin.org 2005-07-12 05:23:37 UTC

On Jul 11, 2005, at 7:17 PM, Howard Chu wrote:
> Well yes, ordinarily I would agree. But this is one of those  
> "should never happen" situations and since we haven't got a  
> reliable means of reproducing it, we really have no idea what it  
> means when the situation does occur. The only time I can recall  
> seeing it was with someone's custom back-perl script that was  
> closing and dup'ing descriptors without mutexes. Their code was  
> doing something like
>    x = open(some file);
>    close(fd);
>    dup2( x, fd );
> (I have no idea why.) There is a race condition where the  
> descriptor got closed and was immediately re-used by a call to  
> accept(). When their dup2 executed, they were stomping on the  
> socket descriptor because they thought it still pointed at their  
> flat file. (The fix is not to do the explicit close, since dup/dup2  
> will do that automatically and atomically.) Anyway, in cases like  
> this of severe programmer error the connection table consistency is  
> totally shot, and you cannot rely on it, so the safest thing is to  
> actually bail out.

I'd love to find the problem that is causing the bad file descriptor  
in the first place... but in the absence of that fix it seemed better  
to take the approach of this patch than do nothing. The script I used  
to reproduce this is here:

http://www.opendarwin.org/~jtownsend/bindstress.pl

If you start up about 6 instances of that script it is a pretty good  
torture test. It just runs ldapwhoami over and over with no delay in  
between using random DNs to bind as. It assumes all the passwords are  
set to the same thing. There are a few parameters at the top of the  
script that can be adjusted depending on the particular LDAP server  
in use. Typically I had at least 2000 user record when doing this test.

It's entirely possible that this problem is specific to the version  
of OpenLDAP included with Mac OS X as we have some changes in there  
to support Password Server authentication, but we were never able to  
track down a problem within that code which was causing EBADF errors.

-Jason

Comment 5 Howard Chu 2005-10-28 22:23:18 UTC

Jason Townsend wrote:
> On Jul 11, 2005, at 7:17 PM, Howard Chu wrote:
>> Well yes, ordinarily I would agree. But this is one of those "should 
>> never happen" situations and since we haven't got a reliable means of 
>> reproducing it, we really have no idea what it means when the 
>> situation does occur. The only time I can recall seeing it was with 
>> someone's custom back-perl script that was closing and dup'ing 
>> descriptors without mutexes. Their code was doing something like
>>    x = open(some file);
>>    close(fd);
>>    dup2( x, fd );
>> (I have no idea why.) There is a race condition where the descriptor 
>> got closed and was immediately re-used by a call to accept(). When 
>> their dup2 executed, they were stomping on the socket descriptor 
>> because they thought it still pointed at their flat file. (The fix is 
>> not to do the explicit close, since dup/dup2 will do that 
>> automatically and atomically.) Anyway, in cases like this of severe 
>> programmer error the connection table consistency is totally shot, 
>> and you cannot rely on it, so the safest thing is to actually bail out.
>
> I'd love to find the problem that is causing the bad file descriptor 
> in the first place... but in the absence of that fix it seemed better 
> to take the approach of this patch than do nothing. The script I used 
> to reproduce this is here:
>
> http://www.opendarwin.org/~jtownsend/bindstress.pl
>
> If you start up about 6 instances of that script it is a pretty good 
> torture test. It just runs ldapwhoami over and over with no delay in 
> between using random DNs to bind as. It assumes all the passwords are 
> set to the same thing. There are a few parameters at the top of the 
> script that can be adjusted depending on the particular LDAP server in 
> use. Typically I had at least 2000 user record when doing this test.
>
> It's entirely possible that this problem is specific to the version of 
> OpenLDAP included with Mac OS X as we have some changes in there to 
> support Password Server authentication, but we were never able to 
> track down a problem within that code which was causing EBADF errors.

If your slapd has to talk to an external Password Server to validate 
these Binds, then I'm going to have to place the blame on that 
communication step. I've been unable to reproduce this EBADF situation 
on the plain OpenLDAP code.

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/

Comment 6 Kurt Zeilenga 2005-11-03 19:00:45 UTC

changed notes
changed state Open to Suspended

Comment 7 Howard Chu 2005-11-16 06:50:27 UTC

changed state Suspended to Closed

Comment 8 Howard Chu 2006-01-19 18:12:11 UTC

changed notes

Comment 9 Howard Chu 2009-02-17 05:28:27 UTC

moved from Incoming to Archive.Incoming

Comment 10 OpenLDAP project 2014-08-01 21:05:56 UTC

could not dup in plain OpenLDAP Software, close?
probably fixed in ITS#4338