Full_Name: Jason Townsend
Version: 2.2.19, HEAD
OS: Mac OS X 10.4.1
URL: http://www.opendarwin.org/~jtownsend/patches/ebadf/servers-slapd.patch
Submission from: (NULL) (17.221.43.142)

If slapd encounters an EBADF error in its main select loop in
servers/slapd/daemon.c, it simply ignores it and attempts the select with the
same set of file descriptors as before. It appears the intent was to quit slapd
after 16 of these errors occur, but in my experience it would cause slapd to go
into a 100% CPU usage spin, since select returns immediately in the EBADF case
rather than blocking, and somehow the 16-error limit was not triggered.

I've made a change which allows slapd to find which file descriptor is bad and
remove it from the list, so it can recover from this case without needing to
quit. The patch was originally developed against 2.2.19, but I've merged it into
HEAD. It probably could also apply to 2.3.x. Note that I did not create a
version for the HAVE_WINSOCK case, since I don't have a Windows development
system to test that on, but hopefully a similar strategy could be used there.

http://www.opendarwin.org/~jtownsend/patches/ebadf/servers-slapd.patch
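[Editor's note: the recovery strategy the patch describes — probe each
descriptor in the set and drop the ones the kernel rejects — can be sketched
as follows. This is an illustrative Python sketch only, not the actual C
patch to servers/slapd/daemon.c; the function name is made up.]

```python
import os
import select

def find_bad_fds(fds):
    """Probe each descriptor with a zero-timeout select() and return the
    ones the kernel rejects (EBADF).  This mirrors the recovery strategy
    described above: instead of retrying the whole set forever, identify
    the stale descriptor(s) and remove them from the watched set."""
    bad = []
    for fd in fds:
        try:
            select.select([fd], [], [], 0)
        except (OSError, ValueError):
            bad.append(fd)
    return bad

# Demo: a pipe yields two valid descriptors; closing one makes it stale.
r, w = os.pipe()
os.close(w)
print(find_bad_fds([r, w]))   # only the closed write end is reported
os.close(r)
```

A real event loop would then clear the bad descriptors from its fd_set and
re-enter select(), rather than spinning or shutting down.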
ITS#3400 was supposed to correct the counter, so that the limit of 16 tries
gets detected and then slapd shuts down. Do you have a test case to reproduce
the EBADF situation in the first place, that demonstrates that the patch for
ITS#3400 is still broken? (Note that the fix of ITS#3400 is part of 2.2.19.)

The patch looks interesting; probably it should break out of the loop after it
detects a single bad descriptor. (It is already pretty rare to have one bad
descriptor; what's the likelihood of more than one?) I haven't looked closely
at the code yet; does it work with outbound connections too (e.g. syncrepl
consumer)?

jtownsend@opendarwin.org wrote:
> Full_Name: Jason Townsend
> Version: 2.2.19, HEAD
> OS: Mac OS X 10.4.1
> URL: http://www.opendarwin.org/~jtownsend/patches/ebadf/servers-slapd.patch
> Submission from: (NULL) (17.221.43.142)
>
> If slapd encounters an EBADF error in its main select loop in
> servers/slapd/daemon.c, it simply ignores it and attempts the select with
> the same set of file descriptors as before. It appears the intent was to
> quit slapd after 16 of these errors occur, but in my experience it would
> cause slapd to go into a 100% CPU usage spin since select returns
> immediately in the EBADF case rather than blocking, and somehow the
> 16-error limit was not triggered.
>
> I've made a change which allows slapd to find which file descriptor is bad
> and remove it from the list so it can recover from this case without
> needing to quit. The patch was originally developed against 2.2.19, but
> I've merged it into HEAD. It probably also could apply to 2.3.x. Note that
> I did not create a version for the HAVE_WINSOCK case since I don't have a
> Windows development system to test that on but hopefully a similar
> strategy could be used there.
>
> http://www.opendarwin.org/~jtownsend/patches/ebadf/servers-slapd.patch

--
Howard Chu
Chief Architect, Symas Corp.       http://www.symas.com
Director, Highland Sun             http://highlandsun.com/hyc
Symas: Premier OpenSource Development and Support
On Jul 11, 2005, at 2:31 PM, Howard Chu wrote:
> ITS#3400 was supposed to correct the counter, so that the limit of
> 16 tries gets detected and then slapd shuts down. Do you have a
> test case to reproduce the EBADF situation in the first place, that
> demonstrates that the patch for ITS#3400 is still broken? (Note
> that the fix of ITS#3400 is part of 2.2.19.)

It looks like this fix was done before I merged in 2.2.19, actually (I know,
bad Jason for only just submitting this patch now), so at the time I was
running into the 100% CPU spin rather than the abnormal exit of slapd which
is what ITS#3400 provided. Either way, I think having slapd be able to
recover from this on its own is better.

> The patch looks interesting; probably it should break out of the
> loop after it detects a single bad descriptor. (It is already
> pretty rare to have one bad descriptor, what's the likelihood of
> more than one?)

That would be easy enough to do. If there were more than one bad descriptor,
though, you'd have to iterate over the descriptors once for each of them in
that case.

> I haven't looked closely at the code yet, does it work with
> outbound connections too (e.g. syncrepl consumer)?

We don't currently use syncrepl, so it's possible this patch needs to be
modified to take that into account... are those connections in the same
connection pool as incoming connections?

-Jason
Jason Townsend wrote:
> It looks like this fix was done before I merged in 2.2.19, actually
> (I know, bad Jason for only just submitting this patch now), so at
> the time I was running into the 100% CPU rather than an abnormal exit
> of slapd which is what ITS#3400 provided. Either way I think having
> slapd be able to recover from this on its own is better.

Well yes, ordinarily I would agree. But this is one of those "should never
happen" situations, and since we haven't got a reliable means of reproducing
it, we really have no idea what it means when the situation does occur. The
only time I can recall seeing it was with someone's custom back-perl script
that was closing and dup'ing descriptors without mutexes. Their code was
doing something like

   x = open(some file);
   close(fd);
   dup2( x, fd );

(I have no idea why.) There is a race condition where the descriptor got
closed and was immediately re-used by a call to accept(). When their dup2
executed, they were stomping on the socket descriptor because they thought it
still pointed at their flat file. (The fix is not to do the explicit close,
since dup/dup2 will do that automatically and atomically.) Anyway, in cases
like this of severe programmer error the connection table consistency is
totally shot, and you cannot rely on it, so the safest thing is to actually
bail out.

> > The patch looks interesting; probably it should break out of the
> > loop after it detects a single bad descriptor. (It is already
> > pretty rare to have one bad descriptor, what's the likelihood of
> > more than one?)
>
> That would be easy enough to do. If there was more than one bad
> descriptor though you'd have to iterate over the descriptors once for
> each of them in that case.
>
> > I haven't looked closely at the code yet, does it work with
> > outbound connections too (e.g. syncrepl consumer)?
>
> We don't currently use syncrepl so it's possible this patch needs to
> be modified to take that into account... are those connections in the
> same connection pool as incoming connections?

Yes, the consumer connections use the same connection pool, though the
structures are filled a little bit differently.

--
Howard Chu
Chief Architect, Symas Corp.       http://www.symas.com
Director, Highland Sun             http://highlandsun.com/hyc
Symas: Premier OpenSource Development and Support
On Jul 11, 2005, at 7:17 PM, Howard Chu wrote:
> Well yes, ordinarily I would agree. But this is one of those
> "should never happen" situations and since we haven't got a
> reliable means of reproducing it, we really have no idea what it
> means when the situation does occur. The only time I can recall
> seeing it was with someone's custom back-perl script that was
> closing and dup'ing descriptors without mutexes. Their code was
> doing something like
>    x = open(some file);
>    close(fd);
>    dup2( x, fd );
> (I have no idea why.) There is a race condition where the
> descriptor got closed and was immediately re-used by a call to
> accept(). When their dup2 executed, they were stomping on the
> socket descriptor because they thought it still pointed at their
> flat file. (The fix is not to do the explicit close, since dup/dup2
> will do that automatically and atomically.) Anyway, in cases like
> this of severe programmer error the connection table consistency is
> totally shot, and you cannot rely on it, so the safest thing is to
> actually bail out.

I'd love to find the problem that is causing the bad file descriptor in the
first place... but in the absence of that fix, it seemed better to take the
approach of this patch than to do nothing. The script I used to reproduce
this is here:

http://www.opendarwin.org/~jtownsend/bindstress.pl

If you start up about 6 instances of that script, it is a pretty good torture
test. It just runs ldapwhoami over and over with no delay in between, using
random DNs to bind as. It assumes all the passwords are set to the same
thing. There are a few parameters at the top of the script that can be
adjusted depending on the particular LDAP server in use. Typically I had at
least 2000 user records when doing this test.

It's entirely possible that this problem is specific to the version of
OpenLDAP included with Mac OS X, as we have some changes in there to support
Password Server authentication, but we were never able to track down a
problem within that code which was causing EBADF errors.

-Jason
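[Editor's note: only the URL of bindstress.pl is given above, so here is a
hedged Python sketch of what such a stress loop looks like based on the
description (random DNs, shared password, no delay between binds). The host,
base DN, DN pattern, password, and user count are all made-up placeholders
standing in for the adjustable parameters at the top of the real script.]

```python
import random
import subprocess

# Placeholder knobs -- adjust for the LDAP server under test.
HOST = "ldap.example.com"
BASE = "dc=example,dc=com"
PASSWORD = "secret"          # the script assumes one shared password
USERS = 2000                 # at least 2000 user records were used

def whoami_cmd(uid):
    """Build an ldapwhoami invocation for one simple bind as a random user."""
    dn = "uid=user%d,ou=people,%s" % (uid, BASE)
    return ["ldapwhoami", "-x", "-H", "ldap://" + HOST,
            "-D", dn, "-w", PASSWORD]

def stress(iterations):
    """Run ldapwhoami over and over with no delay, binding as random DNs.
    Run ~6 of these concurrently for a torture test like the one described."""
    for _ in range(iterations):
        subprocess.run(whoami_cmd(random.randrange(USERS)),
                       capture_output=True)
```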
Jason Townsend wrote:
> On Jul 11, 2005, at 7:17 PM, Howard Chu wrote:
>> Well yes, ordinarily I would agree. But this is one of those "should
>> never happen" situations and since we haven't got a reliable means of
>> reproducing it, we really have no idea what it means when the
>> situation does occur. The only time I can recall seeing it was with
>> someone's custom back-perl script that was closing and dup'ing
>> descriptors without mutexes. Their code was doing something like
>>    x = open(some file);
>>    close(fd);
>>    dup2( x, fd );
>> (I have no idea why.) There is a race condition where the descriptor
>> got closed and was immediately re-used by a call to accept(). When
>> their dup2 executed, they were stomping on the socket descriptor
>> because they thought it still pointed at their flat file. (The fix is
>> not to do the explicit close, since dup/dup2 will do that
>> automatically and atomically.) Anyway, in cases like this of severe
>> programmer error the connection table consistency is totally shot,
>> and you cannot rely on it, so the safest thing is to actually bail out.
>
> I'd love to find the problem that is causing the bad file descriptor
> in the first place... but in the absence of that fix it seemed better
> to take the approach of this patch than do nothing. The script I used
> to reproduce this is here:
>
> http://www.opendarwin.org/~jtownsend/bindstress.pl
>
> If you start up about 6 instances of that script it is a pretty good
> torture test. It just runs ldapwhoami over and over with no delay in
> between using random DNs to bind as. It assumes all the passwords are
> set to the same thing. There are a few parameters at the top of the
> script that can be adjusted depending on the particular LDAP server in
> use. Typically I had at least 2000 user records when doing this test.
> It's entirely possible that this problem is specific to the version of
> OpenLDAP included with Mac OS X as we have some changes in there to
> support Password Server authentication, but we were never able to
> track down a problem within that code which was causing EBADF errors.

If your slapd has to talk to an external Password Server to validate these
Binds, then I'm going to have to place the blame on that communication step.
I've been unable to reproduce this EBADF situation on the plain OpenLDAP
code.

--
Howard Chu
Chief Architect, Symas Corp.       http://www.symas.com
Director, Highland Sun             http://highlandsun.com/hyc
OpenLDAP Core Team                 http://www.openldap.org/project/
changed notes changed state Open to Suspended
changed state Suspended to Closed
changed notes
moved from Incoming to Archive.Incoming
Could not duplicate in plain OpenLDAP Software; close? Probably fixed in ITS#4338.