[Date Prev][Date Next] [Chronological] [Thread] [Top]

Race condition?



I wanted to get people's thoughts on a problem I've come across.  (The
issue was first logged in the bug database under ITS#190, in case you'd
like a little bit more info.  A change has been made since the to attempt
to fix the problem, however, my further testing seems to indicate that it
is still present.)

I'm running slapd.exe build with the latest set of source as of morning of
6/4/99.  The OS is NT 4.0, SP3.  I'm running ldapsearch, telling it to dump
the entire contents of my directory (< 50 objects).  ldapsearch is
currently run from a script executing it 100 times in order to simplify
reproduction of the problem; however, I first discovered it when running
ldapsearch once from the command line.

The symptom that I'm seeing is an assertion failure in daemon.c, line 569. 
The cause may be a synchronization discrepancy leading to a race condition
on socket sets.  Below is a dump of the last few lines of the slapd trace,
up to the assert:

send_ldap_result 0::
daemon: activity on 1 descriptors
daemon: activity on: 868r
daemon: read activity on 868
connection_get(868): got connid=12
connection_read(868): checking for input on id=12
ber_get_next
ber_get_next: tag 0x30 len 5 contents:
ber_dump: buf 0x1146460, ptr 0x1146460, end 0x1146465
        02 01 03  B 00
ber_get_next
ber_get_next on fd 868 failed errno 10035 (unknown)
do_unbind
        *** got 18113692 of 0 so far
daemon: select: tcps=16 active_threads=1 tvp=NULL
connection_closing: readying conn=12 sd=868 for close.
daemon: activity on 1 descriptors
connection_resched: attempting closing conn=12 sd=868.
daemon: activity on:connection_close: conn=12 sd=868.
 868rdaemon: removing 868

daemon: read activity on 868
Assertion failed: FD_ISSET( rd, &slap_daemon.sd_actives), file
C:\ldap\servers\slapd\daemon.c, line 569

These are the locals in slapd_daemon_task where the assert happens:
	rd	868
	nfds	64
+	from	{...}
	i	0
-	writefds	{...}
		fd_count	0
	+	fd_array	0x0165fd5c
+	zero	{...}
+	tvp	0x00000000
+	client_addr	0x0048094c "127.0.0.1"
+	client_name	0x00000000 ""
-	readfds	{...}
		fd_count	1
	-	fd_array	0x0165fe74
		[0]	868
		[1]	680
		[2]	692
		[3]	712
		[4]	732
		[5]	816
		[6]	796
		[7]	880
		[8]	864
		[9]	868
[snip -- the rest are 0's]
	[63]	0
	ns	1
	ptr	0x01146f20
	tcps	16
	inetd	0

You'll notice that at the end of the trace, there's a number of
'interleaved' messages having to do with socket 868--the same (and only)
socket that is in readfds.  To me this indicates that there's another
thread that is accessing the same socket and one or more descriptor sets
that contain it.  At the top of the file, where readfds and writefds are
set up from slap_daemon's descriptor sets, the operation is protected by
slap_daemon.sd_mutex.  In the triggered assert, on the other hand, no mutex
locking is done.  Therefore, when another thread clears the socket from the
'actives' description set in slapd_remove, this fact is left unnoticed by
the asserting code below.  I'm pretty new to the code (today is only my
second day of looking at it) and so I don't have a good solution to the
problem.  Perhaps that assert could be replaced by an 'if' that would
contain the following few lines (connection_read/slapd_close) in its body
and have the whole thing sandwiched in a  slap_daemon.sd_mutex-protected
block?  I don't know; I'm sure there are better ways of dealing with the
problem....  I'm just passing my finds to the people in the know in the
hope that the info may be helpful.

Please let me know if I can be of any further help with this.

Regards,

Alex Zeltser