Full_Name: Quanah Gibson-Mount Version: HEAD OS: Solaris 8 URL: ftp://ftp.openldap.org/incoming/ Submission from: (NULL) (171.64.19.82) The recent Light Weight Listener import is causing core dumps in HEAD, which limits the usability testing I'd like to do. Backtrace: #0 0xfee1f2b4 in _lwp_kill () from /usr/lib/libc.so.1 (gdb) bt #0 0xfee1f2b4 in _lwp_kill () from /usr/lib/libc.so.1 #1 0xfedcbcf4 in raise () from /usr/lib/libc.so.1 #2 0xfedb598c in abort () from /usr/lib/libc.so.1 #3 0xfedb5c30 in _assert () from /usr/lib/libc.so.1 #4 0x00022090 in slapd_clr_read (s=13, wake=0) at daemon.c:543 #5 0x00027b88 in connection_read_activate (s=-1639983344) at connection.c:1440 #6 0x00024aa0 in slapd_daemon_task (ptr=0xffc00) at daemon.c:1991 #7 0xfed64c44 in _lwp_start () from /usr/lib/lwp/libthread.so.1 #8 0xfed64c44 in _lwp_start () from /usr/lib/lwp/libthread.so.1 Howard reports encountering the same issue. --Quanah
Using this slight modification of Jason's bindstress patch in ITS#3855 I can reproduce the crash quite rapidly. On my system (Linux x86_64) it tends to crash much sooner using epoll than using select, but in both cases it is the same assert failing. --- bindstress.pl 2005-07-11 22:14:32.000000000 -0700 +++ binds2.pl 2005-10-28 14:40:39.000000000 -0700 @@ -2,22 +2,29 @@ # bindstress.pl - free software by Matt Richard # edit the variables below to reflect your LDAP server -my $server = "server"; +my $server = "ldap://:9000"; my $uid_base = "cn=users,dc=example,dc=com"; my $bind_pass = "abc123"; -my $filter = "(!(|(uid=system)(uid=root)))"; +my $filter = "(&(objectclass=person)(!(|(uid=system)(uid=root))))"; +my $rdn = ""; # search for users in the database -my @search = `ldapsearch -x -LLL -A -b "$uid_base" -h "$server" "$filter" dn`; +my @search = `ldapsearch -x -LLL -A -b "$uid_base" -H "$server" "$filter" dn`; # convert search results into a list of users for $searchline (@search) { - @dn = split(/[=,\ ]/, $searchline); + $searchline =~ s/^dn: //; + + @dn = split(/[=,]/, $searchline); - if ($dn[2] ne "") + if ($rdn eq "") + { + $rdn = $dn[0]; + } + if ($dn[1] ne "") { - push (@users, $dn[2]); + push (@users, $dn[1]); } } @@ -28,7 +35,7 @@ my $uid = @users [ rand @users ]; # define the ldap bind command - $cmd = "ldapwhoami -x -h $server -D \"uid=$uid,$uid_base\" -w \"$bind_pass\ ""; + $cmd = "ldapwhoami -x -H $server -D \"$rdn=$uid,$uid_base\" -w \"$bind_pass \""; # get the current time my $date = `date \"+%H:%M:%S\"`; quanah@stanford.edu wrote: > Full_Name: Quanah Gibson-Mount > Version: HEAD > OS: Solaris 8 > URL: ftp://ftp.openldap.org/incoming/ > Submission from: (NULL) (171.64.19.82) > > > The recent Light Weight Listener import is causing core dumps in HEAD, which > limits the usability testing I'd like to do. Backtrace: > > #0 0xfee1f2b4 in _lwp_kill () from /usr/lib/libc.so.1 > (gdb) bt > #0 0xfee1f2b4 in _lwp_kill () from /usr/lib/libc.so.1 > #1 0xfedcbcf4 in raise () from /usr/lib/libc.so.1 > #2 0xfedb598c in abort () from /usr/lib/libc.so.1 > #3 0xfedb5c30 in _assert () from /usr/lib/libc.so.1 > #4 0x00022090 in slapd_clr_read (s=13, wake=0) at daemon.c:543 > #5 0x00027b88 in connection_read_activate (s=-1639983344) at connection.c:1440 > #6 0x00024aa0 in slapd_daemon_task (ptr=0xffc00) at daemon.c:1991 > #7 0xfed64c44 in _lwp_start () from /usr/lib/lwp/libthread.so.1 > #8 0xfed64c44 in _lwp_start () from /usr/lib/lwp/libthread.so.1 > > > Howard reports encountering the same issue. > > --Quanah > > > > > -- -- Howard Chu Chief Architect, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc OpenLDAP Core Team http://www.openldap.org/project/
Howard Chu wrote: > Using this slight modification of Jason's bindstress patch in ITS#3855 > I can reproduce the crash quite rapidly. On my system (Linux x86_64) > it tends to crash much sooner using epoll than using select, but in > both cases it is the same assert failing. > It appears that the problem existed before. I was able to reproduce the same crash against 2.3.11 after running the bindstress script for several minutes. I.e., it doesn't happen as quickly in 2.3.11, but it eventually hits the same assert. -- -- Howard Chu Chief Architect, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc OpenLDAP Core Team http://www.openldap.org/project/
moved from Incoming to Development
--On Friday, October 28, 2005 3:09 PM -0700 Howard Chu <hyc@symas.com> wrote: > Howard Chu wrote: >> Using this slight modification of Jason's bindstress patch in ITS#3855 >> I can reproduce the crash quite rapidly. On my system (Linux x86_64) >> it tends to crash much sooner using epoll than using select, but in >> both cases it is the same assert failing. >> > It appears that the problem existed before. I was able to reproduce the > same crash against 2.3.11 after running the bindstress script for several > minutes. I.e., it doesn't happen as quickly in 2.3.11, but it eventually > hits the same assert. You are filling me with cheer... :P I think we need a 2.3.12 release soon anyway. ;) --Quanah -- Quanah Gibson-Mount Principal Software Developer ITSS/Shared Services Stanford University GnuPG Public Key: http://www.stanford.edu/~quanah/pgp.html "These censorship operations against schools and libraries are stronger than ever in the present religio-political climate. They often focus on fantasy and sf books, which foster that deadly enemy to bigotry and blind faith, the imagination." -- Ursula K. Le Guin
quanah@stanford.edu wrote: > --On Friday, October 28, 2005 3:09 PM -0700 Howard Chu <hyc@symas.com> > wrote: > > >> Howard Chu wrote: >> >>> Using this slight modification of Jason's bindstress patch in ITS#3855 >>> I can reproduce the crash quite rapidly. On my system (Linux x86_64) >>> it tends to crash much sooner using epoll than using select, but in >>> both cases it is the same assert failing. >>> >>> >> It appears that the problem existed before. I was able to reproduce the >> same crash against 2.3.11 after running the bindstress script for several >> minutes. I.e., it doesn't happen as quickly in 2.3.11, but it eventually >> hits the same assert. >> > > You are filling me with cheer... :P > > I think we need a 2.3.12 release soon anyway. ;) So it turns out there were a few race conditions and cases of trying to close the same connection twice. (Which was bad news, because the descriptor could get reused immediately after the first close.) I believe this is now fixed in HEAD. Please test. Heavily. -- -- Howard Chu Chief Architect, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc OpenLDAP Core Team http://www.openldap.org/project/
changed notes changed state Open to Test
moved from Development to Software Bugs
hyc@symas.com wrote: > So it turns out there were a few race conditions and cases of trying to > close the same connection twice. (Which was bad news, because the > descriptor could get reused immediately after the first close.) > I believe this is now fixed in HEAD. Please test. Heavily. Running slapd on a single processor machine, I no longer see any asserts. Running on my dual machine, if syslog is enabled, I occasionally see an assert in slapd_set_read() called from connection_read(). With syslog disabled, this problem doesn't occur. No bright ideas come to mind at the moment... -- -- Howard Chu Chief Architect, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc OpenLDAP Core Team http://www.openldap.org/project/
hyc@symas.com wrote: > Running slapd on a single processor machine, I no longer see any > asserts. Running on my dual machine, if syslog is enabled, I > occasionally see an assert in slapd_set_read() called from > connection_read(). With syslog disabled, this problem doesn't occur. No > bright ideas come to mind at the moment... Some context: Oct 29 04:56:14 mandolin slapd.head[10328]: conn=87548 op=1 UNBIND Oct 29 04:56:14 mandolin slapd.head[10328]: conn=87548 fd=15 closed Oct 29 04:56:14 mandolin slapd.head[10328]: conn=87549 fd=15 ACCEPT from IP=192.168.1.22:23726 (IP=0.0.0.0:9000) Oct 29 04:56:14 mandolin slapd.head[10328]: conn=87549 op=0 BIND dn="cn=Sandi Couto,ou=peons,o=foo" method=128 Oct 29 04:56:14 mandolin slapd.head[10328]: conn=87549 op=0 BIND dn="cn=Sandi Couto,ou=Peons,o=foo" mech=SIMPLE ssf=0 Oct 29 04:56:14 mandolin slapd.head[10328]: conn=87549 op=0 RESULT tag=97 err=0 text= Oct 29 04:56:14 mandolin slapd.head[10328]: conn=87549 op=1 UNBIND Oct 29 04:56:14 mandolin slapd.head[10328]: conn=87549 fd=15 closed Oct 29 04:56:14 mandolin slapd.head[10328]: conn=87550 fd=15 ACCEPT from IP=192.168.1.22:23727 (IP=0.0.0.0:9000) daemon: listen=11, new connection on 15 daemon: added 15r daemon: activity on 2 descriptors daemon: select: listen=10 active_threads=0 tvp=NULL daemon: select: listen=11 active_threads=0 tvp=NULL daemon: activity on 1 descriptor daemon: activity on: 15r daemon: read active on 15 daemon: select: listen=10 active_threads=0 tvp=NULL daemon: select: listen=11 active_threads=0 tvp=NULL daemon: activity on 2 descriptors daemon: select: listen=10 active_threads=0 tvp=NULL daemon: select: listen=11 active_threads=0 tvp=NULL daemon: activity on 1 descriptor daemon: activity on: 15r daemon: read active on 15 daemon: select: listen=10 active_threads=0 tvp=NULL daemon: select: listen=11 active_threads=0 tvp=NULL daemon: activity on 1 descriptor daemon: select: listen=10 active_threads=0 tvp=NULL daemon: select: listen=11 active_threads=0 tvp=NULL daemon: activity on 2 descriptors daemon: activity on: 15r daemon: read active on 15 daemon: select: listen=10 active_threads=0 tvp=NULL daemon: select: listen=11 busy daemon: removing 15 daemon: listen=11, new connection on 15 slapd.head: ../../../head/servers/slapd/daemon.c:567: slapd_set_read: Assertion `((slap_daemon.sd_index[(s)]) != -1)' failed. daemon: activity on 1 descriptor Program received signal SIGABRT, Aborted. [Switching to Thread 1090517344 (LWP 10355)] 0x00002aaaab6e76cd in raise () from /lib64/tls/libc.so.6 (gdb) bt #0 0x00002aaaab6e76cd in raise () from /lib64/tls/libc.so.6 #1 0x00002aaaab6e8c7e in abort () from /lib64/tls/libc.so.6 #2 0x00002aaaab6e1420 in __assert_fail () from /lib64/tls/libc.so.6 #3 0x000000000042e7a3 in slapd_set_read (s=15, wake=1) at ../../../head/servers/slapd/daemon.c:567 #4 0x00000000004369bf in connection_read (s=15, op=0x40fff480) at ../../../head/servers/slapd/connection.c:1665 #5 0x0000000000435efe in connection_read_thread (ctx=0x40fff4c0, argv=0xf) at ../../../head/servers/slapd/connection.c:1419 #6 0x0000000000556184 in ldap_int_thread_pool_wrapper (xpool=0x788710) at ../../../head/libraries/libldap_r/tpool.c:615 #7 0x00002aaaab5ab649 in start_thread () from /lib64/tls/libpthread.so.0 #8 0x00002aaaab7716c3 in thread_start () from /lib64/tls/libc.so.6 (gdb) info thr 10 Thread 1149237600 (LWP 10362) 0x00002aaaab5ad55a in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 9 Thread 1140848992 (LWP 10361) 0x00002aaaab5ad55a in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 8 Thread 1132460384 (LWP 10360) 0x00002aaaab5ad55a in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 7 Thread 1124071776 (LWP 10359) 0x00002aaaab5ad55a in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 6 Thread 1115683168 (LWP 10358) 0x00002aaaab5af8eb in __lll_mutex_lock_wait () from /lib64/tls/libpthread.so.0 5 Thread 1107294560 (LWP 10357) 0x00002aaaab5af8eb in __lll_mutex_lock_wait () from /lib64/tls/libpthread.so.0 4 Thread 1098905952 (LWP 10356) 0x00002aaaab5ad55a in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 * 3 Thread 1090517344 (LWP 10355) 0x00002aaaab6e76cd in raise () from /lib64/tls/libc.so.6 2 Thread 1082128736 (LWP 10337) 0x00002aaaab5af8eb in __lll_mutex_lock_wait () from /lib64/tls/libpthread.so.0 1 Thread 46912511001792 (LWP 10328) 0x00002aaaab5abd2f in pthread_join () from /lib64/tls/libpthread.so.0 (gdb) thr 6 [Switching to thread 6 (Thread 1115683168 (LWP 10358))]#5 0x000000000043108a in slap_listener (sl=0x2aaaab71c520) at ../../../head/servers/slapd/daemon.c:1519 1519 slapd_add( s, 1, NULL ); (gdb) bt #0 0x00002aaaab5af8eb in __lll_mutex_lock_wait () from /lib64/tls/libpthread.so.0 #1 0x0000000000000000 in ?? () #2 0x0000000000000008 in ?? () #3 0x00002aaaab5ac877 in pthread_mutex_lock () from /lib64/tls/libpthread.so.0 #4 0x00002aaaab76ddc8 in syslog () from /lib64/tls/libc.so.6 #5 0x000000000043108a in slap_listener (sl=0x2aaaab71c520) at ../../../head/servers/slapd/daemon.c:1519 (gdb) thr 5 [Switching to thread 5 (Thread 1107294560 (LWP 10357))]#0 0x00002aaaab5af8eb in __lll_mutex_lock_wait () from /lib64/tls/libpthread.so.0 (gdb) bt 20 #0 0x00002aaaab5af8eb in __lll_mutex_lock_wait () from /lib64/tls/libpthread.so.0 #1 0x0000000000766fe8 in slap_counters () #2 0x0000000000000000 in ?? () #3 0x00002aaaab5ac877 in pthread_mutex_lock () from /lib64/tls/libpthread.so.0 #4 0x000000000082e7c0 in ?? () #5 0x0000000041fff290 in ?? () #6 0x000000000058a3e9 in ber_memcalloc (n=46912527370128, s=1107293072) at ../../../head/libraries/liblber/memory.c:306 #7 0x00000000004593b9 in do_bind (op=0x2aaaadf02f20, rs=0x41fff3e0) at ../../../head/servers/slapd/bind.c:54 #8 0x0000000000435a98 in connection_operation (ctx=0x41fff4c0, arg_v=0x2aaaadf02f20) at ../../../head/servers/slapd/connection.c:1292 #9 0x0000000000556184 in ldap_int_thread_pool_wrapper (xpool=0x788710) at ../../../head/libraries/libldap_r/tpool.c:615 #10 0x00002aaaab5ab649 in start_thread () from /lib64/tls/libpthread.so.0 #11 0x00002aaaab7716c3 in thread_start () from /lib64/tls/libc.so.6 (gdb) thr 2 [Switching to thread 2 (Thread 1082128736 (LWP 10337))]#0 0x00002aaaab5af8eb in __lll_mutex_lock_wait () from /lib64/tls/libpthread.so.0 (gdb) bt 20 #0 0x00002aaaab5af8eb in __lll_mutex_lock_wait () from /lib64/tls/libpthread.so.0 #1 0x0000000000000000 in ?? () #2 0x0000000000000000 in ?? () #3 0x00002aaaab5ac877 in pthread_mutex_lock () from /lib64/tls/libpthread.so.0 #4 0x0000000000000001 in ?? () #5 0x00000000005962ce in __PRETTY_FUNCTION__.11610 () #6 0x0000000000000000 in ?? () #7 0x0000000000435ed3 in connection_client_stop (s=0) at ../../../head/servers/slapd/connection.c:1404 #8 0x00000000004319a1 in slapd_daemon_task (ptr=0x0) at ../../../head/servers/slapd/daemon.c:1734 #9 0x00002aaaab5ab649 in start_thread () from /lib64/tls/libpthread.so.0 #10 0x00002aaaab7716c3 in thread_start () from /lib64/tls/libc.so.6 #11 0x0000000000000000 in ?? () #12 0x0000000000000000 in ?? () #13 0x0000000000000000 in ?? () #14 0x0000000000000000 in ?? () #15 0x0000000000000000 in ?? () #16 0x0000000000000000 in ?? () #17 0x0000000000000000 in ?? () #18 0x0000000000000000 in ?? () #19 0x0000000000000000 in ?? () (More stack frames follow...) (gdb) frame 8 #8 0x00000000004319a1 in slapd_daemon_task (ptr=0x0) at ../../../head/servers/slapd/daemon.c:1734 1734 ldap_pvt_thread_mutex_lock( &slap_daemon.sd_mutex ); (gdb) thr 3 [Switching to thread 3 (Thread 1090517344 (LWP 10355))]#0 0x00002aaaab6e76cd in raise () from /lib64/tls/libc.so.6 (gdb) frame 4 #4 0x00000000004369bf in connection_read (s=15, op=0x40fff480) at ../../../head/servers/slapd/connection.c:1665 1665 slapd_set_read( s, 1 ); (gdb) p *c $13 = {c_struct_state = 2, c_conn_state = 3, c_conn_idx = 3, c_close_reason = 0x596d1b "?", c_mutex = {__m_reserved = 2, __m_count = 0, __m_owner = 0x100002873, __m_kind = 0, __m_lock = {__status = 0, __spinlock = 0}}, c_sb = 0x2aaaadc02d90, c_starttime = 1130586974, c_activitytime = 1130586974, c_connid = 87550, c_peer_domain = {bv_len = 7, bv_val = 0x2aaaadf03460 "unknown"}, c_peer_name = {bv_len = 21, bv_val = 0x2aaaadf034e0 "IP=192.168.1.22:23727"}, c_listener = 0x76a260, c_sasl_bind_in_progress = 0, c_sasl_bind_mech = {bv_len = 0, bv_val = 0x0}, c_sasl_dn = {bv_len = 0, bv_val = 0x0}, c_sasl_authz_dn = {bv_len = 0, bv_val = 0x0}, c_authz_backend = 0x0, c_authz_cookie = 0x0, c_authz = { sai_method = 128, sai_mech = {bv_len = 0, bv_val = 0x0}, sai_dn = { bv_len = 0, bv_val = 0x0}, sai_ndn = {bv_len = 0, bv_val = 0x0}, sai_ssf = 0, sai_transport_ssf = 0, sai_tls_ssf = 0, sai_sasl_ssf = 0}, c_protocol = 0, c_ops = {stqh_first = 0x2aaaadf02f20, stqh_last = 0x2aaaadf03078}, c_pending_ops = {stqh_first = 0x0, stqh_last = 0x2aaaac8788a8}, c_write_mutex = {__m_reserved = 0, __m_count = 0, __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, __spinlock = 0}}, c_write_cv = {__c_lock = {__status = 8589934592, __spinlock = 1}, __c_waiting = 0x1, __padding = "\001\000\000\000\000\000\000\000\220\207\207»*\000", __align = 0}, c_currentber = 0x2aaaadc00930, c_writewaiter = 0, c_is_tls = 0, c_needs_tls_accept = 0, c_sasl_layers = 0, c_sasl_done = 0, c_sasl_authctx = 0x2aaaadf00b40, c_sasl_sockctx = 0x0, c_sasl_extra = 0x2aaaadf03630, c_sasl_bindop = 0x0, c_pagedresults_state = { ps_be = 0x0, ps_size = 0, ps_cookie = 0, ps_count = 0}, c_n_ops_received = 1, c_n_ops_executing = 1, c_n_ops_pending = 0, c_n_ops_completed = 0, c_n_get = 1, c_n_read = 1, c_n_write = 0, c_pb = 0x0, c_extensions = 0x0, c_clientfunc = 0, c_clientarg = 0x0, c_send_ldap_result = 0x4490d9 <slap_send_ldap_result>, c_send_search_entry = 0x449e34 <slap_send_search_entry>, c_send_search_reference = 0x44c0ac <slap_send_search_reference>, c_send_ldap_extended = 0x449973 <slap_send_ldap_extended>, c_send_ldap_intermediate = 0x449c15 <slap_send_ldap_intermediate>} (gdb) What really doesn't make sense here is that the reader thread that crashed shouldn't even be running yet. I.e., the debug output shows that fd 15 was just removed from slapd control. Then a new connection was accepted, using that same fd. But the listener thread hasn't added 15 back into slapd control yet - that's the slapd_add() call that happens right after the StatsLog invocation. So, fd 15 really is still invalid at this point. But since it is invalid, there should not be any active reader threads using that descriptor yet. Notice that thread 5 has already read a Bind operation from that fd, and the conn->c_conn_state is SLAP_C_BINDING. Again, it should not be possible for a reader thread to have already gotten this far, since the accepting thread hasn't enabled the descriptor yet. The only possibility that comes to mind is that this reader thread is leftover from the previous invocation. I.e., it was submitted into the pool while the previous connection was active, and the connection was closed before the thread got to run. Since the listener thread has just called connection_init(), the conn structure is valid when the reader thread runs. But since the listener hasn't called slapd_add() yet, the socket is still invalid. Moving the slapd_add() so that it occurs while the conn structure is locked appears to fix this problem. -- -- Howard Chu Chief Architect, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc OpenLDAP Core Team http://www.openldap.org/project/
hyc@symas.com wrote: > So it turns out there were a few race conditions and cases of trying to > close the same connection twice. (Which was bad news, because the > descriptor could get reused immediately after the first close.) > I believe this is now fixed in HEAD. Please test. Heavily. One more footnote - testing this bug using tests/progs/slapd-bind freezes up my kernel when running slapd with debug output on the console. It seems to be safe when running from an ssh session. I also run out of socket resources on the server pretty quickly if the bind client is on the same machine as the server. Otherwise, running remote clients, it can run through hundreds of thousands of connections without any trouble. (Server was running Linux 2.6.13, now 2.6.14, same lockup occurs on both. The Magic SysRq doesn't work, the kernel is totally locked up, no way to get any diagnostics or recover at that point.) -- -- Howard Chu Chief Architect, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc OpenLDAP Core Team http://www.openldap.org/project/
changed notes changed state Test to Closed
moved from Software Bugs to Archive.Software Bugs
HEAD only fixed