[Date Prev][Date Next] [Chronological] [Thread] [Top]

Re: Previous behavior disappears without threads ... (ITS#669)



At 08:23 PM 8/17/00 -0700, wesley.craig@umich.edu wrote:
>> From:    Kurt Zeilenga <kurt@OpenLDAP.org>
>> To:      abrock@georgefox.edu

>> Some one with Solaris thread debugging experience should look into
>> this...  thanks, Kurt

I'm willing to look into this.  Contextually important messages don't
appear to be in ITS under 669.  Somewhere else?

>:wes

I am including the text of another follow up message that I apparently never sent to the list:


Date: Mon, 14 Aug 2000 10:33:10 -0700
To: parker@bell-labs.com (parker@bell-labs.com), Kurt@OpenLDAP.org (Kurt@OpenLDAP.org)
From: Anthony Brock <abrock@mail.georgefox.edu>
Subject: Re: CVS of todays code freezing erratically ..


Thank you for the pointer! I had never heard of the 'truss' command, and it appears to have listed everything up to the freeze. However, even though I froze the ldap server very shortly after starting, I have a 1.5 meg file as output. I am including the last 100 lines below.

Unfortunately, I am at a loss as to why the ldap server is even opening a file called '/var/run/syslog_door', especially since this would have permission difficulties on a solaris box if run as a non-root user. Also, I noticed that there is now a file called syslog_door existing in the /var/run directory which is permissions 'Drw-r--r--' I have never see a permission list starting with a capital D before.

What ever the cause, it seems to occur near something relating to this file. I have run this 5 times, and it freezes at a different point each time. From the output, do you have any ideas as to what is going on?

On last item of note, the sun box only had about 5 connections listed as established to port 389 at the last time this froze, and it had about 20 listed with a state of 'TIME_WAIT'. Not nearly as many as some of the initial attempts to diagnose indicated was necessary for the process to freeze.

Thanks in advance,

Tony

The command executed was, as root:

# truss -fad /opt/ldap/libexec/slapd -f /opt/ldap/etc/openldap/slapd.conf -d 1 > truss.out 2>&1
#


to kill the command once frozen, I executed:

# kill -9 <pid number>
#
--- included output from truss.out ---
4662: 51.7048 write(2, " c o n n e c t i o n _ r".., 52) = 52
4662: 51.7047 time() = 966273361
connection_close: deferring conn=8 sd=15
4662: 51.7052 write(2, " c o n n e c t i o n _ c".., 41) = 41
4662: 51.7052 poll(0xFEE0B620, 5, -1) = 1
4662: 51.7056 yield() = 0
4662: 51.7057 read(4, " 0 0", 1024) = 2
4662: 51.7060 time() = 966273361
connection_resched: attempting closing conn=8 sd=15
4662: 51.7063 write(2, " c o n n e c t i o n _ r".., 52) = 52
connection_close: conn=8 sd=15
4662: 51.7067 write(2, " c o n n e c t i o n _ c".., 31) = 31
4662: 51.7070 fstat(3, 0xFEB05880) = 0
4662: 51.7073 time() = 966273361
4662: 51.7076 getpid() = 4662 [4661]
4662: 51.7078 putmsg(3, 0xFEB04F38, 0xFEB04F2C, 0) = 0
4662: 51.7082 open("/var/run/syslog_door", O_RDONLY) = 17
4662: 51.7085 door_info(17, 0xFEB04E70) = 0
4662: 51.7087 getpid() = 4662 [4661]
4662: 51.7089 door_call(17, 0xFEB04E58) = 0
4662: 51.7092 close(17) = 0
4662: 51.7096 close(15) = 0
4662: 51.7099 yield() = 0
4662: 51.7819 poll(0xFEE0B620, 5, -1) = 1
connection_get(16): got connid=5
4662: 51.7822 write(2, " c o n n e c t i o n _ g".., 33) = 33
4662: 51.7825 time() = 966273361
connection_read(16): checking for input on id=5
4662: 51.7827 write(2, " c o n n e c t i o n _ r".., 48) = 48
ber_get_next
4662: 51.7830 write(2, " b e r _ g e t _ n e x t".., 13) = 13
4662: 51.7832 read(16, " 005020106 B\0", 16384) = 7
ber_get_next: tag 0x30 len 5 contents:
4662: 51.7836 write(2, " b e r _ g e t _ n e x t".., 39) = 39
4662: 51.7838 time() = 966273361
4662: 51.7840 lwp_sema_post(0xFED09E30) = 0
4662: 51.7842 lwp_sema_wait(0xFED09E30) = 0
4662: 51.7846 lwp_mutex_lock(0xFF0E55C0) = 0
4662: 51.7846 lwp_mutex_wakeup(0xFF0E55C0) = 0
ber_get_next
4662: 51.7849 write(2, " b e r _ g e t _ n e x t".., 13) = 13
4662: 51.7853 lwp_sema_wait(0xFED09E30) = 0
4662: 51.7852 lwp_sema_post(0xFED09E30) = 0
4662: 51.7856 lwp_mutex_lock(0xFF0E55C0) = 0
4662: 51.7856 lwp_mutex_wakeup(0xFF0E55C0) = 0
do_unbind
4662: 51.7860 write(2, " d o _ u n b i n d\n", 10) = 10
4662: 51.7863 lwp_sema_post(0xFEE0BE30) = 0
4662: 51.7863 lwp_sema_wait(0xFEE0BE30) = 0
4662: 51.7866 fstat(3, 0xFED09978) = 0
4662: 51.7866 read(16, 0x019C21B8, 16384) Err#11 EAGAIN
4662: 51.7869 time() = 966273361
ber_get_next on fd 16 failed errno=11 (Resource temporarily unavailable)
4662: 51.7872 write(2, " b e r _ g e t _ n e x t".., 73) = 73
4662: 51.7875 getpid() = 4662 [4661]
4662: 51.7875 yield() = 0
4662: 51.7879 putmsg(3, 0xFED09030, 0xFED09024, 0) = 0
4662: 51.7878 time() = 966273361
4662: 51.7883 open("/var/run/syslog_door", O_RDONLY) = 15
4662: 51.7886 door_info(15, 0xFED08F68) = 0
4662: 51.7888 getpid() = 4662 [4661]
4662: 51.7891 door_call(15, 0xFED08F50) = 0
4662: 51.7893 close(15) = 0
connection_closing: readying conn=5 sd=16 for close
4662: 51.7897 write(2, " c o n n e c t i o n _ c".., 52) = 52
4662: 51.7897 poll(0xFEE0B620, 5, -1) = 1
4662: 51.7901 write(5, " 0", 1) = 1
4662: 51.7904 write(5, " 0", 1) = 1
connection_resched: attempting closing conn=5 sd=16
4662: 51.7907 write(2, " c o n n e c t i o n _ r".., 52) = 52
4662: lwp_sema_wait(0xFED09E30) (sleeping...)
4662: signotifywait() (sleeping...)
4662: lwp_sema_wait(0xFEC07E30) (sleeping...)
4662: lwp_sema_wait(0xFEE0BE30) (sleeping...)
4662: lwp_cond_wait(0xFF0E55B0, 0xFF0E55C0, 0xFF085BF0) (sleeping...)
4662: lwp_sema_wait(0xFEB05E30) (sleeping...)
4662: door_return(0x00000000, 0, 0x00000000, 0) (sleeping...)
4662: 123.3014 signotifywait() = 15
4662: 123.3018 lwp_sema_post(0xFEC07E30) = 0
4662: 123.3022 lwp_sema_wait(0xFEC07E30) = 0
4662: 123.3025 sigprocmask(SIG_SETMASK, 0xFF0EADC8, 0xFFBEFBE8) = 0
4662: 123.3028 lwp_sigredirect(3, SIGTERM, 0xFFBEFB64) = 0
4662: 123.3030 lwp_sigredirect(3, SIGTERM, 0xFFBEFB64) Err#22 EINVAL
4662: 123.3033 sigprocmask(SIG_SETMASK, 0xFFBEFBE8, 0x00000000) = 0
4662: 123.3036 Received signal #15, SIGTERM [caught]
4662: siginfo: SIGTERM pid=4681 uid=0
4662: 123.3040 sigprocmask(SIG_SETMASK, 0xFF0DEFF8, 0x00000000) = 0
slap_sig_shutdown: signal 15
4662: 123.3044 write(2, " s l a p _ s i g _ s h u".., 29) = 29
4662: 123.3048 write(5, " 0", 1) = 1
4662: 123.3052 sigprocmask(SIG_SETMASK, 0xFF0EADC8, 0x00000000) = 0
4662: 123.3065 setcontext(0xFFBEF830)
4662: lwp_sema_wait(0xFED09E30) (sleeping...)
4662: signotifywait() (sleeping...)
4662: lwp_sema_wait(0x00197FC0) (sleeping...)
4662: lwp_sema_wait(0xFEE0BE30) (sleeping...)
4662: lwp_cond_wait(0xFF0E55B0, 0xFF0E55C0, 0xFF085BF0) (sleeping...)
4662: lwp_sema_wait(0xFEB05E30) (sleeping...)
4662: door_return(0x00000000, 0, 0x00000000, 0) (sleeping...)
4662: *** process killed ***
--- end output from truss.out ---


At 11:16 AM 8/11/00 -0700, parker@bell-labs.com wrote:
Anthony Brock wrote:
>> Well, I have looked at the script, and every instance of a successful
>> connection is followed by a disconnect.  Also, this same script has no
>> problems against the older 1.2.11 server.
>>
>> The other thing of note is that there a two scripts I am using.  One had
>> some bugs, which resulted in incorrect output.  However, it also was
>very
>> careful to open and close connections properly.  Unfortunately, with
>that
>> script, after about 20 to 30 connections, the server will freeze.
>>
>> I am baffled by this, and any attempt to debug results in an output file
>> sever megabytes in size.  Also, is we wait for the server to cleanup the
>> network connections (takes about 4 minutes), the LDAP server is STILL
>> frozen and non-responsive.

Since you're running on Solaris (as I recall), bring the server up under
truss(1) and capture the output.  From this you should be able to
determine
where it is hanging, and perhaps why as well.

****************************************************************************** * Anthony Brock abrock@georgefox.edu * * Director of Network Services George Fox University * ******************************************************************************