[Date Prev][Date Next] [Chronological] [Thread] [Top]

Re: (ITS#3567) slapd terminates unexpectedly



Howard, Dagobert,

> Your log shows that poll() fails with ENOSYS, which is not a documented 
> failure code for this system call. It looks like Solaris itself has 
> failed here, and you should probably contact Sun for a solution.

Maybe this is a Sun bug but as we went through here at Sendamil and 
reported in ITS 3400, the ENOSYS could mean "I don't like a return value 
I got from poll()" where poll() is use to implement select on Solaris. 
Dagobert - I suggest you repro the problem and get three pieces of doc:

1) truss with options:

truss -d -l -f -vpoll -u:: -ua.out:: -o <output_file_name> -p <pid>

This will show the select call getting mapped to poll() like the 
following. In ITS 3400, we saw that the pollset as built from the 
select() call had a bit set for an FD that Solaris thinks is closed. (My 
suspicsion is slapd lost track of FDs somehow and left one set for an FD 
that it had closed.)

E.g.,

14073/2@2:      22.1298   -> libc:select_large_fdset(0x10, 0xfe3ffb74, 
0xfe3ff774, 0x0)
14073/2@2:      22.1303     -> libc_psr:memset(0xfe3fd188, 0x0, 0x2000,0x0)
14073/2@2:      22.1309     <- libc_psr:memset() = 0xfe3fd188
14073/2@2:      22.1314     -> libc_psr:memset(0xfe3fcf80, 0x0, 0x200, 
0xfffffff8)
14073/2@2:      22.1319     <- libc_psr:memset() = 0xfe3fcf80
14073/2@2:      22.1325     -> libc:_poll(0xfe3fcf80, 0x7, 0xffffffff, 
0xfffffff8)
14073/2:        22.1330 poll(0xFE3FCF80, 7, -1) = 1
14073/2:                fd=5  ev=POLLRDNORM rev=0
14073/2:                fd=7  ev=POLLRDNORM rev=0
14073/2:                fd=8  ev=POLLRDNORM rev=0
14073/2:                fd=9  ev=POLLRDNORM rev=0
14073/2:                fd=10 ev=POLLRDNORM rev=0
14073/2:                fd=11 ev=POLLRDNORM rev=0
14073/2:                fd=15 ev=POLLRDNORM rev=POLLRDNORM
14073/2@2:      22.1337     <- libc:_poll() = 1
14073/2@2:      22.1341   <- libc:select_large_fdset() = 1

Stop this truss as soon as it looks like it's spinning so there's not 
too much data.

2) To see that possibility, run slapd (same time as you got the truss) 
with the debug level 8, which should trace slapd connection activity ala 
(this exampel is from OpenLDAP 2.1 but it should be roughly the same):

Mar  7 14:12:18 qa-karn slapd[14073]: [ID 100111 local4.debug] slapd 
starting
Mar  7 14:12:18 qa-karn slapd[14073]: [ID 732783 local4.debug] daemon: 
added 7r
Mar  7 14:12:18 qa-karn slapd[14073]: [ID 732783 local4.debug] daemon: 
added 8r
Mar  7 14:12:18 qa-karn slapd[14073]: [ID 732783 local4.debug] daemon: 
added 9r
Mar  7 14:12:18 qa-karn slapd[14073]: [ID 732783 local4.debug] daemon: 
added 10r
Mar  7 14:12:18 qa-karn slapd[14073]: [ID 732783 local4.debug] daemon: 
added 11r
Mar  7 14:12:18 qa-karn slapd[14073]: [ID 538834 local4.debug] daemon: 
select: listen=7 active_threads=0 tvp=NULL
Mar  7 14:12:18 qa-karn slapd[14073]: [ID 538834 local4.debug] daemon: 
select: listen=8 active_threads=0 tvp=NULL
Mar  7 14:12:18 qa-karn slapd[14073]: [ID 538834 local4.debug] daemon: 
select: listen=9 active_threads=0 tvp=NULL
Mar  7 14:12:18 qa-karn slapd[14073]: [ID 538834 local4.debug] daemon: 
select: listen=10 active_threads=0 tvp=NULL
Mar  7 14:12:18 qa-karn slapd[14073]: [ID 538834 local4.debug] daemon: 
select: listen=11 active_threads=0 tvp=NULL
Mar  7 14:13:18 qa-karn slapd[14073]: [ID 454241 local4.debug] daemon: 
activity on 1 descriptors
Mar  7 14:13:19 qa-karn slapd[14073]: [ID 923062 local4.debug] daemon: 
new connection on 15
Mar  7 14:13:21 qa-karn slapd[14073]: [ID 732783 local4.debug] daemon: 
added 15r
Mar  7 14:13:22 qa-karn slapd[14073]: [ID 802679 local4.debug] daemon: 
activity on:
Mar  7 14:13:23 qa-karn slapd[14073]: [ID 100000 local4.debug]
Mar  7 14:13:23 qa-karn slapd[14073]: [ID 538834 local4.debug] daemon: 
select: listen=7 active_threads=0 tvp=NULL
Mar  7 14:13:24 qa-karn slapd[14073]: [ID 538834 local4.debug] daemon: 
select: listen=8 active_threads=0 tvp=NULL
Mar  7 14:13:25 qa-karn slapd[14073]: [ID 538834 local4.debug] daemon: 
select: listen=9 active_threads=0 tvp=NULL
Mar  7 14:13:26 qa-karn slapd[14073]: [ID 538834 local4.debug] daemon: 
select: listen=10 active_threads=0 tvp=NULL
Mar  7 14:13:26 qa-karn slapd[14073]: [ID 538834 local4.debug] daemon: 
select: listen=11 active_threads=0 tvp=NULL
Mar  7 14:13:27 qa-karn slapd[14073]: [ID 454241 local4.debug] daemon: 
activity on 1 descriptors
Mar  7 14:13:28 qa-karn slapd[14073]: [ID 802679 local4.debug] daemon: 
activity on: 15r

3) When it spins, take a core dump of slapd with the gcore command.

If you can reproduce this easily, this doc may help see if it's the same 
sort of "descriptor is set when it's not supposed to be" issue that I 
saw and reported in ITS 3400.

I'll help Howard look at the doc if you can collect it (and if he 
wants/needs my help).

Chris