[Date Prev][Date Next] [Chronological] [Thread] [Top]

(ITS#3384) slapd stops forking after the forked processes are hung on read(stdin)



Full_Name: Tho Tran
Version: 2.2.17
OS: FC2
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (192.128.134.68)


Issues: 

The shell-database programs which is written in Python are hung on
sys.stdin.readlines() after they are forked by slapd.   This happens after the
slapd server is stressed with multiple ldapsearch client applications. 
When the shell-database application are hung on read, slapd continues to process
ldap requests but it stops forking the shell-database program. 
 

Environment:

LINUX Fedora-core-2 
openLDAP 2.2.17 and Python 2.3.3. 
slapd.conf is configured to have up to 128 threads. 
The LDAP client can run on the same or another machine.  The client is a shell
program like this. 
 

desthost=fllinux1
cnt=0
nloop=100
echo "Authenticating test1@att.com" > test1.log

while [ $cnt -lt $nloop ]
do
        sleep 0
        cnt=`expr $cnt + 1`

        ldapsearch -h $desthost -x -v -b 'dc=openldap,dc=com'
'email=test1@att.com'

        rc=`ldapsearch -h $desthost -x -D
'cn=test1@att.com,ou=people,dc=openldap,dc=com' -w 'password'`

        if [ ! "$rc" ]
        then
                echo "`date`: loop $cnt: Bind test1@att.com FAIL" >> test1.log
        else
                echo "`date`: loop $cnt: Bind test1@att.com OK" >> test1.log
        fi
done

The LDAP server is configured to use shell for database.  The shell program is
written in Python, search.py.  For each LDAP search or bind request, slapd forks
the search.py program.  The search.py performs read using sys.std.readlines()
and parses the LDAP request.  For the search request, search.py simply
constructs a proper LDAP response and sends the reply via stdout using Python
print statement.  For the bind request, the application contacts another server
via HTTP to valid the entity, constructs a proper LDAP response, and sends the
reply via stdout.  (For the bind request in the test, I ran both cases, one
contacting another server via HTTP to validate the request and one without
contacting another server.  In both cases, I encounter the same issues. 
However, based on my observation, the issues surface sooner and with less
clients when the code that processes the bind request does not contact the
remote server for validating data.   I doubt this has anything to do with the
root cause of the issues.) 
 

Test environments:
After running 3 to 4 client applications for a short while, two search.py
processes are hung on read(0, ?) and slapd stops forking new search.py
processes.  However, slapd continues to process new LDAP requests according to
/var/log/ldap.log. 
When search.py processes are hung, lsof ­i :389 shows that there are only 11
processes having access to port 389. 
Here is the output from ?ps ­ef | grep search? showing the hung processes. 
 

[root@fllinux1 openldap]#
[root@fllinux1 openldap]# ps -ef | grep search
ldap     23044 21440  0 13:55 ?        00:00:00 python
/usr/local/etc/openldap/search.py
ldap     23045 21732  0 13:55 ?        00:00:00 python
/usr/local/etc/openldap/search.py
root     23174 21236  0 14:17 pts/15   00:00:00 grep search

Here is the output from ?lsof? output. 
[root@fllinux1 openldap]# lsof -i :389

BAD CASES:
COMMAND   PID USER   FD   TYPE DEVICE SIZE NODE NAME
slapd   21422 ldap    7u  IPv4  93404       TCP *:ldap (LISTEN)
slapd   21422 ldap    8u  IPv4 101998       TCP
fllinux1.web.att.com:ldap->ibm-linux-1:45050 (ESTABLISHED)
slapd   21422 ldap    9u  IPv4 102005       TCP
fllinux1.web.att.com:ldap->ibm-linux-1:45052 (ESTABLISHED)
slapd   21422 ldap   10u  IPv4 102083       TCP
fllinux1.web.att.com:ldap->ibm-linux-1:45053 (ESTABLISHED)
python  23044 ldap    7u  IPv4  93404       TCP *:ldap (LISTEN)
python  23044 ldap    8u  IPv4 101998       TCP
fllinux1.web.att.com:ldap->ibm-linux-1:45050 (ESTABLISHED)
python  23044 ldap    9u  IPv4 102005       TCP
fllinux1.web.att.com:ldap->ibm-linux-1:45052 (ESTABLISHED)
python  23045 ldap    7u  IPv4  93404       TCP *:ldap (LISTEN)
python  23045 ldap    8u  IPv4 101998       TCP
fllinux1.web.att.com:ldap->ibm-linux-1:45050 (ESTABLISHED)
python  23045 ldap    9u  IPv4 102005       TCP
fllinux1.web.att.com:ldap->ibm-linux-1:45052 (ESTABLISHED)

The lsof shows that each search.py process (pid 23044 and pid 23045) has two fds
opened on port 389.  Is this right?  There should only be one fd on port 389.

In the normal case, there should be only one fd opened on port 389.  See lsof
data below.
GOOD CASES:

COMMAND   PID USER   FD   TYPE DEVICE SIZE NODE NAME
slapd   21422 ldap    7u  IPv4  93404       TCP *:ldap (LISTEN)
slapd   21422 ldap    8u  IPv4 104357       TCP
fllinux1.web.att.com:ldap->ibm-linux-1:45109 (ESTABLISHED)
python  23426 ldap    7u  IPv4  93404       TCP *:ldap (LISTEN)
python  23426 ldap    8u  IPv4 104357       TCP
fllinux1.web.att.com:ldap->ibm-linux-1:45109 (ESTABLISHED)

Here is the output from strace of two hung processes 
[root@fllinux1 openldap]#
[root@fllinux1 openldap]# strace -p 23044
Process 23044 attached - interrupt to quit
read(0,  <unfinished ...>
Process 23044 detached
[root@fllinux1 openldap]# strace -p 23045
Process 23045 attached - interrupt to quit
read(0,  <unfinished ...>
Process 23045 detached
[root@fllinux1 openldap]#

Here is the output from /var/log/ldap.log showing that slapd continues to
process incoming request but it does not want to fork more search.py
application.  The ldap.log did not show forking activity.  Because I did not see
any new search.py processes and my search.py application log did not show new
activities, I inferred that slapd did not fork new application to handle the new
ldap request. 
 
Nov  4 15:20:10 fllinux1 slapd[21422]: daemon: activity on 1 descriptors
Nov  4 15:20:10 fllinux1 slapd[21422]: daemon: new connection on 16
Nov  4 15:20:10 fllinux1 slapd[21422]: conn=927 fd=16 ACCEPT from
IP=135.16.76.143:45059 (IP=0.0.0.0:389)
Nov  4 15:20:10 fllinux1 slapd[21422]: daemon: added 16r
Nov  4 15:20:10 fllinux1 slapd[21422]: daemon: activity on:
Nov  4 15:20:10 fllinux1 slapd[21422]:
Nov  4 15:20:10 fllinux1 slapd[21422]: daemon: select: listen=6 active_threads=0
tvp=NULL
Nov  4 15:20:10 fllinux1 slapd[21422]: daemon: select: listen=7 active_threads=0
tvp=NULL
Nov  4 15:20:10 fllinux1 slapd[21422]: daemon: activity on 1 descriptors
Nov  4 15:20:10 fllinux1 slapd[21422]: daemon: activity on:
Nov  4 15:20:10 fllinux1 slapd[21422]:  16r
Nov  4 15:20:10 fllinux1 slapd[21422]:
Nov  4 15:20:10 fllinux1 slapd[21422]: daemon: read activity on 16
Nov  4 15:20:10 fllinux1 slapd[21422]: connection_get(16)
Nov  4 15:20:10 fllinux1 slapd[21422]: connection_get(16): got connid=927
Nov  4 15:20:10 fllinux1 slapd[21422]: connection_read(16): checking for input
on id=927
Nov  4 15:20:10 fllinux1 slapd[21422]: ber_get_next on fd 16 failed errno=11
(Resource temporarily unavailable)
Nov  4 15:20:10 fllinux1 slapd[21422]: do_bind
Nov  4 15:20:10 fllinux1 slapd[21422]: >>> dnPrettyNormal:
<cn=admin,ou=people,dc=openldap,dc=com>
Nov  4 15:20:10 fllinux1 slapd[21422]: <<< dnPrettyNormal:
<cn=admin,ou=people,dc=openldap,dc=com>,
<cn=admin,ou=people,dc=openldap,dc=com>
Nov  4 15:20:10 fllinux1 slapd[21422]: do_bind: version=3
dn="cn=admin,ou=people,dc=openldap,dc=com" method=128
Nov  4 15:20:10 fllinux1 slapd[21422]: conn=927 op=0 BIND
dn="cn=admin,ou=people,dc=openldap,dc=com" method=128
Nov  4 15:20:10 fllinux1 slapd[21422]: => access_allowed: auth access to
"cn=admin,ou=people,dc=openldap,dc=com" "entry" requested
Nov  4 15:20:10 fllinux1 slapd[21422]: => access_allowed: backend default auth
access granted to "(anonymous)"
Nov  4 15:20:10 fllinux1 slapd[21422]: daemon: select: listen=6 active_threads=0
tvp=NULL
Nov  4 15:20:10 fllinux1 slapd[21422]: daemon: select: listen=7 active_threads=0
tvp=NULL