We have an environment with ~230 client machines (Linux) and a couple of
servers (Linux, Solaris) that query 4 LDAP replicas (2 Linux, 2 Solaris)
under DNS round robin. We have a lot of problems because we're constantly
hitting the 1024 fd limit of select(). I'm sure that our environment does
not produce more than 4000 connections to LDAP at the

I found something interesting: lsof shows that certain connections seem
to live forever on the Linux replicas, but not on the Solaris replicas.
For example, one client has 8 established connections on port 389 and 3
in TIME_WAIT, yet at the same time there are 17 connections from that
same client on one Linux replica server, 15 on the other, but only 1 on
both Solaris replicas. Why don't these dead connections time out?
Could there be some interaction with some Linux kernel syscall?
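
To compare what a client thinks it has open with what each replica is
still holding, the per-client counts can be tallied by TCP state. This is
only a rough sketch: it assumes the column layout of Linux `netstat -tn`
(Solaris netstat prints a different layout), the function name is made up
for illustration, and the sample addresses are placeholders from the
192.0.2.0/24 documentation range, not real hosts.

```python
from collections import Counter


def count_ldap_connections(netstat_lines, port=389):
    """Count connections to local `port`, keyed by (client_ip, tcp_state).

    Assumes Linux `netstat -tn` columns:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    counts = Counter()
    for line in netstat_lines:
        fields = line.split()
        # Skip headers and anything that is not a TCP row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        # Keep only connections whose local end is the LDAP port.
        if local.rsplit(":", 1)[-1] != str(port):
            continue
        client_ip = remote.rsplit(":", 1)[0]
        counts[(client_ip, state)] += 1
    return counts


# Hypothetical sample input, as it might look on one replica.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 192.0.2.10:389          192.0.2.55:40112        ESTABLISHED
tcp        0      0 192.0.2.10:389          192.0.2.55:40113        ESTABLISHED
tcp        0      0 192.0.2.10:389          192.0.2.55:40090        TIME_WAIT
tcp        0      0 192.0.2.10:22           192.0.2.77:51000        ESTABLISHED
""".splitlines()

if __name__ == "__main__":
    for (client, state), n in sorted(count_ldap_connections(SAMPLE).items()):
        print(client, state, n)
```

Running the same tally on a client (matching on the remote port instead)
and on each replica would show whether the server side is holding
ESTABLISHED connections that the client has already dropped.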

So what happens is that both Linux replicas get bogged down and the
trouble begins; then both Solaris replicas get more and more connections,
which means more trouble ... Our short-term solution is to move all
replicas to Solaris until this problem is resolved, although OpenLDAP on
Solaris is noticeably slower than on Linux.

And is there really no way to drop select() and its limit of 1024 file
descriptors and use something else instead? poll(), maybe?
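
For context on the select() versus poll() difference: select() works on
an fd_set, a fixed-size bitmap of FD_SETSIZE bits (typically 1024), while
poll() takes a list of descriptors and is bounded only by the per-process
open file limit. The sketch below illustrates the poll() interface using
Python's select.poll wrapper. It is not slapd code and says nothing about
whether slapd 2.0.11 can be built to use poll(); the helper name
wait_readable is hypothetical.

```python
import select
import socket


def wait_readable(socks, timeout_ms=1000):
    """Return the sockets in `socks` that are readable (or hung up/errored).

    Uses poll(), which takes a list of registered descriptors rather than
    a fixed-size bitmap, so the descriptor numbers are not capped by
    FD_SETSIZE.
    """
    poller = select.poll()
    by_fd = {}
    for s in socks:
        poller.register(s.fileno(), select.POLLIN)
        by_fd[s.fileno()] = s

    ready = []
    # poll() returns (fd, event_mask) pairs for descriptors with activity.
    for fd, events in poller.poll(timeout_ms):
        if events & (select.POLLIN | select.POLLHUP | select.POLLERR):
            ready.append(by_fd[fd])
    return ready


if __name__ == "__main__":
    a, b = socket.socketpair()
    a.sendall(b"ping")
    print(wait_readable([b]) == [b])
    a.close()
    b.close()
```

Note that poll() only removes the bitmap size limit; the process still
needs a high enough open file limit (ulimit -n) for the number of
connections it is expected to hold.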

Linux replicas: OpenLDAP 2.0.11, kernel 2.2.18, bdb-2.7.7 backend.
Solaris replicas: Solaris 7, same OpenLDAP and bdb.
Linux clients: Red Hat 7.0, 2.4.2 kernel, nss_ldap-122-1.7,