Re: Problem unexpected failing slapd



Ruud Baart wrote:
Problem:
We have used LDAP for a customer for many years. Last year the slapd
service suddenly stopped without any traces in the logfiles. After a
restart of slapd everything worked fine again. But the problem came back:
it was not a one-off incident. Now and then slapd just stops, always
without any traces in the logfiles. Sometimes three times a day,
sometimes a week without a failure. I can't find a pattern or any
relation to any other service on the Linux server.

Attach to the running slapd with gdb, type
	handle all nostop
	continue
and let it run. If there's a crash you'll see what happened in gdb.
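
A minimal sketch of scripting that attach step, written in Python purely
for illustration. It assumes gdb and pidof are installed and that the
first PID reported by pidof is the slapd to watch; the function names
(find_pid, gdb_attach_argv, main) are hypothetical helpers, not part of
any OpenLDAP tooling. It only wraps the two gdb commands given above.

```python
import subprocess


def gdb_attach_argv(pid):
    """Build a gdb command line that attaches to `pid` and then runs
    the two commands from the message: handle all nostop, continue."""
    return [
        "gdb",
        "-p", str(pid),               # attach to the running process
        "-ex", "handle all nostop",   # do not stop when a signal arrives
        "-ex", "continue",            # let slapd keep running
    ]


def find_pid(name="slapd"):
    """Return the first PID reported by pidof, or None if none is found."""
    result = subprocess.run(["pidof", name], capture_output=True, text=True)
    pids = result.stdout.split()
    return int(pids[0]) if pids else None


def main():
    pid = find_pid()
    if pid is None:
        raise SystemExit("slapd is not running")
    # Run gdb in the foreground so that whatever it reports stays visible.
    subprocess.call(gdb_attach_argv(pid))
```

Calling main() as root from a terminal that stays open (for example
inside screen) keeps the gdb session around until the next failure.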

Environment:
- Several (Debian squeeze) servers, several Windows servers. We use the
bdb database backend.
- There is one master LDAP server, which provides syncprov, and two
replica LDAP servers (syncrepl). The master server is the most heavily
used (mainly Samba as primary domain controller: a few hundred user
accounts, a lot of group accounts, workstations, ACLs, etc.). One of the
replicas is not very busy but handles the mail for all users (lookups:
amavis, postfix, courier-imap, mail account settings, etc.). The other
replica is not busy at all; it is at a remote location.
- The total LDAP directory is 3700 DNs; slapcat produces a file of 7.3 MB.
- It is only the master LDAP server which stops suddenly. I have never
seen a failure of a replica LDAP server.

Because I have no clear idea about the problem I have no idea which
technical details are relevant:
DB_CONFIG
===========
set_cachesize 0 10485760 1
set_lk_max_objects 10000
set_lk_max_locks 10000
set_lk_max_lockers 10000
set_lg_dir /home/ldap-dbd
The database is stored on an ext3 filesystem, kernel 2.6.32. The server
has no problems, plenty of memory and a fast disk array (SAS->SATA).
There have never been technical problems with this server, and it worked
without problems for a long period. Nothing has changed in the
environment or the LDAP setup (except, of course, the upgrade to Debian
squeeze, but the problem was already there before it).

What we have tried:
- Upgraded from OpenLDAP 2.4.17 (Debian lenny+backports) to OpenLDAP
2.4.23 (Debian squeeze). I saw in the release notes that problems
related to syncrepl were solved. Therefore we waited for version 2.4.23
to become available in Debian. This upgrade made no difference.
- Reindexed and rebuilt the directory. When I rebuild the LDAP directory
from a clean LDIF file with ldapadd, on the master or on another
machine, there is not a single error or warning.

The workaround for the moment:
I have written a process monitor (Perl daemon) which monitors the slapd
daemon and restarts slapd if it suddenly stops. It is of course not a
solution, but the 300 users can work. If slapd is not restarted within
1 minute of stopping, a few hundred people can't work because Samba
stops working.
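
For readers who want a comparable stopgap, here is a minimal sketch of
such a process monitor, in Python rather than the Perl of the original
daemon (which is not shown in this message). Everything in it is an
assumption: the 5 second poll interval, the use of pidof to detect the
process, and /etc/init.d/slapd as the restart command. The names
is_running, restart_slapd and watch are hypothetical.

```python
import subprocess
import time


def is_running(name="slapd"):
    """True if pidof finds at least one process with this name."""
    return subprocess.call(["pidof", name], stdout=subprocess.DEVNULL) == 0


def restart_slapd():
    # Assumption: the Debian init script; adjust for the local setup.
    subprocess.call(["/etc/init.d/slapd", "start"])


def watch(check, restart, interval=5.0, sleep=time.sleep, max_checks=None):
    """Poll `check` and call `restart` whenever it reports the service
    down. Returns the number of restarts; with max_checks=None it loops
    forever and never returns."""
    restarts = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        if not check():
            # A timestamp per restart may help to spot a pattern later.
            print(time.strftime("%Y-%m-%d %H:%M:%S"), "service down, restarting")
            restart()
            restarts += 1
        sleep(interval)
        checks += 1
    return restarts
```

Running watch(is_running, restart_slapd) polls until interrupted. The
check and restart callables are parameters so that the loop can be
exercised without a real slapd.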

I would like to receive suggestions on what we can do to find the
problem. Because there is no pattern and nothing in the logfiles, I
don't know where to start.



--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/