Re: Problem unexpected failing slapd

Sorry, I overlooked this info:

"The server 
has no problems, plenty of memory and a fast diskarray (SAS->SATA). 
Never technical problems with this server. And it worked without 
problems for a long period." 

Which tells us that your system runs on a physical box (bare metal).
I am afraid you've got a hardware problem of some sort.
I advise you to start checking all hardware components (or just replace
the box).
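
As a first, cheap step before swapping parts, it may be worth scanning the
kernel log for messages that usually accompany hardware trouble or a
killed process. The sketch below is only an illustration: the pattern list
is my own guess at typical Linux kernel messages, not an exhaustive or
authoritative set, and the log path in the note after it is the usual
Debian default, which I am assuming applies to your server.

```python
import re

# Assumption: a hand-picked set of substrings that commonly show up in
# Linux kernel logs around hardware faults, crashes, or the OOM killer.
SUSPICIOUS = re.compile(
    r"machine check|hardware error|I/O error|EDAC|segfault|out of memory",
    re.IGNORECASE,
)


def suspicious_lines(lines):
    """Return the log lines that match one of the patterns above."""
    return [line.rstrip("\n") for line in lines if SUSPICIOUS.search(line)]


def scan_file(path):
    """Scan one log file; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as handle:
        return suspicious_lines(handle)
```

Something like scan_file("/var/log/kern.log") would then list the
candidates. An empty result does not prove the hardware is fine; it only
means the kernel did not log anything matching these patterns.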

Regards, Kuba 

On Sun, 2011-02-27 at 12:57 +0100, Ruud Baart wrote:
> Problem:
> For a customer we have been using LDAP for many years. Last year the slapd 
> service suddenly just stopped without any traces in the logfiles. After a 
> restart of slapd everything worked fine again. But the problem came back: it 
> was not a one-off incident. Now and then slapd just stops, always without any 
> traces in the logfiles. Sometimes three times a day, sometimes a week 
> without a failure. I can't find a pattern or any relation to any other 
> service on the Linux server.
> Environment:
> - Several (Debian squeeze) servers, several Windows servers. We use the bdb 
> database backend.
> - There is one master LDAP server, which provides syncprov, and two 
> replica LDAP servers (syncrepl). The master server is the most intensively 
> used (mainly Samba as primary domain controller: a few hundred user accounts, 
> lots of group accounts, workstations, ACLs, etc.). One of the replicas 
> is not very busy but handles the mail for all users (lookups: amavis, 
> postfix, courier-imap, mail account settings, etc.). The other replica is 
> not busy at all; it is at a remote location.
> - In total the LDAP directory holds 3700 DNs; slapcat produces a file of 7.3 MB.
> - It is only the master LDAP server which stops suddenly. I have never seen a 
> failure of a replica LDAP server.
> Because I have no clear idea about the problem I have no idea which 
> technical details are relevant:
> ===========
> set_cachesize 0 10485760 1
> set_lk_max_objects 10000
> set_lk_max_locks 10000
> set_lk_max_lockers 10000
> set_lg_dir /home/ldap-dbd
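
For reference, the set_cachesize line above takes three numbers: gigabytes,
additional bytes, and the number of cache regions. A small sketch of the
arithmetic, assuming these lines come from the DB_CONFIG file of the bdb
backend (the helper name is made up for illustration):

```python
def parse_set_cachesize(line):
    """Return (total_bytes, ncache) for a DB_CONFIG set_cachesize line."""
    keyword, gbytes, extra_bytes, ncache = line.split()
    if keyword != "set_cachesize":
        raise ValueError("not a set_cachesize line: %r" % line)
    # Total cache = gbytes * 1 GiB + extra_bytes, split over ncache regions.
    total = int(gbytes) * 1024 ** 3 + int(extra_bytes)
    return total, int(ncache)


total, ncache = parse_set_cachesize("set_cachesize 0 10485760 1")
print("%.0f MiB in %d region(s)" % (total / 1024 ** 2, ncache))
```

So the configured cache is 10 MiB in a single region. The 7.3 MB slapcat
output is not directly comparable, since the on-disk database with its
indexes has a different size than the LDIF dump, and a small cache would
normally be expected to cost performance rather than make slapd exit.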
> The database is stored on an ext3 filesystem, kernel 2.6.32. The server 
> has no problems, plenty of memory and a fast diskarray (SAS->SATA). 
> Never technical problems with this server. And it worked without 
> problems for a long period. Nothing has changed in the environment or 
> the LDAP setup (except, of course, the upgrade to Debian squeeze, but 
> the problem was already there before that).
> What we have tried:
> - Upgrade from OpenLDAP 2.4.17 (Debian lenny + backports) to OpenLDAP 
> 2.4.23 (Debian squeeze). I saw in the release notes that problems 
> related to syncrepl were solved. Therefore we waited for version 2.4.23 
> to become available in Debian. This upgrade made no difference.
> - Reindexed and rebuilt the directory. When I rebuild the LDAP directory 
> from a clean LDIF file with ldapadd, on the master LDAP server or on another 
> machine, there is not a single error or warning.
> The workaround for the moment:
> I have written a process monitor (Perl daemon) which monitors the slapd 
> daemon, and if it suddenly stops, slapd is restarted. It is of course not 
> a solution, but the 300 users can work. If slapd is not restarted within 
> 1 minute after it stops, a few hundred people can't work because Samba 
> stops working.
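
A monitor like this can also help with the diagnosis if it starts slapd
itself as a child process, because the parent then learns how slapd ended:
a normal exit status, or the signal that killed it. Below is a minimal
sketch of that idea in Python (not your Perl daemon); the function names,
the restart policy, and the logging are all my own assumptions.

```python
import signal
import subprocess
import time


def describe_exit(returncode):
    """Turn a subprocess return code into a readable description."""
    if returncode < 0:
        # On POSIX, subprocess reports death by signal N as -N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = "signal %d" % -returncode
        return "killed by signal " + name
    return "exited with status %d" % returncode


def supervise(cmd, max_restarts=None, delay=1.0, log=print):
    """Run cmd in the foreground, log how it ended, and restart it.

    Stops after max_restarts restarts (None means restart forever) and
    returns the number of restarts performed.
    """
    restarts = 0
    while True:
        started = time.time()
        returncode = subprocess.call(cmd)
        log("%s: %s after %.0f s"
            % (cmd[0], describe_exit(returncode), time.time() - started))
        if max_restarts is not None and restarts >= max_restarts:
            return restarts
        restarts += 1
        time.sleep(delay)
```

For this to work slapd has to stay in the foreground, which it does when
started with the -d option (even with level 0), for example
supervise(["/usr/sbin/slapd", "-d", "0"]) plus whatever other options your
init script passes; that command line is an assumption about your setup. A
logged SIGSEGV or SIGBUS would point to a crash, a SIGKILL to something
else killing the process.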
> I would like to receive suggestions on what we can do to find the problem. 
> Because there is no pattern and nothing in the logfiles, I don't know where 
> to start.