
Re: PANIC: bdb fatal region



----- ldap@mm.st wrote:

> I am rebuilding our aging pre-2.2 OpenLDAP servers that ran the ldbm
> backend and slurpd. We ran this setup without any issues for many
> years.
> 
> The new setup is:
> RH5
> OpenLDAP 2.3.43 (stock RH)
> bdb backend 4.4.20 (stock RH)
> Entries in db: about 1820
> LDIF file is about 1.2M
> Memory: Master 4GB, Slave 2GB (will add two more slaves)
> 
> Database section of slapd.conf:
> database        bdb
> suffix          "o=example.com"
> rootdn          "cn=root,o=example.com"
> rootpw {SSHA} .....
> cachesize 1900
> checkpoint 512 30
> directory       /var/lib/ldap
> index   objectClass,uid,uidNumber,gidNumber,memberUid,uniqueMember    eq
> index   cn,mail,surname,givenname                                     eq,subinitial
> 
> DB_CONFIG:
> set_cachesize 0 4153344 1
> set_lk_max_objects 1500
> set_lk_max_locks 1500
> set_lk_max_lockers 1500
> set_lg_regionmax 1048576
> set_lg_bsize 32768
> set_lg_max 131072
> set_lg_dir /var/lib/ldap
> set_flags DB_LOG_AUTOREMOVE
> 
> This new setup appeared to work great for the last 10 days or so. I
> was able to authenticate clients, add records, etc. Running
> slapd_db_stat -m and slapd_db_stat -c seemed to indicate everything
> was ok. Before I put this setup into production, I got slurpd to
> function, then decided to disable slurpd and use syncrepl in
> refreshonly mode. This also seemed to work fine. I'm not sure if the
> replication started this or not, but wanted to include all the
> events that led up to this.

Replication should not be related at all.

> I have started to get:
> bdb(o=example.com): PANIC: fatal region error detected; run recovery
> on both servers at different times. During this time slapd continues
> to run, which seems to confuse clients that try to use it, and they
> will not try the other server that is listed in ldap.conf. To
> recover I did: service ldap stop; slapd_db_recover -h /var/lib/ldap;
> service ldap start.
> 
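
For anyone following along, that recovery sequence as one script (a
sketch only; the "ldap" init script name and /var/lib/ldap path are
as given above):

    #!/bin/sh
    # Stop slapd first: recovery must not run while another
    # process has the BDB environment open.
    service ldap stop
    # Run Berkeley DB recovery against the database environment.
    slapd_db_recover -h /var/lib/ldap
    # Bring slapd back up.
    service ldap start
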
> I then commented all the replication stuff out in slapd.conf and
> restarted ldap. It will run for a while (varies, 5 minutes - ?),
> then I get the same errors and clients are unable to authenticate.
> On one of the servers I deleted all the files (except DB_CONFIG) and
> did a slapadd of an ldif file that I generate every night (without
> stopping slapd).

You imported while slapd was running? That is a recipe for failure. You can import to a different directory, stop slapd, switch directories, and then start; importing into the live directory while slapd is running is a bad idea.
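
A sketch of that safer procedure, assuming the stock RH layout
(slapd.conf in /etc/openldap, slapd running as the ldap user) and a
nightly dump called backup.ldif (a placeholder name):

    # 1. Build a fresh database in a separate directory while
    #    slapd keeps serving from the old one.
    mkdir /var/lib/ldap.new
    cp /var/lib/ldap/DB_CONFIG /var/lib/ldap.new/

    # 2. Point a temporary copy of slapd.conf at the new
    #    directory and import the LDIF offline.
    sed 's|^directory.*|directory /var/lib/ldap.new|' \
        /etc/openldap/slapd.conf > /tmp/slapd-import.conf
    slapadd -f /tmp/slapd-import.conf -l backup.ldif
    chown -R ldap:ldap /var/lib/ldap.new

    # 3. Only now stop slapd, swap the directories, and restart.
    service ldap stop
    mv /var/lib/ldap /var/lib/ldap.old
    mv /var/lib/ldap.new /var/lib/ldap
    service ldap start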

> Same results once I started slapd again. I have enabled debug for
> slapd and have not seen anything different; I attached gdb to the
> running slapd and no errors are noted. I even copied back a backup
> copy of slapd.conf from prior to the replication settings (even
> though they are commented out), thinking that maybe something in
> there was causing it.
> 
> Then after several recoveries as described above, the systems seem
> to be working again. One has not generated the error for over 5.5
> hours; the other has not had any problems for 2 hours. For some
> reason, after that period when the errors showed up for a while,
> things seem to be working again, at least for now.
> 
> I'm nervous about putting this into production until I can get this
> to function properly without these issues. During the 10-day period
> with everything working well, the slave would occasionally (rarely)
> get the error and I would do a recovery, but we thought this was due
> to possible hardware problems. Now I'm not so sure.
> 
> I have a monitor script that runs slapd_db_stat -m and -c every 5
> minutes and nothing seems wrong there, as far as I can tell. I'm
> hoping someone can help me determine possible causes or things to
> look at.

I would recommend that any server that hasn't had a clean import done while slapd is *NOT* running get one. Run it for a few days and see whether any problems show up.
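
Since slapd_db_stat apparently shows nothing wrong even while the
panic is happening, it may also be worth having your monitor watch
the logs for the panic string. A sketch, assuming slapd logs through
syslog to /var/log/messages (the path, state file, and recipient are
placeholders):

    #!/bin/sh
    # Alert if slapd has logged a BDB panic since the last check.
    LOG=/var/log/messages
    STATE=/var/run/bdb-panic.count
    last=$(cat "$STATE" 2>/dev/null || echo 0)
    now=$(grep -c 'PANIC: fatal region error' "$LOG")
    if [ "$now" -gt "$last" ]; then
        echo "bdb panic logged on $(hostname); run recovery" \
            | mail -s "slapd bdb panic" root
    fi
    echo "$now" > "$STATE"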

I have been running 2.3.43 for years (my own packages on RHEL4, then my own packages on RHEL5, now some boxes run the RHEL packages) and have never seen *this* issue, with a *much* larger directory (and about 8 replicas, though not all replicas have all databases).

Usually database corruption is due to hardware failure, unclean shutdown, or finger trouble.

Regards,
Buchan