
Re: 8 hours tests ends with inconsistent DB.



Hmmm.... Well, I think OpenLDAP is definitely rock solid for most of the
cases it is used in. But there are cases where problems arise. I've seen
DB corruptions, attributes that suddenly could not be found anymore
although they had been in the schema definition for weeks, transaction logs
suddenly owned by root although OpenLDAP was running as the "openldap" user,
and things like that. I haven't opened a case for these issues because until
now I believed that these problems were simply caused by misconfiguration
for which I'm responsible. That's why people seek help on a mailing list.
Maybe someone else has had such problems in the past and so he or
she could help quickly.

For the problems I mentioned above it now really seems to be my own
fault. The DB corruption I can now reproduce. In this case I'm loading
500,000 entries into the directory with ldapadd. An entry consists of
about 23 attributes. After about 440,000 entries I get the following
messages:
....
conn=1 op=442515 ADD dn="uid=442515,ou=icpuser,l=root"
conn=1 op=442515 RESULT tag=105 err=0 text=
conn=1 op=442516 ADD dn="uid=442516,ou=icpuser,l=root"
bdb(l=root): malloc: Cannot allocate memory: 1147
free(): invalid pointer 0x925b44c8!
conn=1 op=442516 RESULT tag=105 err=80 text=entry store failed
conn=1 op=442517 ADD dn="uid=442517,ou=icpuser,l=root"
bdb(l=root): malloc: Cannot allocate memory: 32768
free(): invalid pointer 0x925b48c8!
bdb(l=root): PANIC: Cannot allocate memory
free(): invalid pointer 0x925b4978!
slapd shutdown: waiting for 1 threads to terminate
bdb(l=root): PANIC: fatal region error detected; run recovery
bdb(l=root): PANIC: fatal region error detected; run recovery
bdb(l=root): PANIC: fatal region error detected; run recovery
bdb(l=root): PANIC: fatal region error detected; run recovery
bdb(l=root): PANIC: fatal region error detected; run recovery
bdb(l=root): PANIC: fatal region error detected; run recovery
bdb(l=root): PANIC: fatal region error detected; run recovery
bdb(l=root): PANIC: fatal region error detected; run recovery
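
To know where an interrupted ldapadd run has to be resumed, it helps to find the first operation that did not return err=0. The following is only a minimal sketch: it assumes the log lines look exactly like the ones quoted above, and the function name first_failed_op is made up for illustration.

```python
import re

# Matches slapd log lines such as:
#   conn=1 op=442516 RESULT tag=105 err=80 text=entry store failed
RESULT_RE = re.compile(r"conn=(\d+) op=(\d+) RESULT tag=(\d+) err=(\d+) text=(.*)")


def first_failed_op(lines):
    """Return (conn, op, err, text) of the first RESULT with err != 0, else None."""
    for line in lines:
        m = RESULT_RE.search(line)
        if m and int(m.group(4)) != 0:
            return int(m.group(1)), int(m.group(2)), int(m.group(4)), m.group(5).strip()
    return None


# Sample taken from the log excerpt in this mail.
sample = [
    'conn=1 op=442515 ADD dn="uid=442515,ou=icpuser,l=root"',
    "conn=1 op=442515 RESULT tag=105 err=0 text=",
    'conn=1 op=442516 ADD dn="uid=442516,ou=icpuser,l=root"',
    "bdb(l=root): malloc: Cannot allocate memory: 1147",
    "conn=1 op=442516 RESULT tag=105 err=80 text=entry store failed",
]

print(first_failed_op(sample))  # (1, 442516, 80, 'entry store failed')
```

In the excerpt above this points at op=442516, so uid=442515 is the last entry that was reported as stored.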

Now I'm doing a "db_recover -c -v" (I also tried a "normal" "db_recover -v").
This command comes back with the message that db recovery completed
successfully:

db_recover: Finding last valid log LSN: file: 101 offset 36674186
db_recover: Recovery starting from [1][28]
db_recover: Recovery complete at Sun Jun 13 14:57:52 2004
db_recover: Maximum transaction ID 800e158c Recovery checkpoint [101][36674186]


If I now start OpenLDAP again I get the following messages:

....
bdb(l=root): PANIC: fatal region error detected; run recovery
bdb_db_open: dbenv_open failed: DB_RUNRECOVERY: Fatal error, run database recovery (-30978)
backend_startup: bi_db_open(0) failed! (-30978)
bdb(l=root): txn_checkpoint interface requires an environment configured for the transaction subsystem
bdb_db_destroy: txn_checkpoint failed: Invalid argument (22)
slapd stopped.
connections_destroy: nothing to destroy.


Hmmm... According to db_recover the database should now be in a consistent
state, but OpenLDAP doesn't share this opinion. According to the first
messages we are running out of some needed resource. That can happen.
But as I already mentioned above, a db_recover should bring the database back
into a consistent state, and according to db_recover this is the case. Still,
I couldn't start OpenLDAP; it quit with the messages shown above. This was
yesterday. Today I ran "db_recover -c -v" again and then started
OpenLDAP again. And OpenLDAP started without problems! What happened? Something
must have changed since yesterday, but I haven't changed anything... I think
that OpenLDAP still held resources for a while after it crashed, and that the
kernel has freed them by today. So, as Quanah mentioned, it seems that I have
to increase kernel resources again. Maybe if I had just rebooted the server
yesterday I could have continued with ldapadd. That would not have solved the
problem, of course; it would just have been a quick "hack" ;-)


To make everything complete, here is the configuration I used (I hope I have included everything):

DB_CONFIG:
set_cachesize           0       524288000       0
set_shm_key             1
set_lg_regionmax        1048576
set_lg_max              52428800
set_lg_bsize            2097152
set_lk_max_lockers      1000 # default
set_lk_max_locks        1000 # default
set_lk_max_objects      1000 # default
set_tx_max              100
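
As a rough sanity check of the numbers above: set_cachesize takes <gbytes> <bytes> <ncache>, so the line above asks for a 500 MiB cache, and with set_shm_key the environment lives in System V shared memory. The sketch below compares that cache size with the shmmax value quoted further down in this mail. Whether shmmax is the limit that was actually hit here is only an assumption; the helper name cachesize_bytes is made up for illustration.

```python
def cachesize_bytes(db_config_text):
    """Return the cache size in bytes from a 'set_cachesize <gbytes> <bytes> <ncache>' line."""
    for raw in db_config_text.splitlines():
        # Drop anything after '#' so trailing remarks do not confuse the parse.
        fields = raw.split("#", 1)[0].split()
        if len(fields) == 4 and fields[0] == "set_cachesize":
            gbytes, nbytes = int(fields[1]), int(fields[2])
            return gbytes * 1024**3 + nbytes
    return None


DB_CONFIG = """\
set_cachesize           0       524288000       0
set_shm_key             1
"""
SHMMAX = 2147483648  # /proc/sys/kernel/shmmax as quoted in this mail

cache = cachesize_bytes(DB_CONFIG)
verdict = "fits in shmmax" if cache <= SHMMAX else "exceeds shmmax"
print(cache // 2**20, "MiB cache;", verdict)  # 500 MiB cache; fits in shmmax
```

So the cache alone is well below shmmax, which does not by itself explain the malloc failures in the log.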

slapd.conf (extract of relevant settings):
loglevel        96
idletimeout     10
sizelimit       unlimited
threads         16
cachesize       10000
checkpoint      1024    1

(Note: A checkpoint every 1 min. or 1024 kByte should really be
no problem. The I/O subsystem is happy with these settings. I set
it this low because I don't want to lose too much information in case
of a crash.)

Kernel resources:
cat /proc/sys/kernel/sem
250     32000   32      128
cat /proc/sys/kernel/shmall
2147483648
cat /proc/sys/kernel/shmmax
2147483648
cat /proc/sys/kernel/shmmni
4096
cat /proc/sys/kernel/msgmax
8192
cat /proc/sys/kernel/msgmnb
16384
cat /proc/sys/kernel/msgmni
1024
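
For reading these values: the four fields of kernel.sem are SEMMSL, SEMMNS, SEMOPM and SEMMNI, shmmax is in bytes, and shmall is counted in pages, not bytes. The sketch below just converts the numbers quoted above; the 4096-byte page size is an assumption for this x86 machine, and parse_sem is a made-up helper name.

```python
PAGE_SIZE = 4096  # assumption: page size on this x86 box


def parse_sem(text):
    """Map the four kernel.sem fields to their conventional names."""
    names = ("semmsl", "semmns", "semopm", "semmni")
    return dict(zip(names, (int(v) for v in text.split())))


# Values copied from the /proc output in this mail.
sem = parse_sem("250     32000   32      128")
shmall_pages = 2147483648
shmmax_bytes = 2147483648

print(sem)
print("shmall allows", shmall_pages * PAGE_SIZE // 2**30, "GiB of shared memory in total")
print("shmmax allows", shmmax_bytes // 2**30, "GiB per segment")
```

Under that page-size assumption shmall is far larger than the 2 GByte of RAM in this machine, so the per-segment limit (shmmax) and the semaphore limits are the values more likely to matter when tuning.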

Hardware:
Fujitsu Siemens RX300
2x Intel Xeon CPU 3.06GHz
2 GByte RAM
I/O Compaq EVA SAN (this is NOT a remote filesystem like NFS! It's SCSI over FibreChannel)


OS:
Redhat ES 3 (Update 1), Kernel 2.4.21
libc 2.3.2

For the next test I will increase the kernel resources. I will take
the recommendations Oracle gives for SHM/SEM because I haven't found such (good)
information for OpenLDAP until now.


Cheers,
Robert


Trevor Warren wrote:

--- Wesley D Craig <wes@umich.edu> wrote:

On 12 Jun 2004, at 07:48, Trevor Warren wrote:
think this tells you anything about scale. This *might* tell you
something about how easy it is to give the application to someone who
doesn't know anything. The test you're performing might be


[snip]

 With about half a decade with FLOSS I may not be a
guru at it, but I surely know a thing or two about
/proc configurations and appropriate hdparms that
could set your config vroooming.

 Thanks for all the criticism, Wes.

Trevor





appropriate if you're hoping to repackage and distribute OpenLDAP to 50
million customers.
:wes





=====
( >-                                           -< )
/~\    ______________________________________   /~\
|  \) /    Scaling FLOSS in the Enterprise   \ (/ |
|_|_  \        trevorwarren@yahoo.com        / _|_|
      \____________________________________/



