
openldap-2.1.29/db-4.2.52.2 and db environment issues



I am implementing a highly-available LDAP system for an ISP, clustering 
openldap-2.1.x/db-4.2.52.2 with back-bdb, using Red Hat Cluster Manager 
for the master (which must be highly available) and stand-alone slaves.

I have been testing both hot backups (via the steps recommended by the 
Berkeley DB documentation) and hot restores (by restoring a backup on a 
slave).
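For reference, the backup procedure I follow is essentially the one the 
Berkeley DB documentation describes: copy the database files first, then 
the transaction logs, then run catastrophic recovery on the copy. A 
minimal sketch of those steps (the backup_bdb_env helper name is mine, 
not from my actual script, and it assumes back-bdb files ending in .bdb):

```shell
#!/bin/sh
# Sketch of a BDB hot backup, following the documented procedure.
backup_bdb_env() {
    src=$1
    dst=$2
    mkdir -p "$dst" || return 1
    # Step 1: copy the database files while the environment stays live.
    cp "$src"/*.bdb "$dst"/ || return 1
    # Step 2: copy the transaction logs *after* the database files, so
    # the logs are at least as recent as the copied data pages.
    cp "$src"/log.* "$dst"/ || return 1
    # Step 3: bring the copy to a consistent state with catastrophic
    # recovery (skipped here if the BDB utilities are not on the PATH).
    if command -v db_recover >/dev/null 2>&1; then
        db_recover -c -h "$dst" || return 1
    fi
}
```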

Until today, I had been using 2.1.25 with relative success, except that 
checkpointing did not seem to take place correctly: with large writes, 
the log files would keep increasing in number until at some point 
(typically 202 log files, with log files limited to 10MB each) db_recover 
would no longer work. Restarting the LDAP master would resolve this issue 
(it would checkpoint, and reduce the number of active transaction log 
files), but doing so is rather undesirable.
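What I would have expected to keep the log count down is a periodic 
forced checkpoint followed by removal of logs no longer needed for 
recovery; something along these lines (a sketch using the stock BDB 
utility names rather than the slapd_db_* wrappers on my systems, and the 
helper name is hypothetical):

```shell
#!/bin/sh
# Sketch: force a checkpoint, then prune archived transaction logs.
checkpoint_and_trim() {
    dbenv=$1
    command -v db_checkpoint >/dev/null 2>&1 || {
        echo "db_checkpoint not found" >&2
        return 1
    }
    # -1 forces a checkpoint immediately, regardless of activity.
    db_checkpoint -1 -h "$dbenv" || return 1
    # -d removes log files that are no longer needed for recovery.
    db_archive -d -h "$dbenv"
}
```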

Now, it seems that at certain points in time, the database environment is 
not consistent. For example, I ran these commands, separated by about 5 
seconds, while doing large writes on the LDAP master:
[root@ldap2 root]# slapd_db_stat -d /var/lib/ldap/mail/dn2id.bdb
db_stat: DB->stat: DB_PAGE_NOTFOUND: Requested page not found
[root@ldap2 root]# slapd_db_stat -d /var/lib/ldap/mail/dn2id.bdb
db_stat: DB->stat: DB_PAGE_NOTFOUND: Requested page not found
[root@ldap2 root]# slapd_db_stat -d /var/lib/ldap/mail/dn2id.bdb
53162   Btree magic number.
9       Btree version number.
Flags:  duplicates, little-endian
2       Minimum keys per-page.
4096    Underlying database page size.
3       Number of levels in the tree.
272193  Number of unique keys in the tree.
350377  Number of data items in the tree.
109     Number of tree internal pages.
175052  Number of bytes free in tree internal pages (61% ff).
7998    Number of tree leaf pages.
10M     Number of bytes free in tree leaf pages (69% ff).
198     Number of tree duplicate pages.
25960   Number of bytes free in tree duplicate pages (97% ff).
0       Number of tree overflow pages.
0       Number of bytes free in tree overflow pages (0% ff).
0       Number of pages on the free list.


To give you more information on what I am doing here:

Since I wanted to test these features under load, I am importing an 
existing database of approximately 150,000 entries used for qmail-ldap. 
I am running a hot backup script from cron (incidentally, I added it to 
the Mandrake openldap packages, so you can view it here: 
http://cvs.mandrakesoft.com/cgi-bin/cvsweb.cgi/SPECS/openldap/ldap-hot-db-backup 
)

So, I have a bdb database in /var/lib/ldap/mail, and the script places a 
backup in /var/lib/ldap/backup/mail every 15 minutes. 
ldapmaster:/var/lib/ldap/ is mounted on /var/lib/ldap/master on the slave 
(thus the hot backup appears in /var/lib/ldap/master/backup/mail), and I 
use the following script to do hot restores: 
http://cvs.mandrakesoft.com/cgi-bin/cvsweb.cgi/SPECS/openldap/ldap-reinitialise-slave
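In outline, the restore side does the reverse of the backup: replace the 
slave's environment with the backed-up copy, run normal recovery, and 
restart slapd. Roughly (a sketch only; the helper name and the guard 
around recovery are mine, not taken from the actual 
ldap-reinitialise-slave script):

```shell
#!/bin/sh
# Sketch of a hot restore: replace the slave's database environment with
# the latest hot backup and run recovery before slapd is restarted.
restore_bdb_env() {
    backup=$1
    dbdir=$2
    # Refuse to restore from a missing backup directory.
    [ -d "$backup" ] || return 1
    rm -rf "$dbdir" && mkdir -p "$dbdir" || return 1
    cp "$backup"/* "$dbdir"/ || return 1
    # Normal (non-catastrophic) recovery on the restored environment,
    # skipped here if the BDB utilities are not on the PATH.
    if command -v db_recover >/dev/null 2>&1; then
        db_recover -h "$dbdir" || return 1
    fi
}
```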

So, while the import is running, I run the following on the slave:

while true; do
    date
    /usr/share/openldap/scripts/ldap-reinitialise-slave -v3
    date
    service ldap restart
    sleep 30
    ldapsearch -x -b cn=mail,ou=isp -h localhost -LLL dn -z10 \
        2>/dev/null | grep ^dn | wc -l
    ldapsearch -x -b "ou=radius,o=intekom,c=za" -h localhost -LLL dn -z10 \
        2>/dev/null | grep ^dn | wc -l
    sleep 30
done

When db_stat does not return an answer for dn2id.bdb, my hot backups 
fail, and restores fail even more miserably (yes, I should do a bit more 
error checking ... but it wasn't necessary on 2.1.25).
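The obvious extra error checking would be to probe the environment with 
db_stat before starting a backup, and bail out (leaving the previous 
backup intact) if the probe fails. For example (a sketch, using the 
slapd_db_stat wrapper name as on my systems; the helper name is mine):

```shell
#!/bin/sh
# Sketch: only attempt a hot backup when db_stat can read the database,
# guarding against the DB_PAGE_NOTFOUND window shown above.
safe_to_backup() {
    dbfile=$1
    # slapd_db_stat exits non-zero (printing e.g. DB_PAGE_NOTFOUND)
    # when the environment is in an inconsistent state.
    slapd_db_stat -d "$dbfile" >/dev/null 2>&1
}
```

The backup script would then start with something like 
"safe_to_backup /var/lib/ldap/mail/dn2id.bdb || exit 1".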

Now, this isn't a huge problem in itself, as the next hot backup seems to 
succeed (if I remove the lock file manually for now), but it would seem 
to indicate deeper problems. On the other hand, under 2.1.29 I don't see 
the problem with checkpointing: the backup usually contains only 1 
transaction log, rather than the 202 I saw under 2.1.25.

Unfortunately, the systems I am testing on now are due to be used for 
other applications quite soon, so I will not have much time to debug 
this. Additionally, I need to decide between 2.1.25 and 2.1.29 within 
about 2 days, for a system that will most likely run with minimal 
changes, and hopefully minimal maintenance, for 2 years or more.

Well, I will leave the tests running overnight again, and see what I find 
in the morning ...

Regards,
Buchan