delta-syncrepl problems with 2.4.12



I'm upgrading a site from OpenLDAP 2.3.42 to 2.4.12 in an attempt to
alleviate http://www.openldap.org/its/index.cgi/Incoming?id=5631 (slapd
crashes due to assertion failure).

For us, doing so requires dumping and recreating two back-bdb databases
since the OpenLDAP 2.4.x Debian packaging is linked against a newer version
of BDB. The larger database contains about a million entries.

Instead of slapcat(8)/slapadd(8)ding the old databases, we're removing the
existing databases and allowing slapd(8) to delta-syncrepl a copy from
scratch. Ironing out this use case is especially important for us since we
expect to be adding a number of consumers in the coming months and would
obviously prefer to bring them online without having to shut down any other
slapd instances for slapcat(8)ting. The Administrator's Guide seems to
indicate this is an accepted use case, since its guide to bringing up a new
consumer involves simply configuring the consumer and starting slapd.

When the consumer slapd comes up, it enters the refresh(?) phase and begins
adding entries to the fresh, empty bdb database. When it finishes,
contextCSN on the suffix entry is set to 20081111135024Z#000000#00#000000
(roughly when slapd was started) and this change is visible with
ldapsearch(1).
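
For anyone comparing contextCSN values between provider and consumer, here
is a minimal sketch that splits a CSN like the one above into its parts. It
assumes the usual timestamp#count#sid#mod layout with hexadecimal counters;
parse_csn and the CSN tuple are names made up for this example, not anything
shipped with OpenLDAP.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class CSN(NamedTuple):
    """Hypothetical container for the four '#'-separated CSN fields."""
    timestamp: datetime
    count: int
    sid: int
    mod: int


def parse_csn(value: str) -> CSN:
    # Assumed layout: <timestamp>#<count>#<sid>#<mod>
    stamp, count, sid, mod = value.split("#")
    # Accept timestamps with or without fractional seconds.
    fmt = "%Y%m%d%H%M%S.%fZ" if "." in stamp else "%Y%m%d%H%M%SZ"
    ts = datetime.strptime(stamp, fmt).replace(tzinfo=timezone.utc)
    # Assumption: the three counters are hexadecimal.
    return CSN(ts, int(count, 16), int(sid, 16), int(mod, 16))


if __name__ == "__main__":
    csn = parse_csn("20081111135024Z#000000#00#000000")
    print(csn.timestamp.isoformat(), csn.count, csn.sid, csn.mod)
```

Since the tuple starts with the timestamp, two parsed values can be compared
directly to see which side is ahead.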

At this point, slapd seems to start processing the accesslog. The first
entry references a nonexistent DN (uid=nava209,...) and the backend
operation returns LDAP_NO_SUCH_OBJECT. This is interesting, since this entry
was created months ago and should have been found and created during the
refresh phase. ldapsearch(1)ing against the provider with the same filter
used by the consumer syncrepl ('(objectclass=*)') yields this entry, so it
doesn't appear to be index corruption on the provider.

At this point, several hundred subsequent search entries are discarded,
possibly due to the be_modify operation failing?

After some time, slapd continues processing entries and does so successfully
until it encounters another error (a modrdn that returns LDAP_ALREADY_EXISTS
since the accesslog entry that modrdn'd the existing object out of the way
was ignored by the consumer). After a while, slapd starts processing the
same batch of modifications again, and repeats until the retry counter is
exhausted. contextCSN on the suffix entry is never updated during this
process, based on debugging output and ldapsearch(1).
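
To illustrate why one ignored accesslog entry can make a later modrdn fail,
here is a toy model of the replay. It is only a sketch: the DNs, the replay
function, and the result strings are invented for the example and have
nothing to do with how slapd actually implements this.

```python
def replay(entries, log, skip=()):
    """Apply (old_dn, new_dn) renames in order against a set of DNs.

    Toy model only. 'skip' holds the indexes of log entries the consumer
    ignores, standing in for the discarded accesslog entries.
    """
    dit = set(entries)
    results = []
    for i, (old, new) in enumerate(log):
        if i in skip:
            results.append((i, "SKIPPED"))
        elif old not in dit:
            results.append((i, "NO_SUCH_OBJECT"))
        elif new in dit:
            results.append((i, "ALREADY_EXISTS"))
        else:
            dit.remove(old)
            dit.add(new)
            results.append((i, "SUCCESS"))
    return dit, results


if __name__ == "__main__":
    entries = {"uid=a", "uid=b"}
    # First rename moves uid=b out of the way, second one takes its place.
    log = [("uid=b", "uid=c"), ("uid=a", "uid=b")]
    print(replay(entries, log)[1])            # full replay
    print(replay(entries, log, skip={0})[1])  # first entry ignored
```

With the first entry skipped, the second rename collides with the entry that
was never moved, which matches the failure described above.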

It's interesting that two consumers have successfully delta-syncrepl'd
complete databases from scratch without experiencing this problem. At least
four other consumer machines fail in this manner. There seems to be no rhyme
or reason as to which machines succeed or fail; they're all running the same
binaries, same OS release and patches, some are even on the same Ethernet
segment as the provider. The provider slapd has been up consistently
(without a crash or restart) during at least two attempts.

Syncrepl (level 16384) debug output, sans ~400Mbytes of entry processing
during the refresh phase, is at:

  http://horde.net/~jwm/slapd-syncrepl-debug

john
-- 
John Morrissey          _o            /\         ----  __o
jwm@horde.net        _-< \_          /  \       ----  <  \,
www.horde.net/    __(_)/_(_)________/    \_______(_) /_(_)__