[Date Prev][Date Next] [Chronological] [Thread] [Top]

(ITS#7113) broken read-only replica results in assertion failure and crash on master slapd process

To: openldap-its@OpenLDAP.org
Subject: (ITS#7113) broken read-only replica results in assertion failure and crash on master slapd process
From: kacarstensen@csupomona.edu
Date: Mon, 19 Dec 2011 19:18:50 GMT
Auto-submitted: auto-generated (OpenLDAP-ITS)

Full_Name: Kevan Carstensen
Version: 2.4.28
OS: gentoo linux
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (134.71.248.23)


During a heavy update load, replication from our master to one of our read-only
ldap replicas started failing due to locking errors on the replica.

Dec 14 04:01:08 pip-dev slapd[12645]: bdb(dc=csupomona,dc=edu): Lock table is
out of available lock entries
Dec 14 04:01:08 pip-dev slapd[12645]: => bdb_idl_delete_key: c_get failed:
Cannot allocate memory (12)
Dec 14 04:01:08 pip-dev slapd[12645]: conn=-1 op=0: attribute "memberUid" index
delete failure
Dec 14 04:01:08 pip-dev slapd[12645]: null_callback : error code 0x50
Dec 14 04:01:08 pip-dev slapd[12645]: syncrepl_entry: rid=001 be_modify failed
(80)
Dec 14 04:01:08 pip-dev slapd[12645]: do_syncrepl: rid=001 rc 80 retrying

(these are due to an incorrect DB_CONFIG on the replica and are not related to
this issue; I include them in case their impact on how/when the replica connects
to the master are important)

After about ten minutes of the replica trying and failing to start its syncrepl
again, the master slapd crashed. Nothing in the master slapd logs specifically
references the crash or indicates why it happened. Shortly after the replica
broke, some unusual messages started showing up in the logs on the master:

Dec 14 03:50:11 fosse slapd[22596]: connection_read(65): no connection!
Dec 14 03:50:22 fosse slapd[22596]: send_search_entry: conn 7535053  ber write
failed.
Dec 14 03:50:32 fosse slapd[22596]: send_search_entry: conn 7535061  ber write
failed.
Dec 14 03:50:42 fosse slapd[22596]: send_search_entry: conn 7535065  ber write
failed.
Dec 14 03:50:56 fosse slapd[22596]: send_search_entry: conn 7535068  ber write
failed.
Dec 14 03:51:06 fosse slapd[22596]: connection_read(65): no connection!
Dec 14 03:51:06 fosse slapd[22596]: connection_read(65): no connection!

These continued until the crash, persisted after the master slapd was
successfully restarted, and went away when I turned the broken replica off.

I found similar behavior in a development setup similar to our production setup.
There is one master, and there are two read-only replicas configured to
replicate from the master with the following syncrepl stanza:

syncrepl        rid=1
                provider=ldaps://master-2.ldap.csupomona.edu/
                type=refreshAndPersist
                retry="10 10 60 +"
                searchbase="dc=csupomona,dc=edu"
                bindmethod=simple
                binddn=cn=slapd-syncrepl,ou=user,ou=service,dc=csupomona,dc=edu
                credentials=__PASSWORD__

(there are two replicas because I couldn't reliably produce a crash with one
replica)

The replicas have a DB_CONFIG that causes their replication attempts to fail
with the same error as we saw in our production setup. The master database is
initialized so that the replicas have a lot of changes to catch up on, as was
the case when our production setup failed. Even with loglevel -1, there's
nothing specifically referencing the crash in the master slapd's logs. There's
an assertion failure in the backtrace:

slapd: daemon.c:942: slapd_clr_write: Assertion
`((slap_daemon[id].sd_index[(s)]) != -1)' failed.

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffea863700 (LWP 27623)]
0x00007ffff6a95175 in raise () from /lib64/libc.so.6
(gdb) bt full
#0  0x00007ffff6a95175 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007ffff6a96590 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x00007ffff6a8e201 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#3  0x000000000042ec9b in slapd_clr_write (s=<value optimized out>, wake=0) at
daemon.c:942
        id = 0
        __PRETTY_FUNCTION__ = "slapd_clr_write"
#4  0x000000000043419b in connection_write (s=17) at connection.c:1896
        c = <value optimized out>
        op = <value optimized out>
        __PRETTY_FUNCTION__ = "connection_write"
#5  0x000000000043006a in slapd_daemon_task (ptr=<value optimized out>) at
daemon.c:2782
        rc = 1
        fd = 17
        w = 6
        ns = 1
        at = 0
        revents = <value optimized out>
        tvp = <value optimized out>
        cat = {tv_sec = 1324320980, tv_usec = 0}
        i = 0
        now = <value optimized out>
        tv = {tv_sec = 1135, tv_usec = 0}
        rtask = <value optimized out>
        l = <value optimized out>
        last_idle_check = 1324317380
        ebadf = 0
        tid = <value optimized out>
#6  0x00007ffff71ed914 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#7  0x00007ffff6b347dd in clone () from /lib64/libc.so.6
No symbol table info available.

The master and replicas in my development setup all run 2.4.28. The production
master runs 2.4.26. I successfully reproduced the issue on both 2.4.26 and
2.4.28, but only have a backtrace for 2.4.28. If you want a backtrace for
2.4.26, I can get one.

Thanks.

Prev by Date: Re: (ITS#7104) refreshOnly syncrepl out of sync during bulk update/delete operations
Next by Date: Re: (ITS#7107) cn=config syncrepl fails with "controls require LDAPv3" on bound v3 connection
Index(es):
- Chronological
- Thread