
slurpd 2.1.22+db4.1.25+RH7.2-3 blocks slave's threads (too many context switches, x86)



Any help or pointers would be welcome.
Thanks in advance.

I recently migrated from OpenLDAP 2.0 to 2.1.
We have 6 backends with 90,000, 29,000, 5,000, 2,000, 1,500 and 50 entries.
Search response is very good (~16 ms) on the 2x Xeon (IBM 345) and a bit higher on the 2x PIII (HP TC3100) with 2 GB RAM, under Red Hat 7.3 (kernel 2.4.21).


The very *big* problem comes with replication.
Several days ago we posted another mail ("CPU usage and #ITS2383 (sched_yield())", 2003/08/19). Several people told us they have the same problem.


The tests we made are with:
Red Hat 7.2 and 7.3 (kernels 2.4.19, 2.4.20, 2.4.21), ext3 filesystem, bdb-4.1.25,
one or two slaves, 6 backends, on Xeon or PIII dual-processor machines. We run db_recover before any test.
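For reference, "run db_recover before any test" means something like the following (the bdb4 prefix matches our build; the database directory is only an example):

  # with the slapd that owns the environment stopped:
  /usr/local/etc2/bdb4/bin/db_recover -v -h /var/lib/ldap/test1
  # repeated for each backend's database directory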


We can run slurpd in one-shot mode several times against the slave with batches of several sizes (1, 12, 100, 20,000, 30,000 updates), but after 2 or 3 runs at most slurpd breaks the connection to the slave (tcpdump also shows this).
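By "one-shot mode" we mean invoking slurpd with -o against a saved copy of the replog, roughly like this (paths are only illustrative):

  /usr/local/etc2/openldap_2_1/libexec/slurpd \
      -f /usr/local/etc2/openldap_2_1/etc/openldap/slapd.conf \
      -r /tmp/replog.copy -o -d 2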

With RH 7.2 the one-shot mode runs more times without clogging the slave or slaves; a high debug level in slurpd also yields a bit more stability (deadlock timing?).
When slurpd runs as a daemon, the freeze needs only a handful of lines in the replication log (i.e. an ldapmodify with 12 entries or fewer may be enough; an example of such a small change set is sketched below)... perhaps once or twice the ldapmodify's changes are replicated, but no more.
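A small change set like the one that triggers the freeze looks roughly like this (the host, bind DN and entry are only placeholders for our real ones):

  $ cat mod12.ldif
  dn: uid=user1,o=test1,dc=unav,dc=es
  changetype: modify
  replace: description
  description: replication test
  (... 11 more entries like this ...)

  $ ldapmodify -x -H ldap://master.unav.es \
        -D "cn=admin,o=test1,dc=unav,dc=es" -W -f mod12.ldif

Twelve entries of this kind are already enough to clog the slave when slurpd runs as a daemon.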


The symptoms:
1. On the master:
- slapd is not affected at all: good response and stability. Fine.
- slurpd needs a kill -9 to stop, or the slave slapd must be killed before slurpd can be stopped.
- slurpd waits quietly forever (debug level 2):
ldap_write: want=269, written=269
0000: 30 82 01 09 02 02 00 e1 66 81 e5 04 20 75 69 64 0.......f... ...
...
0100: 2e 31 31 33 37 33 30 2e 33 2e 34 2e 32 .113730.3.4.2
(silent)



2. On the slave (one or two replicas):
- slapd answers searches at a very variable speed: from 18 s down to 24 ms.
- high CPU usage (~100%) on all processors (user ~35%, system ~65%).
- very high context-switch rate in vmstat (system cs ~46,000).
- tcpdump shows that the connection between slurpd and the slave was cut (other connections remain: clients, services, and from the LDAP master, but not the slurpd one); see the sketch after this list.
- after restarting the slave slapd (and running db_recover), slurpd reconnects and replays the remaining replog perfectly: this *always* goes well after a slave slapd restart.
- also, a big script (20,000 mods) may sometimes run very well (CPU 96% idle) with slurpd as a daemon, *but* perhaps because the debug level (-d 2) slows down the response time? The high debug level in slurpd seems to yield more stability...
- the slave slapd is *more robust* on RH 7.2 + 2.4.20 than on RH 7.3 + 2.4.21.
- when slurpd freezes the slave slapd's threads, any ldapmodify against the slave (as the replicator) gets no response: it freezes on the very first entry.
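The context-switch rate and the dropped slurpd connection were observed with ordinary tools, along these lines (eth0 and the standard LDAP port are just our typical values):

  vmstat 1                      # the "cs" column stays around 46,000/s while the slave is clogged
  tcpdump -n -i eth0 port 389   # client and master connections remain, the slurpd one disappears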



I think slurpd drives the slave's threads in such a way that it loses control or the connection. This symptom is reproducible with an ldapmodify script while slurpd is up. When slurpd is called in one-shot mode, the time between runs may allow the threads to synchronize properly, but not always.
It sounds to me as if starting the replication process on the slave may generate a deadlock at the very beginning of the replication. My impression is that short batches of mods clog the slave slapd more easily than long ones (too fast?).


Some additional data:

configure script:
---------------------------------------------------
env CPPFLAGS="-I/usr/local/etc2/openssl/include -I/usr/local/etc2/bdb4/include" \
LDFLAGS="-L/usr/local/etc2/openssl/lib -L/usr/local/etc2/bdb4/lib" \
./configure --with-tls --with-threads --enable-crypt --enable-monitor --prefix=/usr/local/etc2/openldap_2_1


---- RH 7.2, kernel 2.4.20 -----
ldd -v /usr/local/etc2/openldap_2_1/libexec/slapd
        libpam.so.0 => /lib/libpam.so.0 (0x4001d000)
        libcrypt.so.1 => /lib/libcrypt.so.1 (0x40025000)
        libresolv.so.2 => /lib/libresolv.so.2 (0x40052000)
        libdl.so.2 => /lib/libdl.so.2 (0x40064000)
        libpthread.so.0 => /lib/i686/libpthread.so.0 (0x40068000)
        libc.so.6 => /lib/i686/libc.so.6 (0x4007d000)
        /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)

---- RH 7.3, kernel 2.4.21------------
ldd -v /usr/local/etc2/openldap_2_1/libexec/slapd
        libcrypt.so.1 => /lib/libcrypt.so.1 (0x40017000)
        libresolv.so.2 => /lib/libresolv.so.2 (0x40045000)
        libdl.so.2 => /lib/libdl.so.2 (0x40056000)
        libpthread.so.0 => /lib/i686/libpthread.so.0 (0x40059000)
        libc.so.6 => /lib/i686/libc.so.6 (0x42000000)
        /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)

-- one of the DB_CONFIGs --
#set_flags DB_TXN_NOSYNC
set_cachesize 0  134217728 1
set_lg_regionmax 262144
set_lg_bsize 2097152
--------
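Given the deadlock suspicion above, one candidate experiment (not in our current DB_CONFIG) would be to enable BDB's automatic deadlock detection:

  # candidate addition, not yet in our DB_CONFIG:
  set_lk_detect DB_LOCK_DEFAULT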

We also have, in slapd.conf:
  checkpoint  512 720
and
  cachesize   5000 (set to the backend's size in entries)
-----------
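In slapd.conf those directives sit inside each bdb database section, roughly like this (the directory path is only an example):
------------------------
database        bdb
suffix          "o=test3,dc=unav,dc=es"
directory       /var/lib/ldap/test3
cachesize       5000
checkpoint      512 720
------------------------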

backend suffixes in slapd.conf:
-----------------------
o=test1,dc=unav,dc=es
o=test2,dc=unav,dc=es
o=test3,dc=unav,dc=es
o=test4,dc=unav,dc=es
o=test5,dc=unav,dc=es
------------------------
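For completeness, the master side of each backend carries the usual replica/replogfile directives, along these lines (host name, bind DN and credentials are placeholders):
------------------------
database        bdb
suffix          "o=test1,dc=unav,dc=es"
directory       /var/lib/ldap/test1
replogfile      /var/lib/ldap/test1.replog
replica         host=slave1.unav.es:389
                binddn="cn=replicator,o=test1,dc=unav,dc=es"
                bindmethod=simple credentials=secret
------------------------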

Thanks,
Ignacio

--
____________________________________________________
Ignacio Coupeau, Ph.D.     icoupeau@unav.es
CTI, Director              icoupeau@alumni.unav.es
University of Navarra      icoupeau@ieee.org
Pamplona, SPAIN            http://www.unav.es/cti/