Full_Name: John Borwick
Version: 2.2.13
OS: Red Hat Workstation 3
URL: http://www.wfu.edu/~borwicjh/examples/openldap-2.2.13-segfault/
Submission from: (NULL) (152.17.53.226)

First, thanks very much for OpenLDAP! 2.2 seems really fast!

I'm running openldap 2.2.13 with BDB 4.2.52. Both BDB patches have been applied, along with some crazy patches from Red Hat. Maybe that's a problem, I don't know.

After hitting the "o=WFU,c=US" backend (which rewrites to "ou=Users,dc=wfu,dc=edu") maybe 10000 times, as fast as possible, the server segfaults.

Here's a running count of the number of LDAP connections and the "backtrace full" output. Some symbols are missing; please let me know if this isn't enough data. We *did* compile with "--enable-ldap" and "--enable-rewrite".

Please see the URL http://www.wfu.edu/~borwicjh/examples/openldap-2.2.13-segfault/ for information on how to replicate.

Thank you very much!

John

-=-=-
while true; do lsof -i :389 | wc -l; sleep 2; done
0 2 2 129 264 663 1015 1017 1017 1017 1017 1017 1017 1017 1017 1017 1017 1017 1017 990 683 731 1017 1017 1017 1017 1017 1017 926 632 319 382 293 350 350 350 350 350 350 350 0 0 0
-=-=-
gdb servers/slapd/slapd core
-=-=-
#0 0x00000001 in ?? ()
No symbol table info available.
#1 <signal handler called>
No symbol table info available.
#2 0xb75ebc32 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
No symbol table info available.
#3 0xb737c8eb in __write_nocancel () from /lib/tls/libpthread.so.0
No symbol table info available.
#4 0x08108f95 in sb_stream_write (sbiod=0x81ef5b0, buf=0x8fb39950, len=94) at sockbuf.c:549
No locals.
#5 0x08109835 in sb_debug_write (sbiod=0x81ef5c8, buf=0x8fb39950, len=94) at sockbuf.c:846
        ret = -1884291056
#6 0x08108eb1 in ber_int_sb_write (sb=0x81ec6d8, buf=0x8fb39950, len=94) at sockbuf.c:433
        ret = -1884291056
#7 0x08105a9e in ber_flush (sb=0x81ec6d8, ber=0x8fb4ec90, freeit=0) at io.c:243
        towrite = 94
        rc = -1800410192
#8 0x080f0c54 in ldap_int_flush_request (ld=0x81eda20, lr=0x8fb4ed08) at request.c:166
        lc = (LDAPConn *) 0x81ef510
#9 0x080f0fad in ldap_send_server_request (ld=0x81eda20, ber=0x8fb4ec90, msgid=13991, parentreq=0x0, srvlist=0x0, lc=0x81ef510, bind=0x0) at request.c:294
        lr = (LDAPRequest *) 0x8fb4ed08
        incparent = 0
        rc = 0
#10 0x080f0bf7 in ldap_send_initial_request (ld=0x81eda20, msgtype=99, dn=0x8fb97ad8 "ou=Users,dc=wfu,dc=edu", ber=0x8fb4ec90, msgid=13991) at request.c:147
        servers = (LDAPURLDesc *) 0x0
        rc = 136239648
#11 0x080e2011 in ldap_search_ext (ld=0x81eda20, base=0x8fb97ad8 "ou=Users,dc=wfu,dc=edu", scope=2, filter=0x8b78cb00 "(|(cn=sue*)(mail=sue*)(sn=sue*))", attrs=0x0, attrsonly=0, sctrls=0x0, cctrls=0x0, timeout=0x94afd7a0, sizelimit=500, msgidp=0x94afd790) at search.c:110
        rc = 0
        ber = (BerElement *) 0x8fb4ec90
        timelimit = 3600
        id = 13991
#12 0x080b342a in ldap_back_search (op=0x8b3992c0, rs=0x94afe870) at search.c:143
        li = (struct ldapinfo *) 0x819a9b8
        lc = (struct ldapconn *) 0x81ee408
        tv = {tv_sec = 3600, tv_usec = 0}
        res = (LDAPMessage *) 0x8099845
        e = (LDAPMessage *) 0x94afd7c8
        rc = 0
        msgid = -1959161152
        match = {bv_len = 0, bv_val = 0x0}
        mapped_attrs = (char **) 0x0
        mbase = {bv_len = 22, bv_val = 0x8fb97ad8 "ou=Users,dc=wfu,dc=edu"}
        mfilter = {bv_len = 32, bv_val = 0x8b78cb00 "(|(cn=sue*)(mail=sue*)(sn=sue*))"}
        dontfreetext = 0
        dc = {rwmap = 0x819a9f4, conn = 0x96a9fc88, ctx = 0x8124e53 "searchBase", rs = 0x94afe870}
#13 0x0805cbab in do_search (op=0x8b3992c0, rs=0x94afe870) at search.c:400
        base = {bv_len = 10, bv_val = 0x86506e47 "o=WFU,c=US"}
        siz = 0
        off = 0
        i = 0
        manageDSAit = 0
        be_manageDSAit = 0
#14 0x0805a551 in connection_operation (ctx=0x94afe900, arg_v=0x8b3992c0) at connection.c:1042
        rc = -1025
        op = (Operation *) 0x8b3992c0
        rs = {sr_type = REP_RESULT, sr_tag = 0, sr_msgid = 0, sr_err = 0, sr_matched = 0x0, sr_text = 0x0, sr_ref = 0x0, sr_ctrls = 0x0, sr_un = {sru_sasl = {r_sasldata = 0x0}, sru_extended = {r_rspoid = 0x0, r_rspdata = 0x0}, sru_search = {r_entry = 0x0, r_attrs = 0x0, r_nentries = 0, r_v2ref = 0x0}}, sr_flags = 0}
        tag = 99
        oldtag = 99
        conn = (Connection *) 0x96a9fc88
        memctx = (void *) 0x8206058
        memctx_null = (void *) 0x0
        memsiz = 1048576
#15 0x080de3b6 in ldap_int_thread_pool_wrapper (xpool=0x8154fb8) at tpool.c:467
        pool = (struct ldap_int_thread_pool_s *) 0x8154fb8
        ctx = (ldap_int_thread_ctx_t *) 0x865f94b8
        ltc_key = {{ltk_key = 0x8097a48, ltk_data = 0x8206058, ltk_free = 0x8097a18 <sl_mem_destroy>}, {ltk_key = 0x81e4018, ltk_data = 0x13f, ltk_free = 0x80bc6d0 <bdb_locker_id_free>}, {ltk_key = 0x80af37d, ltk_data = 0x890fe008, ltk_free = 0x80af365 <search_stack_free>}, {ltk_key = 0x0, ltk_data = 0x0, ltk_free = 0} <repeats 29 times>}
        tid = 2494557104
        i = 734
        keyslot = 734
        hash = 734
#16 0xb7377dac in start_thread () from /lib/tls/libpthread.so.0
No symbol table info available.
#17 0xb7316a8a in clone () from /lib/tls/libc.so.6
No symbol table info available.
According to the link you sent, each instance of your application is trying to send 512 simultaneous requests to slapd:

perl load-test.pl --server=server-name --num-forks=512

Since slapd cannot handle more than 1024 file descriptors (as far as I know, because of an intrinsic limitation in glibc's select), you're likely to be exhausting system resources. The core dump you're showing is meaningless to me, because it shows the error occurring in an obscure and generic internal of glibc rather than in some specific part of slapd, starting from generic low-level I/O routines of libldap. Can you reproduce the problem with a more limited load?

p.

--
Pierangelo Masarati
mailto:pierangelo.masarati@sys-net.it
SysNet - via Dossi,8 27100 Pavia Tel: +390382573859 Fax: +390382476497
Also, note that if you submit a large number of simultaneous connections, those that exceed the number of available threads are queued and remain pending. I guess the sigsegv is a bug, and it would be nice to be able to track it down. I haven't been able to reproduce it on my system, so it might be something related to your setup, or at least something that depends on the rest of the environment. However, in your case, if you think your production system may come under high load, you might try increasing the number of available threads.

p.

> Pierangelo Masarati wrote:
>> According to the link you sent, each instance of your application
>> is trying to send 512 simultaneous requests to slapd:
>>
>> perl load-test.pl --server=server-name --num-forks=512
>>
>> Since slapd cannot handle more than 1024 file descriptors (as far
>> as I know, because of an intrinsic limitation in glibc's select),
>> you're likely to be exhausting system resources. The core dump
>> you're showing is meaningless to me, because it shows the error
>> occurring in an obscure and generic internal of glibc rather than
>> in some specific part of slapd, starting from generic low-level I/O
>> routines of libldap. Can you reproduce the problem with a more
>> limited load?
>
> Yes, with "--num-forks=32" run on each of two machines, the server still
> crashes with the same problem. It performed fine with "--num-forks=16"
> and "--num-forks=24".
>
> Please consider that the # of file descriptors is at least doubled,
> because the LDAP backend is being used for each request to rewrite from
> "o=WFU,c=US" to "ou=Users,dc=wfu,dc=edu".
>
> With "lsof" monitoring, the pattern seems to be
> 1. normal # conns
> 2. quickly increasing # conns
> 3. hanging until one or both processes killed
> 4. unresponsive until # connections goes down
> 5. normal # conns
> 6. a lockup
> 7. crash
>
> During testing, I may have found a better gdb backtrace, too! Check out
> the "__assert_fail" statement.
> Thank you very much!
>
> #0 0xb75ebc32 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
> No symbol table info available.
> #1 0xb7262a29 in raise () from /lib/tls/libc.so.6
> No symbol table info available.
> #2 0xb7264255 in abort () from /lib/tls/libc.so.6
> No symbol table info available.
> #3 0xb725c559 in __assert_fail () from /lib/tls/libc.so.6
> No symbol table info available.
> #4 0x0806ad32 in slap_op_free (op=0x8b6405d8) at operation.c:66
>         slap_empty_bv_dup = {bv_len = 2433723312, bv_val = 0xb7380b7c "xÚ"}
> #5 0x0807391e in do_abandon (op=0x8233d40, rs=0x910fa870) at abandon.c:107
>         id = 1971
>         o = (Operation *) 0x8b6405d8
>         i = 7
> #6 0x0805a591 in connection_operation (ctx=0x910fa900, arg_v=0x8233d40) at connection.c:1047
>         rc = -1025
>         op = (Operation *) 0x8233d40
>         rs = {sr_type = REP_RESULT, sr_tag = 0, sr_msgid = 0, sr_err = 0, sr_matched = 0x0, sr_text = 0x0, sr_ref = 0x0, sr_ctrls = 0x0, sr_un = {sru_sasl = {r_sasldata = 0x0}, sru_extended = {r_rspoid = 0x0, r_rspdata = 0x0}, sru_search = {r_entry = 0x0, r_attrs = 0x0, r_nentries = 0, r_v2ref = 0x0}}, sr_flags = 0}
>         tag = 80
>         oldtag = 80
>         conn = (Connection *) 0x96a9d088
>         memctx = (void *) 0x82d7930
>         memctx_null = (void *) 0x0
>         memsiz = 1048576
> #7 0x080de3b6 in ldap_int_thread_pool_wrapper (xpool=0x8154fc0) at tpool.c:467
>         pool = (struct ldap_int_thread_pool_s *) 0x8154fc0
>         ctx = (ldap_int_thread_ctx_t *) 0x90874260
>         ltc_key = {{ltk_key = 0x8097a48, ltk_data = 0x82d7930, ltk_free = 0x8097a18 <sl_mem_destroy>}, {ltk_key = 0x81ad600, ltk_data = 0x132, ltk_free = 0x80bc6d0 <bdb_locker_id_free>}, {ltk_key = 0x80af37d, ltk_data = 0x88dfd008, ltk_free = 0x80af365 <search_stack_free>}, {ltk_key = 0x0, ltk_data = 0x0, ltk_free = 0} <repeats 29 times>}
>         tid = 2433723312
>         i = 507
>         keyslot = 507
>         hash = 507
> #8 0xb7377dac in start_thread () from /lib/tls/libpthread.so.0
> No symbol table info available.
> #9 0xb7316a8a in clone () from /lib/tls/libc.so.6
> No symbol table info available.
>
> --
> John Borwick
> Systems Administrator
> Wake Forest University | web http://www.wfu.edu/~borwicjh
> Winston-Salem, NC, USA | GPG key ID 56D60872

--
Pierangelo Masarati
mailto:pierangelo.masarati@sys-net.it
SysNet - via Dossi,8 27100 Pavia Tel: +390382573859 Fax: +390382476497
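As a rough check of the descriptor arithmetic discussed in this exchange, the sketch below estimates how many sockets slapd would hold if every forked client were connected at once. It assumes one inbound socket per client plus one outbound socket opened by the LDAP backend (the "at least doubled" figure above), and it takes the 1024 limit as quoted in the thread. The function names are hypothetical, and real usage depends on how connections are actually opened and reused.

```python
# Figure quoted in the thread for a select()-based slapd; treated here as an assumption.
FD_LIMIT = 1024


def descriptors_needed(forks_per_machine, machines, fds_per_client=2):
    """Estimate sockets held by slapd if every forked client is connected at once.

    fds_per_client=2 models one inbound client socket plus one outbound
    socket from the LDAP backend (an assumption, not measured behaviour).
    """
    return forks_per_machine * machines * fds_per_client


def fits(forks_per_machine, machines, fds_per_client=2, limit=FD_LIMIT):
    """Return True if the estimate stays within the descriptor limit."""
    return descriptors_needed(forks_per_machine, machines, fds_per_client) <= limit


if __name__ == "__main__":
    # Fork counts mentioned in the thread, each run from two machines.
    for forks in (16, 24, 32, 512):
        need = descriptors_needed(forks, machines=2)
        status = "within limit" if need <= FD_LIMIT else "over limit"
        print(f"--num-forks={forks}: about {need} descriptors, {status}")
```

Under these assumptions, 32 forks on each of two machines needs only 128 descriptors, while 512 forks on each of two machines needs 2048. That suggests descriptor exhaustion alone would not account for the crash seen at "--num-forks=32".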
Pierangelo Masarati wrote:
> Also, note that if you submit a large number of simultaneous connections,
> those that exceed the number of available threads are queued and remain
> pending. I guess the sigsegv is a bug, and it would be nice to be able
> to track it down. I haven't been able to reproduce it on my system, so it
> might be something related to your setup, or at least something that
> depends on the rest of the environment. However, in your case, if you
> think your production system may come under high load, you might try
> increasing the number of available threads.
>
> p.

OK. I recompiled with "--enable-threads=no" and still get crashes. Should that eliminate threads as a problem?

If I do the "--num-forks=512" test with the two machines hitting the *BDB* backend, there is no crash. The entire test case completes fine. It seems that only the *LDAP* backend is causing a crash. (This could be due to something else, though, like a linear vs. exponential demand on resources.)
What's weird to me is that "libpthread" is still linked in even when "--enable-threads=no":

# ldd `which slapd`
        libdb-4.2.so => /usr/lib/libdb-4.2.so (0xb7500000)
        libsasl2.so.2 => /usr/lib/libsasl2.so.2 (0xb74ea000)
        libssl.so.4 => /lib/libssl.so.4 (0xb74b5000)
        libcrypto.so.4 => /lib/libcrypto.so.4 (0xb73c3000)
        libcrypt.so.1 => /lib/libcrypt.so.1 (0xb7396000)
        libresolv.so.2 => /lib/libresolv.so.2 (0xb7384000)
        libpthread.so.0 => /lib/tls/libpthread.so.0 (0xb7373000)
        libc.so.6 => /lib/tls/libc.so.6 (0xb723b000)
        libdl.so.2 => /lib/libdl.so.2 (0xb7238000)
        libgssapi_krb5.so.2 => /usr/kerberos/lib/libgssapi_krb5.so.2 (0xb7225000)
        libkrb5.so.3 => /usr/kerberos/lib/libkrb5.so.3 (0xb71c7000)
        libcom_err.so.3 => /usr/kerberos/lib/libcom_err.so.3 (0xb71c5000)
        libk5crypto.so.3 => /usr/kerberos/lib/libk5crypto.so.3 (0xb71b4000)
        libz.so.1 => /usr/lib/libz.so.1 (0xb71a6000)
        /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0xb75eb000)

Are there any potential incompatibilities with threading between OpenLDAP and these libraries? Are there other libraries I should recompile/upgrade/remove to do further testing?

Thank you very very much,

John

--
John Borwick
Systems Administrator
Wake Forest University | web http://www.wfu.edu/~borwicjh
Winston-Salem, NC, USA | GPG key ID 56D60872
> Pierangelo Masarati wrote:
>> Also, note that if you submit a large number of simultaneous
>> connections, those that exceed the number of available threads are
>> queued and remain pending. I guess the sigsegv is a bug, and it would
>> be nice to be able to track it down. I haven't been able to reproduce
>> it on my system, so it might be something related to your setup, or at
>> least something that depends on the rest of the environment. However,
>> in your case, if you think your production system may come under high
>> load, you might try increasing the number of available threads.
>>
>> p.
>
> OK. I recompiled with "--enable-threads=no" and still get crashes.
> Should that eliminate threads as a problem?

There's no --enable-threads switch in OpenLDAP's configure. There's a --with-threads one.

In any case, I think back-ldap definitely needs threads and, provided the system threads are not buggy, their use with slapd should be relatively safe and beneficial in all cases. I think you just need to boost the number of simultaneous threads your slapd can handle. The default is 16, and if you want to deal with 512 simultaneous connections you could try "threads 64" or "threads 128" (if your hardware can stand it, i.e. you are using a 2/4 CPU system with a lot of RAM and overall good performance, including network bandwidth). Otherwise, you simply cannot accept so many simultaneous connections with your hardware, sigsegv or not.

> If I do the "--num-forks=512" test with the two machines hitting the
> *BDB* backend, there is no crash. The entire test case completes fine.
> It seems that only the *LDAP* backend is causing a crash. (This could
> be due to something else, though, like a linear vs. exponential demand
> on resources.)

Are the back-ldap and back-bdb in the same slapd? If not, are they on the same machine?
On my laptop (a much older RH 7.1), when I try such an intensive test, the system runs out of file descriptors well before 128 simultaneous processes are started, and slapd hangs after a while. However, when I kill the requests and the machine load decreases a bit, slapd (slowly) goes back into service. I used your config file, and I hit a test database containing a few tens of entries, but this should not be an issue.

p.

> What's weird to me is that "libpthread" is still linked in even when
> "--enable-threads=no":
>
> # ldd `which slapd`
>         libdb-4.2.so => /usr/lib/libdb-4.2.so (0xb7500000)
>         libsasl2.so.2 => /usr/lib/libsasl2.so.2 (0xb74ea000)
>         libssl.so.4 => /lib/libssl.so.4 (0xb74b5000)
>         libcrypto.so.4 => /lib/libcrypto.so.4 (0xb73c3000)
>         libcrypt.so.1 => /lib/libcrypt.so.1 (0xb7396000)
>         libresolv.so.2 => /lib/libresolv.so.2 (0xb7384000)
>         libpthread.so.0 => /lib/tls/libpthread.so.0 (0xb7373000)
>         libc.so.6 => /lib/tls/libc.so.6 (0xb723b000)
>         libdl.so.2 => /lib/libdl.so.2 (0xb7238000)
>         libgssapi_krb5.so.2 => /usr/kerberos/lib/libgssapi_krb5.so.2 (0xb7225000)
>         libkrb5.so.3 => /usr/kerberos/lib/libkrb5.so.3 (0xb71c7000)
>         libcom_err.so.3 => /usr/kerberos/lib/libcom_err.so.3 (0xb71c5000)
>         libk5crypto.so.3 => /usr/kerberos/lib/libk5crypto.so.3 (0xb71b4000)
>         libz.so.1 => /usr/lib/libz.so.1 (0xb71a6000)
>         /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0xb75eb000)
>
> Are there any potential incompatibilities with threading between
> OpenLDAP and these libraries? Are there other libraries I should
> recompile/upgrade/remove to do further testing?
>
> Thank you very very much,

--
Pierangelo Masarati
mailto:pierangelo.masarati@sys-net.it
SysNet - via Dossi,8 27100 Pavia Tel: +390382573859 Fax: +390382476497
Pierangelo Masarati wrote:
>> OK. I recompiled with "--enable-threads=no" and still get crashes.
>> Should that eliminate threads as a problem?
>
> There's no --enable-threads switch in OpenLDAP's configure. There's a
> --with-threads one.

Whoops!

> In any case, I think back-ldap definitely needs threads and, provided the
> system threads are not buggy, their use with slapd should be relatively
> safe and beneficial in all cases. I think you just need to boost the
> number of simultaneous threads your slapd can handle. The default is 16,
> and if you want to deal with 512 simultaneous connections you could try
> "threads 64" or "threads 128" (if your hardware can stand it, i.e. you are
> using a 2/4 CPU system with a lot of ram and overall good performance,
> including network bandwidth). Otherwise, you cannot simply accept so many
> simultaneous connections with your hardware, sigsegv or not.

Excellent. With "threads 128" our dual CPU (+SMP) machine handled the 4500 test queries without crashing!

We've actually been having our production LDAP server crash all the time (at least once a day) due to what I'm hoping is this problem. I'm going to increase the number of threads there and see if that helps.

Here is a theory for you:

Does the BDB backend accept queries only as fast as it can actually resolve them, whereas the LDAP backend accepts queries as soon as they are received and starts queuing them up?

--
John Borwick
Systems Administrator
Wake Forest University | web http://www.wfu.edu/~borwicjh
Winston-Salem, NC, USA | GPG key ID 56D60872
> Pierangelo Masarati wrote:
>>> OK. I recompiled with "--enable-threads=no" and still get crashes.
>>> Should that eliminate threads as a problem?
>>
>> There's no --enable-threads switch in OpenLDAP's configure. There's a
>> --with-threads one.
>
> Whoops!
>
>> In any case, I think back-ldap definitely needs threads and, provided
>> the system threads are not buggy, their use with slapd should be
>> relatively safe and beneficial in all cases. I think you just need to
>> boost the number of simultaneous threads your slapd can handle. The
>> default is 16, and if you want to deal with 512 simultaneous
>> connections you could try "threads 64" or "threads 128" (if your
>> hardware can stand it, i.e. you are using a 2/4 CPU system with a lot
>> of ram and overall good performance, including network bandwidth).
>> Otherwise, you cannot simply accept so many simultaneous connections
>> with your hardware, sigsegv or not.
>
> Excellent. With "threads 128" our dual CPU (+SMP) machine handled the
> 4500 test queries without crashing!
>
> We've actually been having our production LDAP server crash all the time
> (at least once a day) due to what I'm hoping is this problem. I'm going
> to increase the number of threads there and see if that helps.
>
> Here is a theory for you:
>
> Does the BDB backend accept queries only as fast as it can actually
> resolve them, whereas the LDAP backend accepts queries as soon as they
> are received and start queuing them up?

I might not be the most appropriate person to answer your question; as far as I can tell, the frontend accepts connections, and concurrently handles as many connections as threads are available in the main pool (that's one of the reasons for not compiling --without-threads...). Connections are handled by calling backends as appropriate. Back-bdb has to do some work, while back-ldap forwards requests and waits for the response.
I guess if the remote server is not very responsive, back-ldap may submit too many concurrent requests and sit idle while they are being answered. At that point the frontend starts queuing further connections. In any case, the frontend that accepts and queues connections is the same for all backends, so at this level there should be no difference.

p.

--
Pierangelo Masarati
mailto:pierangelo.masarati@sys-net.it
SysNet - via Dossi,8 27100 Pavia Tel: +390382573859 Fax: +390382476497
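The queuing behaviour described above can be illustrated with a small model; this is not slapd's code. A fixed pool of worker threads handles requests, each request simply sleeps to stand in for waiting on a remote server, and anything beyond the pool size waits in the executor's queue. All names and timings are illustrative assumptions.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(num_requests, num_threads, work_seconds=0.01):
    """Submit num_requests tasks to a pool of num_threads workers.

    Returns (completed, peak_in_flight), where peak_in_flight is the largest
    number of requests that were being handled at the same moment.
    """
    lock = threading.Lock()
    state = {"in_flight": 0, "peak": 0, "done": 0}

    def handle_request():
        with lock:
            state["in_flight"] += 1
            state["peak"] = max(state["peak"], state["in_flight"])
        # Stand-in for a backend that forwards a request and waits for the reply.
        time.sleep(work_seconds)
        with lock:
            state["in_flight"] -= 1
            state["done"] += 1

    # Requests beyond num_threads stay queued until a worker becomes free.
    # Leaving the "with" block waits for every submitted task to finish.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        for _ in range(num_requests):
            pool.submit(handle_request)

    return state["done"], state["peak"]


if __name__ == "__main__":
    for threads in (16, 128):
        done, peak = run_load(num_requests=512, num_threads=threads)
        print(f"threads={threads}: completed={done}, peak in flight={peak}")
```

In this model the peak number of in-flight requests never exceeds the pool size, so a larger pool lets more slow requests wait on the remote server concurrently instead of sitting in the queue. Whether that matches slapd's real behaviour under load is not something the sketch can show.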
changed notes
changed state Open to Closed
moved from Incoming to Archive.Incoming
could not reproduce; OS/resource exhaustion related?