Issue 6021 - slapo-pcache/back-ldap slapd crashes
Summary: slapo-pcache/back-ldap slapd crashes
Status: VERIFIED FIXED
Alias: None
Product: OpenLDAP
Classification: Unclassified
Component: slapd (show other issues)
Version: 2.4.15
Hardware: All All
: --- normal
Target Milestone: ---
Assignee: OpenLDAP project
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-03-13 10:31 UTC by toby@inf.ed.ac.uk
Modified: 2014-08-01 21:03 UTC (History)
0 users

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this issue.
Description toby@inf.ed.ac.uk 2009-03-13 10:31:16 UTC
Full_Name: Toby Blake
Version: 2.4.15
OS: Scientific Linux 5.2
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (129.215.197.95)


Hi there,

I have been testing slapo-pcache/back-ldap for some time now,
typically in student lab situations (to provide a localhost
proxycaching slapd with back-ldap connecting (via gssapi) to several
remote servers), initially using 2.3.x but I have recently moved to
using the latest 2.4.x versions, as it's clear that's where the
development effort is going.

I'm seeing quite a lot of crashes during my testing, with a few
different backtraces.  I don't know whether you would prefer I
submitted these separately for different backtraces, so I'm opening
this ITS to get the process started...

I have seen no way of reproducing these crashes.

I'm currently using:
- openldap-2.4.15 on scientific linux 5.2
- bdb 4.7.25 with 3 patches
- We build our own RPMs.  I have openldap-2.4.15 with no optimisation
  (-O0) for the purposes of debugging.

The most common backtrace I see is an assertion failure in
memory.c:152.  Here's an example:

Program terminated with signal 6, Aborted.
#0  0x00b8d402 in __kernel_vsyscall ()
(gdb) bt
#0  0x00b8d402 in __kernel_vsyscall ()
#1  0x009e8d20 in raise () from /lib/libc.so.6
#2  0x009ea631 in abort () from /lib/libc.so.6
#3  0x00a20e6b in __libc_message () from /lib/libc.so.6
#4  0x00a28b16 in _int_free () from /lib/libc.so.6
#5  0x00a2c070 in free () from /lib/libc.so.6
#6  0x081d5746 in ber_memfree_x (p=0xa6809478, ctx=0x0) at memory.c:152
#7  0x081d630c in ber_bvarray_free_x (a=0x9049bb8, ctx=0x0) at memory.c:731
#8  0x081d6343 in ber_bvarray_free (a=0x9049bb8) at memory.c:741
#9  0x08079757 in attr_clean (a=0xb62d41bc) at attr.c:134
#10 0x0807983b in attrs_free (a=0xb62d41bc) at attr.c:196
#11 0x0807c059 in entry_clean (e=0xb5acfdbc) at entry.c:504
#12 0x0811ec17 in ldap_back_search (op=0xa680e5d8, rs=0xb5ad10e0)
    at search.c:354
#13 0x080f2b02 in overlay_op_walk (op=0xa680e5d8, rs=0xb5ad10e0, 
    which=op_search, oi=0x8fa9208, on=0x0) at backover.c:669
#14 0x080f2d0a in over_op_func (op=0xa680e5d8, rs=0xb5ad10e0, which=op_search)
    at backover.c:721
#15 0x080f2dae in over_op_search (op=0xa680e5d8, rs=0xb5ad10e0)
    at backover.c:743
#16 0x080720d6 in fe_op_search (op=0xa680e5d8, rs=0xb5ad10e0) at search.c:366
#17 0x08071a2c in do_search (op=0xa680e5d8, rs=0xb5ad10e0) at search.c:217
#18 0x0806e705 in connection_operation (ctx=0xb5ad11d0, arg_v=0xa680e5d8)
    at connection.c:1097
#19 0x0806ebdf in connection_read_thread (ctx=0xb5ad11d0, argv=0x13)
    at connection.c:1223
#20 0x0819ecad in ldap_int_thread_pool_wrapper (xpool=0x8f86fa0) at tpool.c:663
#21 0x00b3346b in start_thread () from /lib/libpthread.so.0
#22 0x00a90dbe in clone () from /lib/libc.so.6
(gdb) 

I have seen quite a few of these backtraces, although sometimes coming
through slap_sl_free (and a couple of other places) instead of
ber_bvarray_free_x.

Let me know what further information I can provide.

Many thanks
Toby Blake
School of Informatics
University of Edinburgh
Comment 1 ando@openldap.org 2009-03-14 15:49:51 UTC
toby@inf.ed.ac.uk wrote:
> Full_Name: Toby Blake
> Version: 2.4.15
> OS: Scientific Linux 5.2
> URL: ftp://ftp.openldap.org/incoming/
> Submission from: (NULL) (129.215.197.95)
> 
> 
> Hi there,
> 
> I have been testing slapo-pcache/back-ldap for some time now,
> typically in student lab situations (to provide a localhost
> proxycaching slapd with back-ldap connecting (via gssapi) to several
> remote servers), initially using 2.3.x but I have recently moved to
> using the latest 2.4.x versions, as it's clear that's where the
> development effort is going.
> 
> I'm seeing quite a lot of crashes during my testing, with a few
> different backtraces.  I don't know whether you would prefer I
> submitted these separately for different backtraces, so I'm opening
> this ITS to get the process started...
> 
> I have seen no way of reproducing these crashes.
> 
> I'm currently using:
> - openldap-2.4.15 on scientific linux 5.2
> - bdb 4.7.25 with 3 patches
> - We build our own RPMs.  I have openldap-2.4.15 with no optimisation
>   (-O0) for the purposes of debugging.
> 
> The most common backtrace I see is an assertion failure in
> memory.c:152.  Here's an example:
> 
> Program terminated with signal 6, Aborted.
> #0  0x00b8d402 in __kernel_vsyscall ()
> (gdb) bt
> #0  0x00b8d402 in __kernel_vsyscall ()
> #1  0x009e8d20 in raise () from /lib/libc.so.6
> #2  0x009ea631 in abort () from /lib/libc.so.6
> #3  0x00a20e6b in __libc_message () from /lib/libc.so.6
> #4  0x00a28b16 in _int_free () from /lib/libc.so.6
> #5  0x00a2c070 in free () from /lib/libc.so.6
> #6  0x081d5746 in ber_memfree_x (p=0xa6809478, ctx=0x0) at memory.c:152
> #7  0x081d630c in ber_bvarray_free_x (a=0x9049bb8, ctx=0x0) at memory.c:731
> #8  0x081d6343 in ber_bvarray_free (a=0x9049bb8) at memory.c:741
> #9  0x08079757 in attr_clean (a=0xb62d41bc) at attr.c:134

The problem seems to be here; apparently, the a->a_nvals array is being 
incorrectly freed.  If you have those core files around, it would be 
interesting to see the contents of those attributes, e.g.

p *a
p a->a_desc[0]
p a->a_flags
p a->a_nvals[0]

from within frame 9.

> #10 0x0807983b in attrs_free (a=0xb62d41bc) at attr.c:196
> #11 0x0807c059 in entry_clean (e=0xb5acfdbc) at entry.c:504
> #12 0x0811ec17 in ldap_back_search (op=0xa680e5d8, rs=0xb5ad10e0)
>     at search.c:354

Also, it would be interesting to see the parameters of the search; e.g.

p op->o_req_ndn
p op->o_request.oq_search
p op->o_request.oq_search.rs_filterstr
p op->o_request.oq_search.rs_attrs[0] (unless NULL)

from within frame 12

> #13 0x080f2b02 in overlay_op_walk (op=0xa680e5d8, rs=0xb5ad10e0, 
>     which=op_search, oi=0x8fa9208, on=0x0) at backover.c:669
> #14 0x080f2d0a in over_op_func (op=0xa680e5d8, rs=0xb5ad10e0, which=op_search)
>     at backover.c:721
> #15 0x080f2dae in over_op_search (op=0xa680e5d8, rs=0xb5ad10e0)
>     at backover.c:743
> #16 0x080720d6 in fe_op_search (op=0xa680e5d8, rs=0xb5ad10e0) at search.c:366
> #17 0x08071a2c in do_search (op=0xa680e5d8, rs=0xb5ad10e0) at search.c:217
> #18 0x0806e705 in connection_operation (ctx=0xb5ad11d0, arg_v=0xa680e5d8)
>     at connection.c:1097
> #19 0x0806ebdf in connection_read_thread (ctx=0xb5ad11d0, argv=0x13)
>     at connection.c:1223
> #20 0x0819ecad in ldap_int_thread_pool_wrapper (xpool=0x8f86fa0) at tpool.c:663
> #21 0x00b3346b in start_thread () from /lib/libpthread.so.0
> #22 0x00a90dbe in clone () from /lib/libc.so.6
> (gdb) 
> 
> I have seen quite a few of these backtraces, although sometimes coming
> through slap_sl_free (and a couple of other places) instead of
> ber_bvarray_free_x.
> 
> Let me know what further information I can provide.

Finally, it would be great to see your slapd.conf.

p.


Ing. Pierangelo Masarati
OpenLDAP Core Team

SysNet s.r.l.
via Dossi, 8 - 27100 Pavia - ITALIA
http://www.sys-net.it
-----------------------------------
Office:  +39 02 23998309
Mobile:  +39 333 4963172
Fax:     +39 0382 476497
Email:   ando@sys-net.it
-----------------------------------

Comment 2 toby@inf.ed.ac.uk 2009-03-16 10:17:39 UTC
On 14 Mar 2009, at 15:49, Pierangelo Masarati wrote:

Hello,

Thanks for your quick reply.

>> (gdb) bt
>> #0  0x00b8d402 in __kernel_vsyscall ()
>> #1  0x009e8d20 in raise () from /lib/libc.so.6
>> #2  0x009ea631 in abort () from /lib/libc.so.6
>> #3  0x00a20e6b in __libc_message () from /lib/libc.so.6
>> #4  0x00a28b16 in _int_free () from /lib/libc.so.6
>> #5  0x00a2c070 in free () from /lib/libc.so.6
>> #6  0x081d5746 in ber_memfree_x (p=0xa6809478, ctx=0x0) at memory.c: 
>> 152
>> #7  0x081d630c in ber_bvarray_free_x (a=0x9049bb8, ctx=0x0) at  
>> memory.c:731
>> #8  0x081d6343 in ber_bvarray_free (a=0x9049bb8) at memory.c:741
>> #9  0x08079757 in attr_clean (a=0xb62d41bc) at attr.c:134
>
> The problem seems to be here; apparently, the a->a_nvals array is  
> being incorrectly freed.  If you have those core files around, it  
> would be interesting to see the contents of those attributes, e.g.
>
> p *a
> p a->a_desc[0]
> p a->a_flags
> p a->a_nvals[0]
>
> from within frame 9.


(gdb) up
#9  0x08079757 in attr_clean (a=0xb62d41bc) at attr.c:134
134				ber_bvarray_free( a->a_nvals );
(gdb) p *a
$1 = {a_desc = 0x8faa150, a_vals = 0x90487f0, a_nvals = 0x9049bb8,
   a_numvals = 631, a_flags = 0, a_next = 0x0}
(gdb) p a->a_desc[0]
$2 = {ad_next = 0x0, ad_type = 0x8fbba20, ad_cname = {bv_len = 9,
     bv_val = 0x8fba5d8 "memberUid"}, ad_tags = {bv_len = 0, bv_val =  
0x0},
   ad_flags = 0}
(gdb) p a->a_flags
$3 = 0
(gdb) p a->a_nvals[0]
$4 = {bv_len = 8, bv_val = 0x905aac0 "v1ggoodh"}
(gdb)


>> #10 0x0807983b in attrs_free (a=0xb62d41bc) at attr.c:196
>> #11 0x0807c059 in entry_clean (e=0xb5acfdbc) at entry.c:504
>> #12 0x0811ec17 in ldap_back_search (op=0xa680e5d8, rs=0xb5ad10e0)
>>    at search.c:354
>
> Also, it would be interesting to see the parameters of the search;  
> e.g.
>
> p op->o_req_ndn
> p op->o_request.oq_search
> p op->o_request.oq_search.rs_filterstr
> p op->o_request.oq_search.rs_attrs[0] (unless NULL)
>
> from within frame 12


(gdb) up
#12 0x0811ec17 in ldap_back_search (op=0xa680e5d8, rs=0xb5ad10e0)
     at search.c:354
354					entry_clean( &ent );
(gdb) p op->o_req_ndn
$5 = {bv_len = 24, bv_val = 0xb44ce1e4 "dc=inf,dc=ed,dc=ac,dc=uk"}
(gdb) p op->o_request.oq_search
$6 = {rs_scope = 2, rs_deref = 0, rs_slimit = 24576, rs_tlimit = 3600,
   rs_limit = 0x8fa88b8, rs_attrsonly = 0, rs_attrs = 0xb44ce904,
   rs_filter = 0xb44ce86c, rs_filterstr = {bv_len = 112,
     bv_val = 0xb44ce87c "(&(objectClass=posixGroup)(| 
(memberUid=vlavrenk) 
(uniqueMember=uid=vlavrenk,ou=people,dc=inf,dc=ed,dc=ac,dc=uk)))"}}
(gdb) p op->o_request.oq_search.rs_filterstr
$7 = {bv_len = 112,
   bv_val = 0xb44ce87c "(&(objectClass=posixGroup)(| 
(memberUid=vlavrenk) 
(uniqueMember=uid=vlavrenk,ou=people,dc=inf,dc=ed,dc=ac,dc=uk)))"}
(gdb) p op->o_request.oq_search.rs_attrs[0]
$8 = {an_name = {bv_len = 9, bv_val = 0x907b012 "gidNumber"},
   an_desc = 0x8f85930, an_oc_exclude = 0, an_oc = 0x0}
(gdb)


> Finally, it would be great to see your slapd.conf.


No problem, this is the database, back-ldap, pcache part of it:

database                ldap
suffix                  dc=inf,dc=ed,dc=ac,dc=uk
rootdn                  uid=ldaprep/ 
hostname.inf.ed.ac.uk,cn=inf.ed.ac.uk,cn=gssapi,cn=auth
uri                     "ldap://server1.inf.ed.ac.uk/,ldap:// 
server2.inf.ed.ac.uk"
idassert-bind           mode=none
                         bindmethod=sasl
                         saslmech=GSSAPI
idassert-authzFrom      "dn:*"

overlay                 pcache

proxycache              bdb 5000 4 500 300
proxycachequeries       10000

proxyattrset            0 cn userPassword memberUid uniqueMember  
gidNumber
proxyattrset            1 uid userPassword uidNumber gidNumber cn  
homeDirectory
loginShell gecos description objectClass afsHomeDirectory primaryRoles  
secondary
Roles
proxyattrset            2 cn nisNetgroupTriple memberNisNetgroup
proxyattrset            3 "*"

proxytemplate           (uid=) 1 1800 1800
proxytemplate           (&(objectClass=)(uid=)) 1 1800 1800
proxytemplate           (&(objectClass=)(cn=)) 0 1800 1800
proxytemplate           (&(objectClass=)(cn=)) 2 1800 1800
proxytemplate           (&(objectClass=)(|(memberUid=) 
(uniqueMember=))) 0 1800 1800
proxytemplate           (&(objectClass=)(uniqueMember=)) 0 1800 1800
proxytemplate           (&(objectClass=)(memberUid=)) 0 1800 1800
proxytemplate           (&(objectClass=)(uidNumber=)) 1 1800 1800
proxytemplate           (&(objectClass=)(gidNumber=)) 0 1800 1800
proxytemplate           (&(objectClass=)(amdmapName=)(amdmapKey=)) 3  
1800 1800
proxytemplate           (&(objectClass=)(uid=)) 3 1800 1800

directory               /var/openldap-data

checkpoint    128 60
cachesize    10000
idlcachesize    10000

dbconfig                set_cachesize 0 16777216 1
dbconfig                set_lg_regionmax 262144
dbconfig                set_lg_bsize 2097152


If there's anything else that would be useful, let me know.  As I  
mentioned, I have a number of other core files, including some that  
look the same as this one.

Cheers
Toby


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

Comment 3 ando@openldap.org 2009-03-16 19:33:40 UTC
toby@inf.ed.ac.uk wrote:
> On 14 Mar 2009, at 15:49, Pierangelo Masarati wrote:
> 
> Hello,
> 
> Thanks for your quick reply.
> 
>>> (gdb) bt
>>> #0  0x00b8d402 in __kernel_vsyscall ()
>>> #1  0x009e8d20 in raise () from /lib/libc.so.6
>>> #2  0x009ea631 in abort () from /lib/libc.so.6
>>> #3  0x00a20e6b in __libc_message () from /lib/libc.so.6
>>> #4  0x00a28b16 in _int_free () from /lib/libc.so.6
>>> #5  0x00a2c070 in free () from /lib/libc.so.6
>>> #6  0x081d5746 in ber_memfree_x (p=0xa6809478, ctx=0x0) at memory.c: 
>>> 152
>>> #7  0x081d630c in ber_bvarray_free_x (a=0x9049bb8, ctx=0x0) at  
>>> memory.c:731
>>> #8  0x081d6343 in ber_bvarray_free (a=0x9049bb8) at memory.c:741
>>> #9  0x08079757 in attr_clean (a=0xb62d41bc) at attr.c:134
>> The problem seems to be here; apparently, the a->a_nvals array is  
>> being incorrectly freed.  If you have those core files around, it  
>> would be interesting to see the contents of those attributes, e.g.
>>
>> p *a
>> p a->a_desc[0]
>> p a->a_flags
>> p a->a_nvals[0]
>>
>> from within frame 9.
> 
> 
> (gdb) up
> #9  0x08079757 in attr_clean (a=0xb62d41bc) at attr.c:134
> 134				ber_bvarray_free( a->a_nvals );
> (gdb) p *a
> $1 = {a_desc = 0x8faa150, a_vals = 0x90487f0, a_nvals = 0x9049bb8,
>    a_numvals = 631, a_flags = 0, a_next = 0x0}
> (gdb) p a->a_desc[0]
> $2 = {ad_next = 0x0, ad_type = 0x8fbba20, ad_cname = {bv_len = 9,
>      bv_val = 0x8fba5d8 "memberUid"}, ad_tags = {bv_len = 0, bv_val =  
> 0x0},
>    ad_flags = 0}
> (gdb) p a->a_flags
> $3 = 0
> (gdb) p a->a_nvals[0]
> $4 = {bv_len = 8, bv_val = 0x905aac0 "v1ggoodh"}
> (gdb)
> 
> 
>>> #10 0x0807983b in attrs_free (a=0xb62d41bc) at attr.c:196
>>> #11 0x0807c059 in entry_clean (e=0xb5acfdbc) at entry.c:504
>>> #12 0x0811ec17 in ldap_back_search (op=0xa680e5d8, rs=0xb5ad10e0)
>>>    at search.c:354
>> Also, it would be interesting to see the parameters of the search;  
>> e.g.
>>
>> p op->o_req_ndn
>> p op->o_request.oq_search
>> p op->o_request.oq_search.rs_filterstr
>> p op->o_request.oq_search.rs_attrs[0] (unless NULL)
>>
>> from within frame 12
> 
> 
> (gdb) up
> #12 0x0811ec17 in ldap_back_search (op=0xa680e5d8, rs=0xb5ad10e0)
>      at search.c:354
> 354					entry_clean( &ent );
> (gdb) p op->o_req_ndn
> $5 = {bv_len = 24, bv_val = 0xb44ce1e4 "dc=inf,dc=ed,dc=ac,dc=uk"}
> (gdb) p op->o_request.oq_search
> $6 = {rs_scope = 2, rs_deref = 0, rs_slimit = 24576, rs_tlimit = 3600,
>    rs_limit = 0x8fa88b8, rs_attrsonly = 0, rs_attrs = 0xb44ce904,
>    rs_filter = 0xb44ce86c, rs_filterstr = {bv_len = 112,
>      bv_val = 0xb44ce87c "(&(objectClass=posixGroup)(| 
> (memberUid=vlavrenk) 
> (uniqueMember=uid=vlavrenk,ou=people,dc=inf,dc=ed,dc=ac,dc=uk)))"}}
> (gdb) p op->o_request.oq_search.rs_filterstr
> $7 = {bv_len = 112,
>    bv_val = 0xb44ce87c "(&(objectClass=posixGroup)(| 
> (memberUid=vlavrenk) 
> (uniqueMember=uid=vlavrenk,ou=people,dc=inf,dc=ed,dc=ac,dc=uk)))"}
> (gdb) p op->o_request.oq_search.rs_attrs[0]
> $8 = {an_name = {bv_len = 9, bv_val = 0x907b012 "gidNumber"},
>    an_desc = 0x8f85930, an_oc_exclude = 0, an_oc = 0x0}
> (gdb)
> 
> 
>> Finally, it would be great to see your slapd.conf.
> 
> 
> No problem, this is the database, back-ldap, pcache part of it:
> 
> database                ldap
> suffix                  dc=inf,dc=ed,dc=ac,dc=uk
> rootdn                  uid=ldaprep/ 
> hostname.inf.ed.ac.uk,cn=inf.ed.ac.uk,cn=gssapi,cn=auth
> uri                     "ldap://server1.inf.ed.ac.uk/,ldap:// 
> server2.inf.ed.ac.uk"
> idassert-bind           mode=none
>                          bindmethod=sasl
>                          saslmech=GSSAPI
> idassert-authzFrom      "dn:*"
> 
> overlay                 pcache
> 
> proxycache              bdb 5000 4 500 300
> proxycachequeries       10000
> 
> proxyattrset            0 cn userPassword memberUid uniqueMember  
> gidNumber
> proxyattrset            1 uid userPassword uidNumber gidNumber cn  
> homeDirectory
> loginShell gecos description objectClass afsHomeDirectory primaryRoles  
> secondary
> Roles
> proxyattrset            2 cn nisNetgroupTriple memberNisNetgroup
> proxyattrset            3 "*"
> 
> proxytemplate           (uid=) 1 1800 1800
> proxytemplate           (&(objectClass=)(uid=)) 1 1800 1800
> proxytemplate           (&(objectClass=)(cn=)) 0 1800 1800
> proxytemplate           (&(objectClass=)(cn=)) 2 1800 1800
> proxytemplate           (&(objectClass=)(|(memberUid=) 
> (uniqueMember=))) 0 1800 1800
> proxytemplate           (&(objectClass=)(uniqueMember=)) 0 1800 1800
> proxytemplate           (&(objectClass=)(memberUid=)) 0 1800 1800
> proxytemplate           (&(objectClass=)(uidNumber=)) 1 1800 1800
> proxytemplate           (&(objectClass=)(gidNumber=)) 0 1800 1800
> proxytemplate           (&(objectClass=)(amdmapName=)(amdmapKey=)) 3  
> 1800 1800
> proxytemplate           (&(objectClass=)(uid=)) 3 1800 1800
> 
> directory               /var/openldap-data
> 
> checkpoint    128 60
> cachesize    10000
> idlcachesize    10000
> 
> dbconfig                set_cachesize 0 16777216 1
> dbconfig                set_lg_regionmax 262144
> dbconfig                set_lg_bsize 2097152
> 
> 
> If there's anything else that would be useful, let me know.  As I  
> mentioned, I have a number of other core files, including some that  
> look the same as this one.

I did my best to reproduce your setup, but I got no crashes.  One thing 
I'd like to ask is: I see some non-standard attributes in the pcache 
configuration.  Can you tell whether there might occur some schema 
inconsistencies between the remote server and the proxy?  Also, can you 
tell how often crashes occur, or any detail about the type of operation, 
the workload and so?  I also disabled slab allocation, so that true 
mallocs occur all times, and valgrind is not showing any memory issue.

p.


Ing. Pierangelo Masarati
OpenLDAP Core Team

SysNet s.r.l.
via Dossi, 8 - 27100 Pavia - ITALIA
http://www.sys-net.it
-----------------------------------
Office:  +39 02 23998309
Mobile:  +39 333 4963172
Fax:     +39 0382 476497
Email:   ando@sys-net.it
-----------------------------------

Comment 4 toby@inf.ed.ac.uk 2009-03-17 15:14:42 UTC
> I did my best to reproduce your setup, but I got no crashes.  One
> thing I'd like to ask is: I see some non-standard attributes in the
> pcache configuration.  Can you tell whether there might occur some
> schema inconsistencies between the remote server and the proxy?

No, they're the same - the schema files are distributed via RPM and
configured in slapd.conf with the same bit of configuration code.

> Also, can you tell how often crashes occur, or any detail about the
> type of operation, the workload and so?

The machines I'm currently testing with are all in a student lab,
running condor jobs when quiet, so a bit difficult to predict - I
haven't seen any discernible pattern so far - I've seen crashes on
busy machines and quiet ones.

For the core file we're looking at here, this machine had answered 9
queries in the second prior to the crash, but had been quiet for 3
minutes prior to that after it had removed stale entries from the
cache.  I can attach a portion of the slapd log if it would be useful.

As for how often the crashes occur, these 24 lab machines have been
running 2.4.15 built with no optimisation since 27th Feb and have
produced 18 core files in that time.  Of these, approximately half
have been assertion failures for free(p) in memory.c:152.

I've seen 3 cases of segfaults in pcache_filter_cmp, as mentioned in
ITS#5756, but I'll submit non-optimised backtraces to that ticket,
rather than confuse the issue here.


> I also disabled slab allocation, so that true mallocs occur all
> times, and valgrind is not showing any memory issue.


The difficulty is in trying to reproduce this - my own desktop machine
is running with the same setup and has never crashed, despite my best
efforts.

Cheers
Toby

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

Comment 5 toby@inf.ed.ac.uk 2009-04-03 14:30:14 UTC
Hi there,  I've been collecting a bit more information on these
crashes...

Since we started running 2.4.15:

26 core files, of these, 15 have crashed in memory.c:152

It looks like the majority of these were removing stale entries from
the cache when slapd crashed, which wasn't the case with the original
ticket I submitted.  Also, the last thing logged by slapd in quite a
few of these cases was along this lines of ...

Mar 30 20:50:16 albany slapd[14239]: DELETING ENTRY TEMPLATE=db78c810-0c2a-49de-96af-f5796bbe8ca3 
Mar 30 20:50:17 albany last message repeated 14 times

... I don't know whether this is useful?

Here's the details of the latest crash, slapd log and backtrace:

Mar 30 20:50:16 albany slapd[14239]: Lock CR index = 0x83a5140 
Mar 30 20:50:16 albany slapd[14239]: TEMPLATE 0x83a5140 QUERIES-- 0 
Mar 30 20:50:16 albany slapd[14239]: Unlock CR index = 0x83a5140 
Mar 30 20:50:16 albany slapd[14239]: DELETING ENTRY TEMPLATE=e7ccb20c-d263-4d6c-ba23-62afe55f0158 
Mar 30 20:50:16 albany slapd[14239]: STALE QUERY REMOVED, SIZE=1 
Mar 30 20:50:16 albany slapd[14239]: STORED QUERIES = 30 
Mar 30 20:50:16 albany slapd[14239]: STALE QUERY REMOVED, CACHE =22 entries 
Mar 30 20:50:16 albany slapd[14239]: Lock CR index = 0x83a4e90 
Mar 30 20:50:16 albany slapd[14239]: TEMPLATE 0x83a4e90 QUERIES-- 15 
Mar 30 20:50:16 albany slapd[14239]: Unlock CR index = 0x83a4e90 
Mar 30 20:50:16 albany slapd[14239]: STALE QUERY REMOVED, SIZE=0 
Mar 30 20:50:16 albany slapd[14239]: STORED QUERIES = 29 
Mar 30 20:50:16 albany slapd[14239]: STALE QUERY REMOVED, CACHE =22 entries 
Mar 30 20:50:16 albany slapd[14239]: Lock CR index = 0x83a4e90 
Mar 30 20:50:16 albany slapd[14239]: TEMPLATE 0x83a4e90 QUERIES-- 14 
Mar 30 20:50:16 albany slapd[14239]: Unlock CR index = 0x83a4e90 
Mar 30 20:50:16 albany slapd[14239]: STALE QUERY REMOVED, SIZE=0 
Mar 30 20:50:16 albany slapd[14239]: STORED QUERIES = 28 
Mar 30 20:50:16 albany slapd[14239]: STALE QUERY REMOVED, CACHE =22 entries 
Mar 30 20:50:16 albany slapd[14239]: Lock CR index = 0x83a4e90 
Mar 30 20:50:16 albany slapd[14239]: TEMPLATE 0x83a4e90 QUERIES-- 13 
Mar 30 20:50:16 albany slapd[14239]: Unlock CR index = 0x83a4e90 
Mar 30 20:50:16 albany slapd[14239]: STALE QUERY REMOVED, SIZE=0 
Mar 30 20:50:16 albany slapd[14239]: STORED QUERIES = 27 
Mar 30 20:50:16 albany slapd[14239]: STALE QUERY REMOVED, CACHE =22 entries 
Mar 30 20:50:16 albany slapd[14239]: Lock CR index = 0x83a4e90 
Mar 30 20:50:16 albany slapd[14239]: TEMPLATE 0x83a4e90 QUERIES-- 12 
Mar 30 20:50:16 albany slapd[14239]: Unlock CR index = 0x83a4e90 
Mar 30 20:50:16 albany slapd[14239]: STALE QUERY REMOVED, SIZE=0 
Mar 30 20:50:16 albany slapd[14239]: STORED QUERIES = 26 
Mar 30 20:50:16 albany slapd[14239]: STALE QUERY REMOVED, CACHE =22 entries 
Mar 30 20:50:16 albany slapd[14239]: Lock CR index = 0x83a4e90 
Mar 30 20:50:16 albany slapd[14239]: TEMPLATE 0x83a4e90 QUERIES-- 11 
Mar 30 20:50:16 albany slapd[14239]: Unlock CR index = 0x83a4e90 
Mar 30 20:50:16 albany slapd[14239]: STALE QUERY REMOVED, SIZE=0 
Mar 30 20:50:16 albany slapd[14239]: STORED QUERIES = 25 
Mar 30 20:50:16 albany slapd[14239]: STALE QUERY REMOVED, CACHE =22 entries 
Mar 30 20:50:16 albany slapd[14239]: Lock CR index = 0x83a4e90 
Mar 30 20:50:16 albany slapd[14239]: TEMPLATE 0x83a4e90 QUERIES-- 10 
Mar 30 20:50:16 albany slapd[14239]: Unlock CR index = 0x83a4e90 
Mar 30 20:50:16 albany slapd[14239]: STALE QUERY REMOVED, SIZE=0 
Mar 30 20:50:16 albany slapd[14239]: STORED QUERIES = 24 
Mar 30 20:50:16 albany slapd[14239]: STALE QUERY REMOVED, CACHE =22 entries 
Mar 30 20:50:16 albany slapd[14239]: Lock CR index = 0x83a4e90 
Mar 30 20:50:16 albany slapd[14239]: TEMPLATE 0x83a4e90 QUERIES-- 9 
Mar 30 20:50:16 albany slapd[14239]: Unlock CR index = 0x83a4e90 
Mar 30 20:50:16 albany slapd[14239]: STALE QUERY REMOVED, SIZE=0 
Mar 30 20:50:16 albany slapd[14239]: STORED QUERIES = 23 
Mar 30 20:50:16 albany slapd[14239]: STALE QUERY REMOVED, CACHE =22 entries 
Mar 30 20:50:16 albany slapd[14239]: Lock CR index = 0x83a4e90 
Mar 30 20:50:16 albany slapd[14239]: TEMPLATE 0x83a4e90 QUERIES-- 8 
Mar 30 20:50:16 albany slapd[14239]: Unlock CR index = 0x83a4e90 
Mar 30 20:50:16 albany slapd[14239]: STALE QUERY REMOVED, SIZE=0 
Mar 30 20:50:16 albany slapd[14239]: STORED QUERIES = 22 
Mar 30 20:50:16 albany slapd[14239]: STALE QUERY REMOVED, CACHE =22 entries 
Mar 30 20:50:16 albany slapd[14239]: Lock CR index = 0x83a4e90 
Mar 30 20:50:16 albany slapd[14239]: TEMPLATE 0x83a4e90 QUERIES-- 7 
Mar 30 20:50:16 albany slapd[14239]: Unlock CR index = 0x83a4e90 
Mar 30 20:50:16 albany slapd[14239]: STALE QUERY REMOVED, SIZE=0 
Mar 30 20:50:16 albany slapd[14239]: STORED QUERIES = 21 
Mar 30 20:50:16 albany slapd[14239]: STALE QUERY REMOVED, CACHE =22 entries 
Mar 30 20:50:16 albany slapd[14239]: Lock CR index = 0x83a4e90 
Mar 30 20:50:16 albany slapd[14239]: TEMPLATE 0x83a4e90 QUERIES-- 6 
Mar 30 20:50:16 albany slapd[14239]: Unlock CR index = 0x83a4e90 
Mar 30 20:50:16 albany slapd[14239]: STALE QUERY REMOVED, SIZE=0 
Mar 30 20:50:16 albany slapd[14239]: STORED QUERIES = 20 
Mar 30 20:50:16 albany slapd[14239]: STALE QUERY REMOVED, CACHE =22 entries 
Mar 30 20:50:16 albany slapd[14239]: Lock CR index = 0x83a4e90 
Mar 30 20:50:16 albany slapd[14239]: TEMPLATE 0x83a4e90 QUERIES-- 5 
Mar 30 20:50:16 albany slapd[14239]: Unlock CR index = 0x83a4e90 
Mar 30 20:50:16 albany slapd[14239]: STALE QUERY REMOVED, SIZE=0 
Mar 30 20:50:16 albany slapd[14239]: STORED QUERIES = 19 
Mar 30 20:50:16 albany slapd[14239]: STALE QUERY REMOVED, CACHE =22 entries 
Mar 30 20:50:16 albany slapd[14239]: Lock CR index = 0x83a4e90 
Mar 30 20:50:16 albany slapd[14239]: TEMPLATE 0x83a4e90 QUERIES-- 4 
Mar 30 20:50:16 albany slapd[14239]: Unlock CR index = 0x83a4e90 
Mar 30 20:50:16 albany slapd[14239]: STALE QUERY REMOVED, SIZE=0 
Mar 30 20:50:16 albany slapd[14239]: STORED QUERIES = 18 
Mar 30 20:50:16 albany slapd[14239]: STALE QUERY REMOVED, CACHE =22 entries 
Mar 30 20:50:16 albany slapd[14239]: Lock CR index = 0x83a4e90 
Mar 30 20:50:16 albany slapd[14239]: TEMPLATE 0x83a4e90 QUERIES-- 3 
Mar 30 20:50:16 albany slapd[14239]: Unlock CR index = 0x83a4e90 
Mar 30 20:50:16 albany slapd[14239]: STALE QUERY REMOVED, SIZE=0 
Mar 30 20:50:16 albany slapd[14239]: STORED QUERIES = 17 
Mar 30 20:50:16 albany slapd[14239]: STALE QUERY REMOVED, CACHE =22 entries 
Mar 30 20:50:16 albany slapd[14239]: Lock CR index = 0x83a4e90 
Mar 30 20:50:16 albany slapd[14239]: TEMPLATE 0x83a4e90 QUERIES-- 2 
Mar 30 20:50:16 albany slapd[14239]: Unlock CR index = 0x83a4e90 
Mar 30 20:50:16 albany slapd[14239]: STALE QUERY REMOVED, SIZE=0 
Mar 30 20:50:16 albany slapd[14239]: STORED QUERIES = 16 
Mar 30 20:50:16 albany slapd[14239]: STALE QUERY REMOVED, CACHE =22 entries 
Mar 30 20:50:16 albany slapd[14239]: Lock CR index = 0x83a4e90 
Mar 30 20:50:16 albany slapd[14239]: TEMPLATE 0x83a4e90 QUERIES-- 1 
Mar 30 20:50:16 albany slapd[14239]: Unlock CR index = 0x83a4e90 
Mar 30 20:50:16 albany slapd[14239]: STALE QUERY REMOVED, SIZE=0 
Mar 30 20:50:16 albany slapd[14239]: STORED QUERIES = 15 
Mar 30 20:50:16 albany slapd[14239]: STALE QUERY REMOVED, CACHE =22 entries 
Mar 30 20:50:16 albany slapd[14239]: Lock CR index = 0x83a4e90 
Mar 30 20:50:16 albany slapd[14239]: TEMPLATE 0x83a4e90 QUERIES-- 0 
Mar 30 20:50:16 albany slapd[14239]: Unlock CR index = 0x83a4e90 
Mar 30 20:50:16 albany slapd[14239]: STALE QUERY REMOVED, SIZE=0 
Mar 30 20:50:16 albany slapd[14239]: STORED QUERIES = 14 
Mar 30 20:50:16 albany slapd[14239]: STALE QUERY REMOVED, CACHE =22 entries 
Mar 30 20:50:16 albany slapd[14239]: Lock CR index = 0x83a4e28 
Mar 30 20:50:16 albany slapd[14239]: TEMPLATE 0x83a4e28 QUERIES-- 0 
Mar 30 20:50:16 albany slapd[14239]: Unlock CR index = 0x83a4e28 
Mar 30 20:50:16 albany slapd[14239]: DELETING ENTRY TEMPLATE=db78c810-0c2a-49de-96af-f5796bbe8ca3 
Mar 30 20:50:17 albany last message repeated 14 times


Program terminated with signal 6, Aborted.
#0  0x00185402 in __kernel_vsyscall ()
(gdb) bt
#0  0x00185402 in __kernel_vsyscall ()
#1  0x00476d20 in raise () from /lib/libc.so.6
#2  0x00478631 in abort () from /lib/libc.so.6
#3  0x004aee6b in __libc_message () from /lib/libc.so.6
#4  0x004b6b16 in _int_free () from /lib/libc.so.6
#5  0x004ba070 in free () from /lib/libc.so.6
#6  0x081d5746 in ber_memfree_x (p=0x841e688, ctx=0x0) at memory.c:152
#7  0x081d630c in ber_bvarray_free_x (a=0x843af58, ctx=0x0) at memory.c:731
#8  0x081d6343 in ber_bvarray_free (a=0x843af58) at memory.c:741
#9  0x0807979c in attr_clean (a=0xb62934ec) at attr.c:146
#10 0x0807983b in attrs_free (a=0xb62934ec) at attr.c:196
#11 0x0807c059 in entry_clean (e=0xb6727584) at entry.c:504
#12 0x0807c080 in entry_free (e=0xb6727584) at entry.c:514
#13 0x08140030 in bdb_entry_return (e=0xb6727584) at id2entry.c:229
#14 0x08134c0c in bdb_cache_delete_cleanup (cache=0x838956c, ei=0xb332a060)
     at cache.c:1316
#15 0x0813a824 in bdb_delete (op=0xb32fed5c, rs=0xb32febec) at delete.c:575
#16 0x08177be1 in remove_query_data (op=0xb32fed5c, rs=0xb32fecf8,
     query_uuid=0x83ff288) at pcache.c:1460
#17 0x0817af8f in consistency_check (ctx=0xb32ff1d0, arg=0x83f57e8)
     at pcache.c:2611
#18 0x0819ecad in ldap_int_thread_pool_wrapper (xpool=0x8366f90) at tpool.c:663
#19 0x005c746b in start_thread () from /lib/libpthread.so.0
#20 0x0051edbe in clone () from /lib/libc.so.6
(gdb)


Also, I'll include details of another type of crash that I've seen 3
times since running 2.4.15 in the lab, at the risk of overloading
information into this ticket.  I'm including it here as the slapd
behaviour at time of crash was similar to that noted above,
i.e. removing stale entries from the cache:

Mar 13 04:37:48 pemberton slapd[6767]: Lock CR index = 0x9ac9188 
Mar 13 04:37:48 pemberton slapd[6767]: TEMPLATE 0x9ac9188 QUERIES-- 2 
Mar 13 04:37:48 pemberton slapd[6767]: Unlock CR index = 0x9ac9188 
Mar 13 04:37:48 pemberton slapd[6767]: DELETING ENTRY TEMPLATE=febcaede-1a2f-42ef-af2e-9206f0a4f678 
Mar 13 04:37:48 pemberton slapd[6767]: STALE QUERY REMOVED, SIZE=1 
Mar 13 04:37:48 pemberton slapd[6767]: STORED QUERIES = 32 
Mar 13 04:37:48 pemberton slapd[6767]: STALE QUERY REMOVED, CACHE =19 entries 
Mar 13 04:37:48 pemberton slapd[6767]: Lock CR index = 0x9ac8eb0 
Mar 13 04:37:48 pemberton slapd[6767]: TEMPLATE 0x9ac8eb0 QUERIES-- 15 
Mar 13 04:37:48 pemberton slapd[6767]: Unlock CR index = 0x9ac8eb0 
Mar 13 04:37:48 pemberton slapd[6767]: STALE QUERY REMOVED, SIZE=0 
Mar 13 04:37:48 pemberton slapd[6767]: STORED QUERIES = 31 
Mar 13 04:37:48 pemberton slapd[6767]: STALE QUERY REMOVED, CACHE =19 entries 
Mar 13 04:37:48 pemberton slapd[6767]: Lock CR index = 0x9ac8eb0 
Mar 13 04:37:48 pemberton slapd[6767]: TEMPLATE 0x9ac8eb0 QUERIES-- 14 
Mar 13 04:37:48 pemberton slapd[6767]: Unlock CR index = 0x9ac8eb0 
Mar 13 04:37:48 pemberton slapd[6767]: STALE QUERY REMOVED, SIZE=0 
Mar 13 04:37:48 pemberton slapd[6767]: STORED QUERIES = 30 
Mar 13 04:37:48 pemberton slapd[6767]: STALE QUERY REMOVED, CACHE =19 entries 
Mar 13 04:37:48 pemberton slapd[6767]: Lock CR index = 0x9ac8eb0 
Mar 13 04:37:48 pemberton slapd[6767]: TEMPLATE 0x9ac8eb0 QUERIES-- 13 
Mar 13 04:37:48 pemberton slapd[6767]: Unlock CR index = 0x9ac8eb0 
Mar 13 04:37:48 pemberton slapd[6767]: STALE QUERY REMOVED, SIZE=0 
Mar 13 04:37:48 pemberton slapd[6767]: STORED QUERIES = 29 
Mar 13 04:37:48 pemberton slapd[6767]: STALE QUERY REMOVED, CACHE =19 entries 
Mar 13 04:37:48 pemberton slapd[6767]: Lock CR index = 0x9ac8eb0 
Mar 13 04:37:48 pemberton slapd[6767]: TEMPLATE 0x9ac8eb0 QUERIES-- 12 
Mar 13 04:37:48 pemberton slapd[6767]: Unlock CR index = 0x9ac8eb0 
Mar 13 04:37:48 pemberton slapd[6767]: STALE QUERY REMOVED, SIZE=0 
Mar 13 04:37:48 pemberton slapd[6767]: STORED QUERIES = 28 
Mar 13 04:37:48 pemberton slapd[6767]: STALE QUERY REMOVED, CACHE =19 entries 
Mar 13 04:37:48 pemberton slapd[6767]: Lock CR index = 0x9ac8eb0 
Mar 13 04:37:48 pemberton slapd[6767]: TEMPLATE 0x9ac8eb0 QUERIES-- 11 
Mar 13 04:37:48 pemberton slapd[6767]: Unlock CR index = 0x9ac8eb0 
Mar 13 04:37:48 pemberton slapd[6767]: STALE QUERY REMOVED, SIZE=0 
Mar 13 04:37:48 pemberton slapd[6767]: STORED QUERIES = 27 
Mar 13 04:37:48 pemberton slapd[6767]: STALE QUERY REMOVED, CACHE =19 entries 
Mar 13 04:37:48 pemberton slapd[6767]: Lock CR index = 0x9ac8eb0 
Mar 13 04:37:48 pemberton slapd[6767]: TEMPLATE 0x9ac8eb0 QUERIES-- 10 
Mar 13 04:37:48 pemberton slapd[6767]: Unlock CR index = 0x9ac8eb0 
Mar 13 04:37:48 pemberton slapd[6767]: STALE QUERY REMOVED, SIZE=0 
Mar 13 04:37:48 pemberton slapd[6767]: STORED QUERIES = 26 
Mar 13 04:37:48 pemberton slapd[6767]: STALE QUERY REMOVED, CACHE =19 entries 
Mar 13 04:37:48 pemberton slapd[6767]: Lock CR index = 0x9ac8e48 
Mar 13 04:37:48 pemberton slapd[6767]: TEMPLATE 0x9ac8e48 QUERIES-- 1 
Mar 13 04:37:48 pemberton slapd[6767]: Unlock CR index = 0x9ac8e48 
Mar 13 04:37:48 pemberton slapd[6767]: DELETING ENTRY TEMPLATE=3eff8b71-4ca6-4918-b760-c32023dbe215 
Mar 13 04:37:48 pemberton last message repeated 12 times
Mar 13 04:37:48 pemberton slapd[6767]: REMOVING TEMP ATTR : TEMPLATE=3eff8b71-4ca6-4918-b760-c32023dbe215 
Mar 13 04:37:48 pemberton last message repeated 2 times
Mar 13 04:37:48 pemberton slapd[6767]: STALE QUERY REMOVED, SIZE=13 
Mar 13 04:37:48 pemberton slapd[6767]: STORED QUERIES = 25 
Mar 13 04:37:48 pemberton slapd[6767]: STALE QUERY REMOVED, CACHE =6 entries 
Mar 13 04:37:48 pemberton slapd[6767]: Lock CR index = 0x9ac8da8

Program terminated with signal 11, Segmentation fault.
#0  0x081772d2 in remove_from_template (qc=0xaef209a8, template=0x9ac8da8)
     at pcache.c:1297
1297			qc->prev->next = qc->next;
(gdb) bt
#0  0x081772d2 in remove_from_template (qc=0xaef209a8, template=0x9ac8da8)
     at pcache.c:1297
#1  0x0817adc8 in consistency_check (ctx=0xaeeff1d0, arg=0x9b19860)
     at pcache.c:2594
#2  0x0819ecad in ldap_int_thread_pool_wrapper (xpool=0x9a8afa0) at tpool.c:663
#3  0x00bb046b in start_thread () from /lib/libpthread.so.0
#4  0x00b07dbe in clone () from /lib/libc.so.6
(gdb) 
(gdb) p qc->prev->next
Cannot access memory at address 0x54
(gdb) p qc->prev
$2 = (struct cached_query_s *) 0x30
(gdb) p *qc->prev
Cannot access memory at address 0x30
(gdb)


As always, if there's any more information that would be useful, let
me know...

Cheers
Toby

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

Comment 6 Howard Chu 2009-04-05 02:20:00 UTC
toby@inf.ed.ac.uk wrote:
> Hi there,  I've been collecting a bit more information on these
> crashes...
>
> Since we started running 2.4.15:
>
> 26 core files, of these, 15 have crashed in memory.c:152
>
> It looks like the majority of these were removing stale entries from
> the cache when slapd crashed, which wasn't the case with the original
> ticket I submitted.  Also, the last thing logged by slapd in quite a
> few of these cases was along this lines of ...
>
> Mar 30 20:50:16 albany slapd[14239]: DELETING ENTRY TEMPLATE=db78c810-0c2a-49de-96af-f5796bbe8ca3
> Mar 30 20:50:17 albany last message repeated 14 times
>
> ... I don't know whether this is useful?
>
> Here's the details of the latest crash, slapd log and backtrace:

> Program terminated with signal 6, Aborted.
> #0  0x00185402 in __kernel_vsyscall ()
> (gdb) bt

When a program aborts it will print an error message on stderr. It would be 
useful to have that message.

Can you please update to RE24, the 2.4.16 candidate, and see if the behavior 
has changed? Also, can you please be sure to compile without optimization (or 
with frame pointers intact; gcc -fno-omit-frame-pointer); there seem to be a 
few functions missing in these stack traces.

Also grab the latest overlays/pcache.c (1.168). It probably won't fix things, 
but it should make the crashes a little more obvious.

> #0  0x00185402 in __kernel_vsyscall ()
> #1  0x00476d20 in raise () from /lib/libc.so.6
> #2  0x00478631 in abort () from /lib/libc.so.6
> #3  0x004aee6b in __libc_message () from /lib/libc.so.6
> #4  0x004b6b16 in _int_free () from /lib/libc.so.6
> #5  0x004ba070 in free () from /lib/libc.so.6
> #6  0x081d5746 in ber_memfree_x (p=0x841e688, ctx=0x0) at memory.c:152
> #7  0x081d630c in ber_bvarray_free_x (a=0x843af58, ctx=0x0) at memory.c:731
> #8  0x081d6343 in ber_bvarray_free (a=0x843af58) at memory.c:741
> #9  0x0807979c in attr_clean (a=0xb62934ec) at attr.c:146
> #10 0x0807983b in attrs_free (a=0xb62934ec) at attr.c:196
> #11 0x0807c059 in entry_clean (e=0xb6727584) at entry.c:504
> #12 0x0807c080 in entry_free (e=0xb6727584) at entry.c:514
> #13 0x08140030 in bdb_entry_return (e=0xb6727584) at id2entry.c:229
> #14 0x08134c0c in bdb_cache_delete_cleanup (cache=0x838956c, ei=0xb332a060)
>       at cache.c:1316
> #15 0x0813a824 in bdb_delete (op=0xb32fed5c, rs=0xb32febec) at delete.c:575
> #16 0x08177be1 in remove_query_data (op=0xb32fed5c, rs=0xb32fecf8,
>       query_uuid=0x83ff288) at pcache.c:1460
> #17 0x0817af8f in consistency_check (ctx=0xb32ff1d0, arg=0x83f57e8)
>       at pcache.c:2611
> #18 0x0819ecad in ldap_int_thread_pool_wrapper (xpool=0x8366f90) at tpool.c:663
> #19 0x005c746b in start_thread () from /lib/libpthread.so.0
> #20 0x0051edbe in clone () from /lib/libc.so.6
> (gdb)
>
>
> Also, I'll include details of another type of crash that I've seen 3
> times since running 2.4.15 in the lab, at the risk of overloading
> information into this ticket.  I'm including it here as the slapd
> behaviour at time of crash was similar to that noted above,
> i.e. removing stale entries from the cache:

Yeah, based on the debug output it's probably related.

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Comment 7 toby@inf.ed.ac.uk 2009-04-07 12:33:24 UTC
> When a program aborts it will print an error message on stderr. It
> would be useful to have that message.

Long since disappeared for this case, I'm afraid.  I'll try and get it
in the future.

> Can you please update to RE24, the 2.4.16 candidate, and see if the
> behavior has changed? Also, can you please be sure to compile
> without optimization (or with frame pointers intact; gcc
> -fno-omit-frame-pointer); there seem to be a few functions missing
> in these stack traces.
>
> Also grab the latest overlays/pcache.c (1.168). It probably won't
> fix things, but it should make the crashes a little more obvious.

OK, I've built 2.4.16 (which I see has the new pcache.c) with no
optimisation (-O0 and -fno-omit-frame-pointer) and will put in place
in our test lab.

Cheers
Toby

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

Comment 8 toby@inf.ed.ac.uk 2009-04-09 13:19:07 UTC
OK, this is my first crash since moving to 2.4.16:

Slightly different from the previous ones, but is coming through
pcache.c:remove_from_template and, according to slapd log, was
removing queries from cache when crash occurred:

Apr  9 05:05:18 rockingham slapd[6662]: Lock CR index = 0x94e70f0 
Apr  9 05:05:18 rockingham slapd[6662]: TEMPLATE 0x94e70f0 QUERIES-- 5 
Apr  9 05:05:18 rockingham slapd[6662]: Unlock CR index = 0x94e70f0 
Apr  9 05:05:18 rockingham slapd[6662]: STALE QUERY REMOVED, SIZE=0 
Apr  9 05:05:18 rockingham slapd[6662]: STORED QUERIES = 26 
Apr  9 05:05:18 rockingham slapd[6662]: STALE QUERY REMOVED, CACHE =17
entries 
Apr  9 05:05:18 rockingham slapd[6662]: Lock CR index = 0x94e70f0 
Apr  9 05:05:18 rockingham slapd[6662]: TEMPLATE 0x94e70f0 QUERIES-- 4 
Apr  9 05:05:18 rockingham slapd[6662]: Unlock CR index = 0x94e70f0 
Apr  9 05:05:18 rockingham slapd[6662]: STALE QUERY REMOVED, SIZE=0 
Apr  9 05:05:18 rockingham slapd[6662]: STORED QUERIES = 25 
Apr  9 05:05:18 rockingham slapd[6662]: STALE QUERY REMOVED, CACHE =17
entries 
Apr  9 05:05:18 rockingham slapd[6662]: Lock CR index = 0x94e70f0 
Apr  9 05:05:18 rockingham slapd[6662]: TEMPLATE 0x94e70f0 QUERIES-- 3 
Apr  9 05:05:18 rockingham slapd[6662]: Unlock CR index = 0x94e70f0 
Apr  9 05:05:18 rockingham slapd[6662]: STALE QUERY REMOVED, SIZE=0 
Apr  9 05:05:18 rockingham slapd[6662]: STORED QUERIES = 24 
Apr  9 05:05:18 rockingham slapd[6662]: STALE QUERY REMOVED, CACHE =17
entries 
Apr  9 05:05:18 rockingham slapd[6662]: Lock CR index = 0x94e70f0 
Apr  9 05:05:18 rockingham slapd[6662]: TEMPLATE 0x94e70f0 QUERIES-- 2 
Apr  9 05:05:18 rockingham slapd[6662]: Unlock CR index = 0x94e70f0 
Apr  9 05:05:18 rockingham slapd[6662]: STALE QUERY REMOVED, SIZE=0 
Apr  9 05:05:18 rockingham slapd[6662]: STORED QUERIES = 23 
Apr  9 05:05:18 rockingham slapd[6662]: STALE QUERY REMOVED, CACHE =17
entries 
Apr  9 05:05:18 rockingham slapd[6662]: Lock CR index = 0x94e70f0 
Apr  9 05:05:18 rockingham slapd[6662]: TEMPLATE 0x94e70f0 QUERIES-- 1 
Apr  9 05:05:18 rockingham slapd[6662]: Unlock CR index = 0x94e70f0 
Apr  9 05:05:18 rockingham slapd[6662]: STALE QUERY REMOVED, SIZE=0 
Apr  9 05:05:18 rockingham slapd[6662]: STORED QUERIES = 22 
Apr  9 05:05:18 rockingham slapd[6662]: STALE QUERY REMOVED, CACHE =17
entries 
Apr  9 05:05:18 rockingham slapd[6662]: Lock CR index = 0x94e70f0 
Apr  9 05:05:18 rockingham slapd[6662]: TEMPLATE 0x94e70f0 QUERIES-- 0 
Apr  9 05:05:18 rockingham slapd[6662]: Unlock CR index = 0x94e70f0 
Apr  9 05:05:18 rockingham slapd[6662]: STALE QUERY REMOVED, SIZE=0 
Apr  9 05:05:18 rockingham slapd[6662]: STORED QUERIES = 21 
Apr  9 05:05:18 rockingham slapd[6662]: STALE QUERY REMOVED, CACHE =17
entries 
Apr  9 05:05:18 rockingham slapd[6662]: Lock CR index = 0x94e6eb0


Here's the backtrace:

Program terminated with signal 11, Segmentation fault.
#0  0x081764ed in pcache_query_cmp (v1=0xb1f0e900, v2=0x83e58955)
     at pcache.c:694
694    return pcache_filter_cmp( q1->first, q2->first );
(gdb) bt
#0  0x081764ed in pcache_query_cmp (v1=0xb1f0e900, v2=0x83e58955)
     at pcache.c:694
#1  0x08199b6d in tavl_delete (root=0x825d7d0, data=0xb1f0e900,
     fcmp=0x81764d8 <pcache_query_cmp>) at tavl.c:202
#2  0x08177ab4 in remove_from_template (qc=0xb1f0e900,
template=0x94e6eb0)
     at pcache.c:1302
#3  0x0817b57d in consistency_check (ctx=0xb44201d0, arg=0x9537c50)
     at pcache.c:2597
#4  0x0819fd01 in ldap_int_thread_pool_wrapper (xpool=0x94a8fa0) at
tpool.c:663
#5  0x009c546b in start_thread () from /lib/libpthread.so.0
#6  0x0091cdbe in clone () from /lib/libc.so.6
(gdb) 
(gdb) p q1->first
$3 = (Filter *) 0xb1f0d4c0
(gdb) p q2->first
Cannot access memory at address 0x83e58959

Let me know what else would be useful from this backtrace.

Cheers
Toby


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

Comment 9 Howard Chu 2009-04-10 01:02:15 UTC
toby@inf.ed.ac.uk wrote:
> OK, this is my first crash since moving to 2.4.16:
>
> Slightly different from the previous ones, but is coming through
> pcache.c:remove_from_template and, according to slapd log, was
> removing queries from cache when crash occurred:
>
> Apr  9 05:05:18 rockingham slapd[6662]: Lock CR index = 0x94e70f0
> Apr  9 05:05:18 rockingham slapd[6662]: TEMPLATE 0x94e70f0 QUERIES-- 5
> Apr  9 05:05:18 rockingham slapd[6662]: Unlock CR index = 0x94e70f0
> Apr  9 05:05:18 rockingham slapd[6662]: STALE QUERY REMOVED, SIZE=0
> Apr  9 05:05:18 rockingham slapd[6662]: STORED QUERIES = 26
> Apr  9 05:05:18 rockingham slapd[6662]: STALE QUERY REMOVED, CACHE =17
> entries
> Apr  9 05:05:18 rockingham slapd[6662]: Lock CR index = 0x94e70f0
> Apr  9 05:05:18 rockingham slapd[6662]: TEMPLATE 0x94e70f0 QUERIES-- 4
> Apr  9 05:05:18 rockingham slapd[6662]: Unlock CR index = 0x94e70f0
> Apr  9 05:05:18 rockingham slapd[6662]: STALE QUERY REMOVED, SIZE=0
> Apr  9 05:05:18 rockingham slapd[6662]: STORED QUERIES = 25
> Apr  9 05:05:18 rockingham slapd[6662]: STALE QUERY REMOVED, CACHE =17
> entries
> Apr  9 05:05:18 rockingham slapd[6662]: Lock CR index = 0x94e70f0
> Apr  9 05:05:18 rockingham slapd[6662]: TEMPLATE 0x94e70f0 QUERIES-- 3
> Apr  9 05:05:18 rockingham slapd[6662]: Unlock CR index = 0x94e70f0
> Apr  9 05:05:18 rockingham slapd[6662]: STALE QUERY REMOVED, SIZE=0
> Apr  9 05:05:18 rockingham slapd[6662]: STORED QUERIES = 24
> Apr  9 05:05:18 rockingham slapd[6662]: STALE QUERY REMOVED, CACHE =17
> entries
> Apr  9 05:05:18 rockingham slapd[6662]: Lock CR index = 0x94e70f0
> Apr  9 05:05:18 rockingham slapd[6662]: TEMPLATE 0x94e70f0 QUERIES-- 2
> Apr  9 05:05:18 rockingham slapd[6662]: Unlock CR index = 0x94e70f0
> Apr  9 05:05:18 rockingham slapd[6662]: STALE QUERY REMOVED, SIZE=0
> Apr  9 05:05:18 rockingham slapd[6662]: STORED QUERIES = 23
> Apr  9 05:05:18 rockingham slapd[6662]: STALE QUERY REMOVED, CACHE =17
> entries
> Apr  9 05:05:18 rockingham slapd[6662]: Lock CR index = 0x94e70f0
> Apr  9 05:05:18 rockingham slapd[6662]: TEMPLATE 0x94e70f0 QUERIES-- 1
> Apr  9 05:05:18 rockingham slapd[6662]: Unlock CR index = 0x94e70f0
> Apr  9 05:05:18 rockingham slapd[6662]: STALE QUERY REMOVED, SIZE=0
> Apr  9 05:05:18 rockingham slapd[6662]: STORED QUERIES = 22
> Apr  9 05:05:18 rockingham slapd[6662]: STALE QUERY REMOVED, CACHE =17
> entries
> Apr  9 05:05:18 rockingham slapd[6662]: Lock CR index = 0x94e70f0
> Apr  9 05:05:18 rockingham slapd[6662]: TEMPLATE 0x94e70f0 QUERIES-- 0
> Apr  9 05:05:18 rockingham slapd[6662]: Unlock CR index = 0x94e70f0
> Apr  9 05:05:18 rockingham slapd[6662]: STALE QUERY REMOVED, SIZE=0
> Apr  9 05:05:18 rockingham slapd[6662]: STORED QUERIES = 21
> Apr  9 05:05:18 rockingham slapd[6662]: STALE QUERY REMOVED, CACHE =17
> entries
> Apr  9 05:05:18 rockingham slapd[6662]: Lock CR index = 0x94e6eb0
>
>
> Here's the backtrace:
>
> Program terminated with signal 11, Segmentation fault.
> #0  0x081764ed in pcache_query_cmp (v1=0xb1f0e900, v2=0x83e58955)
>       at pcache.c:694
> 694    return pcache_filter_cmp( q1->first, q2->first );
> (gdb) bt
> #0  0x081764ed in pcache_query_cmp (v1=0xb1f0e900, v2=0x83e58955)
>       at pcache.c:694
> #1  0x08199b6d in tavl_delete (root=0x825d7d0, data=0xb1f0e900,
>       fcmp=0x81764d8<pcache_query_cmp>) at tavl.c:202
> #2  0x08177ab4 in remove_from_template (qc=0xb1f0e900,
> template=0x94e6eb0)
>       at pcache.c:1302
> #3  0x0817b57d in consistency_check (ctx=0xb44201d0, arg=0x9537c50)
>       at pcache.c:2597
> #4  0x0819fd01 in ldap_int_thread_pool_wrapper (xpool=0x94a8fa0) at
> tpool.c:663
> #5  0x009c546b in start_thread () from /lib/libpthread.so.0
> #6  0x0091cdbe in clone () from /lib/libc.so.6
> (gdb)
> (gdb) p q1->first
> $3 = (Filter *) 0xb1f0d4c0
> (gdb) p q2->first
> Cannot access memory at address 0x83e58959
>
> Let me know what else would be useful from this backtrace.

Nothing obvious is jumping out here. Can you run a few of these slapds with 
libefence or valgrind?

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Comment 10 toby@inf.ed.ac.uk 2009-04-13 07:26:50 UTC
On 10 Apr 2009, at 02:02, Howard Chu wrote:

> Nothing obvious is jumping out here. Can you run a few of these  
> slapds with libefence or valgrind?

I'll see what I can do.


>> (gdb) p q1->first
>> $3 = (Filter *) 0xb1f0d4c0
>> (gdb) p q2->first
>> Cannot access memory at address 0x83e58959
>
> print *q1
>
> up
> print *root

(gdb) p *q1
$1 = {filter = 0xa, first = 0xb1f0d4c0, qbase = 0x825d7d0, scope = 0,
  q_uuid = {bv_len = 2985341416, bv_val = 0x1d <Address 0x1d out of  
bounds>},
  q_sizelimit = 2147483647, qtemp = 0xb1f0d4c0, expiry_time = 136697880,
  next = 0xb1f10cd8, prev = 0xb1f031e0, lru_up = 0x1d, lru_down =  
0x69647062,
  rwlock = {__data = {__lock = 1852386930, __nr_readers = 1684352614,
      __readers_wakeup = 778264878, __writer_wakeup = 27509,
      __nr_readers_queued = 21, __nr_writers_queued = 2985295944,
      __flags = 72 'H', __shared = 0 '\0', __pad1 = 240 '?', __pad2 =  
177 '?',
      __writer = 0},
    __size = "r.inf.ed.ac.uk\000\000\025\000\000\000H\000?H\000? 
\000\000\000", __align = 1852386930}}
(gdb) up
#1  0x08199b6d in tavl_delete (root=0x825d7d0, data=0xb1f0e900,
    fcmp=0x81764d8 <pcache_query_cmp>) at tavl.c:202
202			side = fcmp( data, p->avl_data );
(gdb) p *root
$2 = (Avlnode *) 0x81d8ac4
(gdb) p **root
$3 = {avl_data = 0x83e58955, avl_link = {0x7d8318ec, 0x24750008},
  avl_bits = "?D", avl_bf = 36 '$'}
(gdb)

Cheers
Toby


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

Comment 11 Howard Chu 2009-06-23 00:26:18 UTC
changed notes
changed state Open to Closed
Comment 12 OpenLDAP project 2014-08-01 21:03:38 UTC
no feedback