
Re: SEGV on syncRepl provider (ITS#3296)



New results from watchmalloc and further SEGVs...

This has been recurring fairly often (every few hours or so); the provider
SEGVs following a "SRCH attr=* objectClass structuralObjectClass entryCSN"
search (i.e. one from a syncRepl consumer). One consumer in particular appears
to trigger the SEGV just by restarting its slapd. So far, I've got nine core
files to look through.

A couple of things to note:

1. At least three of the core files each have one thread with this stack trace:

#0  0x001c6dd8 in ber_bvarray_free_x (a=0x865e1028, ctx=0x0) at memory.c:727
#1  0x001c6ec0 in ber_bvarray_free (a=0x865e1028) at memory.c:736
#2  0x00081e60 in attr_free (a=0xed5f88) at attr.c:334
#3  0x00081ee8 in attrs_free (a=0xed5f88) at attr.c:59
#4  0x0014a538 in hdb_entry_return (e=0x4f5f18) at id2entry.c:169
#5  0x0013dbe8 in bdb_cache_lru_add () at index.c:324
#6  0x0013e5dc in hdb_cache_find_id (op=0x3f5478, tid=0x0, id=2012,
    eip=0xda33f968, islocked=0, locker=7, lock=0xda33f7fc) at cache.c:775
#7  0x0010e994 in hdb_do_search () at tools.c:288
#8  0x0010c52c in hdb_search () at tools.c:288
#9  0x00079958 in do_search (op=0x3f5478, rs=0xda3ffd58) at search.c:412
#10 0x00075e7c in connection_operation (ctx=0xda3ffe14, arg_v=0x3f5478)
    at connection.c:1073
#11 0x0017fb70 in ldap_int_thread_pool_wrapper (xpool=0x3394c0) at tpool.c:467


The line numbers are exactly the same in each of them.
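
For anyone wanting to repeat that comparison across a pile of core files
without eyeballing each one, here is a minimal Python sketch. It assumes each
thread's backtrace has already been saved as text (e.g. the output of gdb's
"bt"). The helper names signature() and group_by_signature() and the core file
names are made up for illustration; this is not part of OpenLDAP or gdb. It
ignores addresses and argument values and compares only function names and
file:line locations.

```python
import re
from collections import defaultdict

# A gdb frame line looks like:
#   #2  0x00081e60 in attr_free (a=0xed5f88) at attr.c:334
# The location may also sit on a continuation line ("    at connection.c:1073").
FRAME_RE = re.compile(r"^#\d+\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(\S+)\s*\(")
LOC_RE = re.compile(r"\bat\s+(\S+:\d+)\s*$")


def signature(backtrace):
    """Reduce one backtrace to a tuple of (function, file:line) pairs.

    Frames without source information (e.g. "from libdb-4.2.so") get None
    as their location.
    """
    frames = []
    for raw in backtrace.splitlines():
        line = raw.strip()
        m = FRAME_RE.match(line)
        if m:
            frames.append([m.group(1), None])
        loc = LOC_RE.search(line)
        if loc and frames and frames[-1][1] is None:
            frames[-1][1] = loc.group(1)
    return tuple((func, where) for func, where in frames)


def group_by_signature(traces):
    """Map each distinct signature to the list of trace names that share it."""
    groups = defaultdict(list)
    for name, text in traces.items():
        groups[signature(text)].append(name)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical input: two frames in the style of the trace above, with
    # invented pointer values for the second core.
    traces = {
        "core.1": "#0  0x00081e60 in attr_free (a=0xed5f88) at attr.c:334\n"
                  "#1  0x00081ee8 in attrs_free (a=0xed5f88) at attr.c:59\n",
        "core.2": "#0  0x00081e60 in attr_free (a=0xabc123) at attr.c:334\n"
                  "#1  0x00081ee8 in attrs_free (a=0xabc123) at attr.c:59\n",
    }
    for sig, names in group_by_signature(traces).items():
        print(len(names), "trace(s):", ", ".join(names))
        for func, where in sig:
            print("   ", func, where or "(no source info)")
```

Traces that land in the same group differ at most in addresses and argument
values, which is the sense of "same" used above.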


2. I thought it would be a good idea to run under a debugging allocator.
Since this is Solaris, I set watchmalloc(3MALLOC) to WATCH,RW and started
the provider slapd. I then started the consumer to cause the SEGV, which
gave a different stack trace:

  4 process 160335      0xfee1d608 in _poll () from /usr/lib/libc.so.1
  3 process 94799      0xfee1f334 in _lwp_wait () from /usr/lib/libc.so.1
  2 process 291407      0xfee758fc in __lwp_park ()
   from /usr/lib/libthread.so.1
* 1 process 225871      0xff296b2c in __bam_c_refresh ()
   from /usr/local/lib/libdb-4.2.so

I assume the parked threads are boring, so:

Thread 1 (process 225871    ):
#0  0xff296b2c in __bam_c_refresh () from /usr/local/lib/libdb-4.2.so
#1  0xff2e5de8 in __db_cursor_int () from /usr/local/lib/libdb-4.2.so
#2  0xff2f4b80 in __db_cursor () from /usr/local/lib/libdb-4.2.so
#3  0xff2f4b38 in __db_cursor_pp () from /usr/local/lib/libdb-4.2.so
#4  0x00145c10 in hdb_dn2id_parent (op=0x3f52d8, txn=0x0, ei=0xda33f5f8,
    idp=0xda33f5ac) at dn2id.c:811
#5  0x0013d558 in hdb_cache_find_parent (op=0x3f52d8, txn=0x0, id=1605,
    res=0xda33f968) at cache.c:389
#6  0x0013df34 in hdb_cache_find_id (op=0x3f52d8, tid=0x0, id=1605,
    eip=0xda33f968, islocked=0, locker=63, lock=0xda33f7fc) at cache.c:650
#7  0x0010e994 in hdb_do_search () at tools.c:288
#8  0x0010c52c in hdb_search () at tools.c:288
#9  0x00079958 in do_search (op=0x3f52d8, rs=0xda3ffd58) at search.c:412
#10 0x00075e7c in connection_operation (ctx=0xda3ffe14, arg_v=0x3f52d8)
    at connection.c:1073
#11 0x0017fb70 in ldap_int_thread_pool_wrapper (xpool=0x3394c0) at tpool.c:467



I can recompile Sleepycat's libdb with symbols if necessary, although I
suppose by that point I should be talking to them instead...