
Re: slapd-read hangs (ITS#3832)



ando@sys-net.it wrote:
> I've set the values after tuning the test on several architectures, but
> apparently the default cannot always be good.  I'm working on making the
> bind timeout configurable, so real deployments can be fine-tuned if
> required.  For the test, we can use "safe" defaults, e.g. very long
> timeouts, more threads, "nretries forever" and so on.  I wouldn't modify
> the testers since an error condition of that type should not occur; in
> this case, rather than a bug in the software it indicates a poorly
> designed test (it's my fault, sigh).

Well, the test drew attention to code with potential problems, so I 
think it has served its purpose. It shows that we still need to pay more 
attention to how back-ldap/back-meta perform under heavy load. back-ldap 
may benefit from being rewritten in a fully asynchronous manner, 
detaching operations the way syncprov's persistent searches are 
detached. That will free up the worker threads sooner so the frontend 
can do more work. (But if the only work there is to do is to ask 
back-ldap to talk to the slow server, nothing really is gained...?)

As for this particular ITS, I can now reproduce the slapd-read hang 
quite often (though not every time). Here is the stack trace from one 
of the problem threads:

KERNEL32! 7c802542()
ldap_pvt_thread_mutex_lock(void * * 0x033848f8) line 175 + 14 bytes
ldap_result(ldap * 0x03384880, int 4, int 0, timeval * 0x0322f854, ldapmsg * * 0x0322f870) line 117 + 12 bytes
ldap_back_search(slap_op * 0x00fa59b0, slap_rep * 0x0322fd8c) line 253 + 35 bytes
glue_op_search(slap_op * 0x00fa59b0, slap_rep * 0x0322fd8c) line 262 + 19 bytes
overlay_op_walk(slap_op * 0x00fa59b0, slap_rep * 0x0322fd8c, int 2, slap_overinfo * 0x00ee3af8, slap_overinst * 0x00ee3bf8) line 482 + 17 bytes
over_op_func(slap_op * 0x00fa59b0, slap_rep * 0x0322fd8c, int 2) line 542 + 28 bytes
over_op_search(slap_op * 0x00fa59b0, slap_rep * 0x0322fd8c) line 564 + 15 bytes
fe_op_search(slap_op * 0x00fa59b0, slap_rep * 0x0322fd8c) line 349 + 19 bytes
overlay_op_walk(slap_op * 0x00fa59b0, slap_rep * 0x0322fd8c, int 2, slap_overinfo * 0x00ef59a8, slap_overinst * 0x00000000) line 490 + 17 bytes
over_op_func(slap_op * 0x00fa59b0, slap_rep * 0x0322fd8c, int 2) line 542 + 28 bytes
over_op_search(slap_op * 0x00fa59b0, slap_rep * 0x0322fd8c) line 564 + 15 bytes
do_search(slap_op * 0x00fa59b0, slap_rep * 0x0322fd8c) line 219 + 19 bytes
connection_operation(void * 0x0322fde8, void * 0x00fa59b0) line 1061 + 18 bytes
ldap_int_thread_pool_wrapper(void * 0x0034c4c8) line 485 + 20 bytes
MSVCRT! 77c3a3b0()
KERNEL32! 7c80b50b()

Unfortunately, this is the only thread that's in the midst of an 
operation, but there are always two hung slapd-read processes. All the 
other worker threads are idle in the pool; nothing else is happening. 
That would indicate two things: one of the replies actually got lost, 
and a prior thread came through this code path and returned without 
unlocking ld->ld_res_mutex.
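
Purely as an illustration of that failure mode, and not the actual 
libldap code, here is a minimal sketch in C with pthreads. Everything 
in it (struct conn, wait_result_leaky, wait_result_safe, mutex_is_held) 
is a hypothetical name made up for the example; the real fault in 
ldap_result has not been located here. It assumes a default, 
non-recursive mutex. The first function returns early on an error path 
while still holding the lock, so the next caller blocks forever in the 
lock call, as in the trace. The second routes every exit through a 
single unlock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a connection handle guarded by a result
 * mutex (the role ld->ld_res_mutex plays in the trace). This is NOT
 * libldap's real structure. */
struct conn {
    pthread_mutex_t res_mutex;
    int have_reply;   /* nonzero when a reply is queued */
    int broken;       /* nonzero when the connection has failed */
};

void conn_init(struct conn *c, int have_reply, int broken)
{
    int rc = pthread_mutex_init(&c->res_mutex, NULL);
    assert(rc == 0);
    (void)rc;
    c->have_reply = have_reply;
    c->broken = broken;
}

/* Buggy variant: the error path returns while still holding res_mutex. */
int wait_result_leaky(struct conn *c)
{
    int rc;

    pthread_mutex_lock(&c->res_mutex);
    if (c->broken)
        return -1;                 /* BUG: mutex is never released */
    rc = c->have_reply ? 1 : 0;
    pthread_mutex_unlock(&c->res_mutex);
    return rc;
}

/* Fixed variant: every path leaves through one exit that unlocks. */
int wait_result_safe(struct conn *c)
{
    int rc;

    pthread_mutex_lock(&c->res_mutex);
    if (c->broken) {
        rc = -1;
        goto done;
    }
    rc = c->have_reply ? 1 : 0;
done:
    pthread_mutex_unlock(&c->res_mutex);
    return rc;
}

/* Returns 1 if the (non-recursive) mutex is currently held, 0 if free. */
int mutex_is_held(pthread_mutex_t *m)
{
    if (pthread_mutex_trylock(m) == 0) {
        pthread_mutex_unlock(m);
        return 0;
    }
    return 1;
}
```

Calling wait_result_leaky on a conn with broken set leaves res_mutex 
held, which mutex_is_held can confirm without blocking; the same call 
through wait_result_safe leaves it free. A check of this kind at the 
return points of the suspect function is one way to narrow down which 
exit path leaks the lock.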

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/