Re: (ITS#4943) tpool.c pause vs. finish



h.b.furuseth@usit.uio.no wrote:
> OK, here's where we stand now, with my TODOs, questions and some
> new issues.  3+ pages.  I thought this would be a brief message:-(
> Hopefully I'll get to some of it this weekend, but no promises.
> 
> 
> TODOs (unless someone protests) and questions:
> 
> * When a thread is finishing, make it go active for context_reset().
>   Should make slapi module destructors safer, and I don't see that it
>   can hurt.

I don't understand this.

> * I'd like to rename thread contexts to tasks(?) and user contexts back
>   to contexts (as in RE23).  The current terminology is confusing.

I don't care about this either way.
> 
> * Save and restore the old data for key slap_sasl_bind, and just in case
>   (I can't tell if it is needed) smbk5pwd.
> 
>   Might as well do it everywhere we have setkey - do something - delete
>   key in one function call.  Since setkey(non-null) isn't supposed to
>   overwrite existing keys, the change should not be able to hurt.  It
>   could protect something I didn't think of or a future slapd change.
> 
>   API: the smallest change would be to call pool_getkey() first,
>   but I prefer a pool_swapkey(ctx, key, &data, &kfree) function.
>   Then in RE24, maybe remove the &kfree parameter to pool_getkey().
> 
>   Could omit the &kfree parameter to swapkey() if new data was
>   guaranteed non-NULL, but that'd require a malloc in bdb (silly says
>   Howard) or passing data in a &union{integer, ptr} (cumbersome say I).

Never use malloc unless absolutely necessary. We have enough problems 
with heap fragmentation already.
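
For reference, the proposed swap could look roughly like the sketch
below.  Untested; I'm writing the ltu_key[] slot layout and the
ldap_pvt_thread_pool_keyfree_t typedef from memory, so treat the
details as illustrative:

    int
    pool_swapkey( void *xctx, void *key, void **datap,
        ldap_pvt_thread_pool_keyfree_t **kfreep )
    {
        ldap_int_thread_userctx_t *ctx = xctx;
        int i;

        for ( i = 0; i < MAXKEYS; i++ ) {
            if ( ctx->ltu_key[i].ltk_key == key ||
                ctx->ltu_key[i].ltk_key == NULL )
            {
                /* hand the old data/kfree pair back to the caller... */
                void *olddata = ctx->ltu_key[i].ltk_data;
                ldap_pvt_thread_pool_keyfree_t *oldkfree =
                    ctx->ltu_key[i].ltk_free;

                /* ...and install the new pair in its place */
                ctx->ltu_key[i].ltk_key  = key;
                ctx->ltu_key[i].ltk_data = *datap;
                ctx->ltu_key[i].ltk_free = *kfreep;
                *datap  = olddata;
                *kfreep = oldkfree;
                return 0;
            }
        }
        return -1;  /* key table full */
    }

The caller stashes the returned pair across its setkey - do something -
delete key sequence and swaps it back in before returning, so any
pre-existing data for the key survives.  No malloc needed.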

> * Axe semaphore code in tpool.c, the ldap_lazy_sem_init() call in
>   slapd/daemon.c, and with them LDAP_PVT_THREAD_POOL_SEM_LOAD_CONTROL
>   in ldap_pvt_thread.h and SLAP_SEM_LOAD_CONTROL in slap.h.

OK.

> * Is the ldap_pvt_thread_pool_context_reset() call in slapd/init.c
>   indeed safe - i.e. its slap_sl_mem_destroy() call?  That's not during
>   startup, but in slap_destroy().  Slapd does various stuff after that.

Yes.

> * API changes:
> 
>   Do we include the assert(no multiple pools) in the import to RE23?
> 
>   Remove keys from ltu_key[] _before_ calling ltk_free() with them?
>   That prevents functions called from ltk_free from seeing freed thread
>   data, but it can also break ltk_free functions that call such
>   functions _before_ freeing it.  Probably not for RE23.
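
For the record, the remove-first ordering would just be the following
(sketch; the slot layout is written from memory):

    /* clear the slot before running the destructor, so anything
     * ltk_free() calls back into no longer sees the stale entry */
    void *key  = ctx->ltu_key[i].ltk_key;
    void *data = ctx->ltu_key[i].ltk_data;
    ldap_pvt_thread_pool_keyfree_t *kfree = ctx->ltu_key[i].ltk_free;

    ctx->ltu_key[i].ltk_key = NULL;
    if ( kfree )
        kfree( key, data );
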
> 
> * Scheduling: If several threads call pool_pause(), then once a pause
>   is in effect, tpool does not schedule all of them.  They could all be
>   handled during that pause, or another thread could undo the pause so
>   that tpool would have to wait to pause again.  Is that deliberate?

I don't understand the question. The point of the pause is to prevent 
any other thread (in the pool) from running. Why should tpool schedule 
any other threads at this point?
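
To spell out what the pause already does: both pool_wrapper() and
pool_pause() gate on ltp_pause under ltp_mutex, roughly (sketch,
simplified; the condition variable naming is from memory):

    ldap_pvt_thread_mutex_lock( &pool->ltp_mutex );
    /* someone else's pause is in effect: park here until released */
    while ( pool->ltp_pause )
        ldap_pvt_thread_cond_wait( &pool->ltp_cond, &pool->ltp_mutex );

So a second caller of pool_pause() simply parks at that gate until the
first pause ends; there is nothing for tpool to schedule in the
meantime.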

> Dubious code, new issues, old ones I'm not sure we finished with:
> 
> * thread_keys[] still has problems.
> 
>   - pool_context() breaks if several ldap_pvt_thread_t bit patterns can
>     represent the same thread ID: TID_HASH() would give different hashes
>     for the same thread, and pool_context() stops searching when it hits
>     a slot with ctx == NULL.  (My previous bug report was a bit wrong.)

How can a single thread ID be represented by multiple bit patterns?

>     The best fix would be to use real thread-specific data instead.
>     Just one key with the ctx for now; that minimizes the changes.  OTOH
>     it also means we'll do thread-specific key lookup twice - first in
>     pthread to get the ctx, and then in ltu_key[] to find our data.
>     Anyway, I said I'd do that later but might as well get on with it,
>     at least for pthreads.  Except I don't know if that's OS-dependent
>     enough that it should wait for RE24?

The best fix may be to just use the address of the thread's stack as the 
thread ID. That's actually perfect for our purpose.
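
E.g., if every pool thread is created with a fixed stack size, the
address of any automatic variable identifies its stack, and a division
turns that into a direct index.  Very rough sketch; TID_SLOT() and
THREAD_STACK_SIZE are made-up names here, and collisions/portability
would still need thought:

    /* hypothetical direct lookup keyed on the stack address;
     * assumes all pool threads get stacks of THREAD_STACK_SIZE
     * bytes, so distinct threads map to distinct slots */
    #define TID_SLOT( p ) \
        ( ( (unsigned long)(p) / THREAD_STACK_SIZE ) % LDAP_MAXTHR )

    void *ldap_pvt_thread_pool_context( void )
    {
        int here;   /* any stack variable will do */
        return thread_keys[ TID_SLOT( &here ) ].ctx;
    }

No hashing, no probing, and no ldap_pvt_thread_equal() comparisons.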

>     When that's not implemented, I'd prefer to cast the thread ID to an
>     integer when that is safe, and hash that instead of its bytes.
>     Won't work for structs/unions, and is not formally guaranteed for
>     pointers or, in the case of pthreads, even for integers.  BTW, when
>     it doesn't work, we've got a problem in ITS#4983 (tls.c id_callback)
>     as well.
> 
>     When that's not possible, we could search the entire table before
>     giving up.  Could make thread_keys[] smaller and reduce the maximum
>     allowed #threads on such hosts.  (What should the limit be?)
> 
>   - thread_keys is a poor hash table implementation with a poor hash
>     function.

Ideally it would not be a hash table at all, but a direct lookup based 
on the thread ID.

>     Oops - for caching, maybe I shouldn't have removed the thread id
>     from thread_keys[].  Or maybe the hash value should be stored there
>     instead, so we can compare with == instead of ldap_pvt_thread_equal.
> 
> * I wonder if there are long-lived ldap_pvt_thread_cond_wait(),
>   [r]mutex_lock() or rdwr_[rw]lock() calls which should make the thread
>   go inactive first (after waiting for any pause to finish), so that
>   (a) a pool_pause() will not block waiting for them, and
>   (b) all threads in the pool can't be blocked in long waits without
>       tpool knowing about it.

Long-lived lock calls would be quite a noticeable bug. This seems to be 
a non-issue.

> * back-bdb/tools.c uses pool_submit() to start long-lived threads:
>   bdb_tool_trickle_task() and bdb_tool_index_task().  They don't exit
>   until slapd is stopping, so they'll block slapd if a thread requests
>   a pause.  Also the pool thinks it has more available threads than it
>   does, so scheduled tasks might never be executed.
> 
>   Possibly that's not a problem with slaptools.  As complex as slapd is
>   now with overlays and plugins and whatnot, I have no idea.

Tool threads are completely separate from slapd threads. This is not an 
issue.

> * slapd/bconfig.c maintains its own idea of the max number of threads
>   in connection_pool_max and slap_tool_thread_max, which does not match
>   tpool's idea if the limit is <=0 or larger than LDAP_MAXTHR.
> 
>   Since slapd acts on its own count, I imagine that can cause breakage.
>   Simplest fix might be to set max #threads and then read it back;
>   I haven't checked.  Would need to return LDAP_MAXTHR instead of 0
>   for "as many as allowed".  Note however also the issue above:
>   tpool's idea of actually available threads can be too high too.

The practical limits are actually pretty low. Even if you're on a 1024 
processor machine with many terabytes of RAM, it's not a good idea to 
have many thousands of threads... Scheduler overhead would overshadow 
any useful work.
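
If we do go the set-then-read-back route, it could be as simple as the
following; note the getter is hypothetical, today only the
ldap_pvt_thread_pool_maxthreads() setter exists:

    /* sketch: let tpool clamp the requested value (including the
     * 0 -> LDAP_MAXTHR mapping suggested above), then adopt tpool's
     * number as our own so the two counts can't diverge.
     * ldap_pvt_thread_pool_maxthreads_get() is hypothetical. */
    int requested = new_value;  /* from the config directive */

    ldap_pvt_thread_pool_maxthreads( &connection_pool, requested );
    connection_pool_max =
        ldap_pvt_thread_pool_maxthreads_get( &connection_pool );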

> * I said my new ltp_pause waits might be too aggressive.  Maybe I should
>   clarify: I don't see any instances where they can be dropped, but it's
>   possible that slapd expects pauses to be less aggressive somewhere.

I don't understand this comment.
> 
> About pool_submit() and the ltp_mutex unlock-relock:
> 
>   Some history and notes, just to finish up previous discussion.
>   Some possible TODOs, but they are not needed since it's not broken:
> 
>   Originally, pool_submit() unlocked ltp_mutex before thread_create()
>   and re-locked it afterwards for further maintenance.  That was not
>   because of the semaphore code.  It looks like it could work even in
>   non-threaded mode where thread_create(,,fn,arg) just called fn(arg),
>   though tpool.c was always wrapped in an #ifndef <--without-threads>.
> 
>   I don't think non-threaded would have worked anyway since rev 1.24
>   (apr 2003), which introduced thread_keys[].  Caller and callee would
>   have gotten the same thread ID, so tpool would have confused their
>   contexts.  Previously the new context was pushed onto ltp_active_list,
>   and would have temporarily hidden the old context.
> 
>   That's also when thread_create() was moved after the ltp_mutex
>   re-lock, which meant the mutex was still held while pool_wrapper()
>   ran and tried to re-lock it.  And now, my rev.171 has
>   removed the unlock-relock.
> 
>   With that I can also remove the search for ctx in ltp_pending_list
>   when backing out of pool_submit; it will be at the head of that list.
> 
>   With tpool only supporting threaded mode, another cleanup could be for
>   pool_submit to start a thread before inserting ctx in ltp_pending_list
>   instead of after.  That done, it can fail if ltp_open_count is still
>   0, and we won't need the code for backing out of pool_submit().

OK.
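
Roughly like this, then (sketch; error paths and the list-macro details
are from memory):

    /* reordered pool_submit() core: make sure a thread exists
     * before queueing ctx, so a failure never has to back the
     * ctx out of ltp_pending_list */
    ldap_pvt_thread_mutex_lock( &pool->ltp_mutex );

    if ( pool->ltp_open_count == 0 ) {
        ldap_pvt_thread_t thr;
        if ( ldap_pvt_thread_create( &thr, 1,
            ldap_int_thread_pool_wrapper, pool ) )
        {
            /* nothing was queued, so there is nothing to undo */
            ldap_pvt_thread_mutex_unlock( &pool->ltp_mutex );
            return -1;
        }
        pool->ltp_open_count++;
    }

    LDAP_STAILQ_INSERT_TAIL( &pool->ltp_pending_list, ctx, ltc_next.q );
    ldap_pvt_thread_cond_signal( &pool->ltp_cond );
    ldap_pvt_thread_mutex_unlock( &pool->ltp_mutex );
    return 0;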

>   Oh, the comment /* no open threads at all?!? */:  That's not strange.
>   Happens in the first pool_submit() call(s) if thread_create() fails.
>   Though in that case there was no danger of removing the wrong ctx
>   from ltp_pending_list after all (fixed in rev 1.63).

(Sure, but thread_create() should never fail if you have a sane 
configuration.)

-- 
   -- Howard Chu
   Chief Architect, Symas Corp.  http://www.symas.com
   Director, Highland Sun        http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP     http://www.openldap.org/project/