
Re: (ITS#4943) tpool.c pause vs. finish

Some more comments, review, planned changes, and questions.

I guess I should summarize the bugs I've fixed in HEAD first, for
completeness.  (I was a bit too busy at the time.)  Restating the CVS
log comments, nothing new but maybe a bit clearer.  Possibly some of the
issues are tpool-only issues and can't happen the way slapd uses tpool.


rev 1.63:
ldap_pvt_thread_pool_submit(), after failed thread creation:
- ltp_pending_count++ should be --.
- should signal if there are no more threads, in case pool_destroy()
  is waiting for that.  (Not sure if that can happen.)
- The 'if(ltp_state...)' was spurious: It would always be true, and
  if it were not we'd get an eternal loop.

rev 1.64:
- would cancel states FINISHING and STOPPING, pool_resume() would set
  it back to RUNNING.
- could hang when waiting for another pool_pause(), since it did not
  signal ltp_pcond if it decremented ltp_active_count to 1.
- could update thread_keys[] during a pause while initializing and
  possibly while finishing,
- just after initializing, could handle a new ctx during a pause.
- would go on with destruction if pool_resume() was called, since it
  used an if(){wait} instead of while() and they use the same cond.

rev 1.65:
- pool_context() read thread_keys[] without mutex protection.
  pool_purgekey() too, but it expected a pause to protect the keys.
  pool_wrapper() protected with ltp_mutex while writing, and did not
  quite ensure a pause.

rev 1.66:
- thread_keys[] is a (poor) open-addressed hash table, but it lacked
  a "deleted item" mark.  So keys (contexts) could become hidden and
  re-emerge later if threads exited and then new ones were started.
- TID_HASH() did hash calculation in signed arithmetic, thus depending
  on friendly integer overflow.
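
The idea behind the unsigned-arithmetic fix can be sketched like this
(the real TID_HASH macro in tpool.c differs; the mixing constant and
the table size here are illustrative assumptions):

```c
#include <stddef.h>

#define LDAP_MAXTHR 1024	/* assumed power-of-two table size */

/* Hash a thread ID's bytes into a thread_keys[] slot.  All the
 * arithmetic is done in 'unsigned', where overflow wraps modulo
 * UINT_MAX+1 by definition, instead of in a signed type, where
 * overflow is undefined behavior in C. */
static unsigned
tid_hash( const void *tid_bytes, size_t len )
{
	const unsigned char *p = tid_bytes;
	unsigned h = 0;
	size_t i;
	for ( i = 0; i < len; i++ )
		h = h * 31u + p[i];	/* well-defined wraparound */
	return h % LDAP_MAXTHR;	/* slot index */
}
```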

rev 1.67:
- setkey/getkey expected ldap_int_thread_userctx_t.ltu_key[] to be
  NULL-terminated when not full.  purgekey() could set a NULL in
  the middle, hiding later keys until a new key overwrote the NULL.
API changes:
- setkey(data=NULL, kfree!=NULL) searched as if intended to reset
  the key, but updated by setting the key.  Now always updates.
- setkey(key=<not found>, data=NULL) could return either success or
  failure.  Now succeeds iff (data == NULL && kfree == NULL).
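
A toy model of the clarified semantics (the slot layout and names are
illustrative, not the real ltu_key[] code): a found key is always
updated, data == NULL && kfree == NULL means delete, and a delete of
an absent key succeeds as a no-op.

```c
#include <stddef.h>

#define MAXKEYS 4
typedef void (keyfree_fn)( void *key, void *data );
struct slot { void *key; void *data; keyfree_fn *kfree; };

/* Returns 0 on success, -1 on failure. */
static int
toy_setkey( struct slot *tab, void *key, void *data, keyfree_fn *kfree )
{
	int i;
	for ( i = 0; i < MAXKEYS; i++ )
		if ( tab[i].key == key ) {	/* found: always update */
			if ( data == NULL && kfree == NULL )
				tab[i].key = NULL;	/* delete */
			tab[i].data = data;
			tab[i].kfree = kfree;
			return 0;
		}
	/* key not found */
	if ( data == NULL )		/* succeeds iff also kfree == NULL */
		return ( kfree == NULL ) ? 0 : -1;
	for ( i = 0; i < MAXKEYS; i++ )
		if ( tab[i].key == NULL ) {	/* insert into a free slot */
			tab[i].key = key;
			tab[i].data = data;
			tab[i].kfree = kfree;
			return 0;
		}
	return -1;			/* table full */
}
```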

rev 1.68:
- ltp_max_count <= 0 meant allow an unlimited number of threads, but
  thread #LDAP_MAXTHR+1 and later would loop forever looking for a
  free thread_keys[] slot.  (Not fixed for multi-pool applications.)


I wrote in the tpool.c 1.66 -> 1.67 patch:

> API changes, need review - I'm not sure what's indented here:
> - setkey(data=NULL, kfree!=NULL) searched as if intended to reset
>   the key, but updated by setting the key.  Now always updates.

Right.  That matches the one place it can happen: In bdb_locker_id()
with setkey(,,data=(void *)(long)lockid,).  I don't know if lockid
can be 0, but it doesn't need to for data to be NULL.

But it would be cleaner to ch_malloc an integer instead of using the
(void*) cast.  Then it would not depend on integer->pointer->integer
conversions preserving the value either.
BTW, is there a reason for the cast to long?  lockid is u_int32_t,
and the getkey 'long' result is restored to a u_int32_t.
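
For illustration, a sketch of the allocation approach (plain malloc
here; slapd would use ch_malloc, and the stored pointer would be
released by the key's free function):

```c
#include <stdint.h>
#include <stdlib.h>

/* Instead of setkey(..., (void *)(long) lockid, ...), store the
 * lockid in heap memory.  The data pointer is then never NULL,
 * even when lockid == 0, and no integer<->pointer round-trip has
 * to preserve the value. */
static uint32_t *
dup_lockid( uint32_t lockid )
{
	uint32_t *p = malloc( sizeof *p );
	if ( p != NULL )
		*p = lockid;
	return p;
}
```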

Howard Chu wrote:

>>   Also if ctx is invalid (another thread handled it) and a new
>>   pending request has gotten the now-unused ctx, then the
>>   'if (pool->ltp_open_count == 0)' code will free that new
>>   pending request instead of the old one.
> Most likely this is due to unlocking and relocking the mutex around
> the semaphore code. Axe that, remove the gratuitous unlock/relock,
> and then that problem goes away.

Was that an instruction/permission to axe it or not?
If not I'll just move the unlock/lock into the #ifdef for now.

> Thus far we've only had to worry about a single pool. If anyone can
> come up with a reason to need multiple pools, I guess we need to
> hear it.

I'll insert "static int initialized; assert(!initialized++);"
in pool_init() and see if anyone complains.

>> - setkey does not free the old data with ltk_free.  Possibly that's
>>   intentional, but if so that requires weird code elsewhere.
> Yes, that's intentional. ltk_free is only intended for cleanup if a key
> is still set by the time a thread exits. All of the keys we set are live
> for the entire life of a thread, they don't get re-assigned.

Even the (void*)slap_sasl_bind key?
lutil_passwd() looks like it can be recursive:  Register a password
scheme which finds an entry in LDAP and authenticates via that
entry's userPassword - which again could call lutil_passwd().

If that's legal, this code in be_isroot_pw()/slap_passwd_check()
can break:
    pool_setkey(, slap_sasl_bind, <ctx>, <freefunc> );
    ... /* slap_passwd_check() does access_allowed() here */
    pool_setkey(, slap_sasl_bind, NULL, NULL);
The final setkey would need to restore the old key.

I'm not sure of contrib/slapd-modules/smbk5pwd/smbk5pwd.c + some
overlay either.  rwm + back-relay which uses another smbk5pwd or
something.  (And I have no idea of the BDB cache stuff, but since
you wrote it I assume you know:-)

If it's a problem, we can replace code like that with
    void *data = <ctx>;
    ldap_pvt_thread_pool_keyfree_t *kfree = <freefunc>;
    pool_swapkey(, slap_sasl_bind, &data, &kfree ); /* new function */
    pool_setkey(, slap_sasl_bind, data, kfree );
or if setkey reverts to ignoring kfree when it checks whether we are
setting or deleting a key, the kfree function can be passed directly.
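
A minimal sketch of what such a swap helper could look like (the name
and the slot layout are assumptions; the real version would hold
ltp_mutex and operate on ltu_key[]):

```c
#include <stddef.h>

#define MAXKEYS 32
typedef void (keyfree_fn)( void *key, void *data );
struct key_slot { void *key; void *data; keyfree_fn *kfree; };

/* Exchange the stored (data, kfree) pair for 'key' with the
 * caller's pair, so the caller can later restore the old values
 * with a plain setkey.  Returns 0 on success, -1 if not found. */
static int
toy_swapkey( struct key_slot *tab, void *key,
	void **datap, keyfree_fn **kfreep )
{
	int i;
	for ( i = 0; i < MAXKEYS; i++ )
		if ( tab[i].key == key ) {
			void *d = tab[i].data;
			keyfree_fn *f = tab[i].kfree;
			tab[i].data = *datap;
			tab[i].kfree = *kfreep;
			*datap = d;
			*kfreep = f;
			return 0;
		}
	return -1;
}
```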

I wrote:

> But this reminds me: How much should an ltk_free() function be allowed
> to do?
> - pool_context_reset() calls ltk_free().  Should pool_wrapper() go
>   active (wait for !pause, increment ltp_active_count, unlock ltp_mutex)
>   before calling pool_context_reset()?

I think so.  It makes slapi safer, and I don't see that it can hurt:
context_reset() can call connection_fake_destroy(), which via
slapi_int_free_object_extensions() calls slapi module destructors,
which can do who-knows-what.

Another stray thought: It would similarly be safer for context_reset()
and purgekey() to delete the key before calling ltk_free, so that
function can't access the key's data after freeing it.  OTOH that would
break any slapi module whose ltk_free calls a function which uses the
key before freeing it.  I'll leave it alone unless someone says otherwise.

Other matters:

pool->ltp_active_list is maintained but not used for anything.
Do I delete it?  Or does tpool maintain the list to ease debugging?

Use of tpool:

Another bug:
Slapd does not always check whether pool_submit() and pool_setkey()
succeed, even though they can fail without there being a bug: both
can run into limits.

I do not understand the calls to ldap_pvt_thread_pool_context_reset()
in slapd (init.c and bconfig.c).  That frees the thread's sl_malloc
memory with slap_sl_mem_destroy(), which seems rather hazardous until
the thread is really, really done.

For that matter, I guess that means context_reset() had better release
keys in the reverse order of how they were set.  That usually happened,
but not always, and my clear_key_idx() made the "not always" more
common.  So to close the gap from a deleted key I'll make it move all
later keys forward, instead of just moving the latest key to the gap.
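
The gap-closing change can be sketched like this (assumed layout:
key slots kept in set order, with a NULL key marking unused slots;
names are illustrative):

```c
#include <stddef.h>

#define MAXKEYS 32
struct key_slot { void *key; void *data; };

/* Remove the key at index i by shifting every later key forward,
 * preserving the order in which keys were set, so context_reset()
 * can release them strictly in reverse order of setting. */
static void
clear_key_idx( struct key_slot *keys, int i )
{
	for ( ; i < MAXKEYS - 1 && keys[i + 1].key != NULL; i++ )
		keys[i] = keys[i + 1];
	keys[i].key = keys[i].data = NULL;
}
```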

Bdb uses the database environment address as a key to pool_setkey().
So there can be at most ~30 BDB databases with an active key at the
same time in the same thread.  What does that mean - 30 overlays
all using bdb?  Access control groups spanning 30 databases?

Nitpick: Slapd uses both function pointers and data pointers as keys to
pool_[sg]etkey().  That makes it possible for two "different" keys to
compare equal, on a host where functions and data live in different
segments that otherwise share the same address space.  So I guess I
should clean up and replace (void*)<function> keys with the address of
    static char <function>_key;
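
That is, the replacement pattern would look something like this
(illustrative names; the commented-out call shows the intended use):

```c
/* Each module defines a file-scope object whose address serves as
 * the key.  Keys are then always data pointers, so two distinct
 * keys can never compare equal even on a host where function and
 * data pointers share the same address space in different
 * segments. */
static char slap_sasl_bind_key;	/* was: (void *) slap_sasl_bind */
static char bdb_locker_id_key;	/* illustrative second key */

/* ldap_pvt_thread_pool_setkey( ctx, &slap_sasl_bind_key, data, kfree ); */
```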