
Re: (ITS#3665) Multi-Listener Thread Support



Anton Bobrov wrote:

Increasing the number of pollers/readers should help significantly
on massively multi-CPU/core systems, but it needs some fine tuning
because you have to weigh it against the cost of processing those
requests; otherwise all you'll do is saturate your work queue, so
it's good to have some mechanism to cap it and apply the brakes
when needed. The real problems on those systems, though, come from
the cost of synchronization. A while back I had some play time
with a fully loaded T5440 [256 h/w threads] and did manage to get
it to 85% utilized with OpenDS. There was of course quite a number
of threads involved, and with them various synchronization issues
that matter at that scale and architecture but make an
insignificant difference on smaller systems - for example, when
you have multiple pollers/readers putting things on the work queue
and multiple worker threads taking things from it. The more
creative you can get at making things safely lockless, the better.
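
Agreed on the capping - the simplest form of those brakes is just a
bounded queue that blocks the pollers once it fills up. A rough sketch
with invented names (nothing like the actual slapd code), using plain
pthreads:

/* Capped work queue: pollers block in queue_put() once QUEUE_CAP
 * items are pending, instead of flooding the workers.
 * (Initialization of the mutex/condvars omitted for brevity.) */
#include <pthread.h>

#define QUEUE_CAP 1024

typedef struct work_queue {
    void *items[QUEUE_CAP];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} work_queue;

void queue_put(work_queue *q, void *item)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_CAP)        /* the "brakes": poller waits */
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[q->tail] = item;
    q->tail = (q->tail + 1) % QUEUE_CAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

void *queue_get(work_queue *q)
{
    void *item;
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    item = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return item;
}

Of course that puts one mutex in the hot path of every poller and
worker, which is exactly the synchronization cost you're describing.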

Unfortunately, at this time writing lockless algorithms means resorting to heavily machine-dependent code, and we've been trying to stick to standardized (e.g. POSIX) APIs. It would be pretty easy to write a CPU-cache-friendly producer/consumer queue in assembly language for a few specific architectures, and maybe doable using compiler-specific intrinsics, but our portability would go out the window.

(Which is not to say that the thought hasn't crossed my mind, numerous times already. I still have a very nice implementation I wrote in sparc assembly language kicking around here, but it seems that only x86-64 matters these days; that and ARM...)
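
To give an idea of what the intrinsics route would look like: below is a
minimal single-producer/single-consumer ring using the GCC/Clang __atomic
builtins. Purely a sketch with made-up names - it is not in the tree, and
with multiple pollers and workers you'd need either one ring per
poller/worker pair or a genuinely multi-producer design:

/* Hypothetical SPSC ring, not OpenLDAP code. head is written only by
 * the consumer, tail only by the producer; acquire/release ordering
 * makes the slot contents visible before the index update is seen. */
#include <stddef.h>

#define RING_SIZE 1024            /* must be a power of two */

typedef struct spsc_ring {
    void *slots[RING_SIZE];
    size_t head;                  /* consumer index */
    size_t tail;                  /* producer index */
} spsc_ring;

/* producer side: returns 0 if the ring is full */
static int ring_push(spsc_ring *r, void *item)
{
    size_t tail = __atomic_load_n(&r->tail, __ATOMIC_RELAXED);
    size_t head = __atomic_load_n(&r->head, __ATOMIC_ACQUIRE);
    if (tail - head == RING_SIZE)
        return 0;
    r->slots[tail & (RING_SIZE - 1)] = item;
    __atomic_store_n(&r->tail, tail + 1, __ATOMIC_RELEASE);
    return 1;
}

/* consumer side: returns NULL if the ring is empty */
static void *ring_pop(spsc_ring *r)
{
    size_t head = __atomic_load_n(&r->head, __ATOMIC_RELAXED);
    size_t tail = __atomic_load_n(&r->tail, __ATOMIC_ACQUIRE);
    void *item;
    if (head == tail)
        return NULL;
    item = r->slots[head & (RING_SIZE - 1)];
    __atomic_store_n(&r->head, head + 1, __ATOMIC_RELEASE);
    return item;
}

No locks here, but the builtins and the memory-ordering arguments are
exactly the compiler- and machine-specific part that makes this hard to
keep portable.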

On 02/08/2010 08:34, Emmanuel Lécharny wrote:
Here's the situation: suppose you have thousands of clients connected
and active. Even if you have CPUs to spare, the number of connections
you can acknowledge and dispatch is limited by the speed of the single
thread that's processing select(). Even if all it does is walk thru
the list of active descriptors and dispatch a job to the thread pool
for each one, it's only possible to dispatch a fixed number of
ops/second, no matter how many other CPUs there are.

I'm a bit surprised that the select() processing *is* the bottleneck...
All in all, it's just - internally - a matter of scanning a bit field to
see which bits are set to 1 and then getting back the FDs associated
with those bits. You must have some other tasks running that create
this bottleneck.

I will have to check OpenLDAP code here...


Right now on a 24 core server I'm seeing 48,000 searches/second and
50% CPU utilization. Adding more clients only seems to increase the
overall latency, but CPU usage and throughput don't increase any further.
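
For reference, the shape of the change this ITS is about is roughly the
following: carve the descriptors into N groups, give each group its own
poller thread, and have every poller do nothing but dispatch into the
shared worker pool. A sketch only, with invented helper functions rather
than the real slapd daemon code:

/* Hypothetical multi-poller skeleton - not actual slapd code.
 * pool_submit() hands an operation to the worker pool;
 * build_fdset() fills in the descriptors owned by poller 'id'
 * and returns the highest fd in the set. */
#include <pthread.h>
#include <sys/select.h>

#define NUM_POLLERS 4

extern void pool_submit(int fd);
extern int  build_fdset(int id, fd_set *readfds);

static void *poller_main(void *arg)
{
    int id = (int)(long)arg;
    for (;;) {
        fd_set readfds;
        int fd, maxfd = build_fdset(id, &readfds);
        if (select(maxfd + 1, &readfds, NULL, NULL, NULL) <= 0)
            continue;
        for (fd = 0; fd <= maxfd; fd++)
            if (FD_ISSET(fd, &readfds))
                pool_submit(fd);     /* dispatch only, never process here */
    }
    return NULL;
}

int start_pollers(void)
{
    pthread_t tid[NUM_POLLERS];
    long i;
    for (i = 0; i < NUM_POLLERS; i++)
        if (pthread_create(&tid[i], NULL, poller_main, (void *)i))
            return -1;
    return 0;
}

The dispatch rate then scales with the number of pollers instead of being
capped by the speed of a single select() loop.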



--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/