
Re: Large scale traffic testing



As always... just as you hit send on an email to an open mailing list..

It's the bandwidth, isn't it..

I'm so used to everything being 1000Mbit that I didn't spot the 100Mbit limit being hit.


Will continue investigations with that additional bit of info..! :)


Thanks!

On Mon, Sep 4, 2017 at 9:59 AM, Tim <tim@yetanother.net> wrote:
Cheers guys, 

Reassuring that I'm roughly on the right track - but that leads me on to other questions about what I'm currently experiencing while trying to load test the platform.

I'm currently using LocustIO, with a swarm of ~70 instances spread across ~25 hosts, to try to scale up the test traffic.
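
For reference, each simulated Locust client is basically doing something like this under the hood (rough sketch with ldap3; the host, DNs and credentials below are placeholders, and the Locust stats wiring is omitted):

    from ldap3 import Server, Connection

    # Placeholder connection details - not the real environment.
    server = Server('ldap://ldap.example.com:389')
    conn = Connection(server,
                      user='cn=loadtest,dc=example,dc=com',
                      password='secret',
                      auto_bind=True)

    # One synchronous search per simulated request, same pattern as the
    # original home grown scripts.
    conn.search('dc=example,dc=com',
                '(uid=user00042)',
                attributes=['cn', 'mail'])
    for entry in conn.entries:
        pass  # results are discarded; only throughput is of interest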

The problem I'm seeing (and the reason I was questioning my initial test approach) is that the traffic seems to be artificially capping out, and I can't for the life of me find the bottleneck.

I'm recording/graphing all of cn=monitor, all the resources covered by vmstat, and bandwidth - nothing appears to be topping out.
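
For the cn=monitor side I'm just polling it on an interval, roughly along these lines (bind details are placeholders; reading cn=Monitor needs the monitor backend enabled and suitable ACLs):

    from ldap3 import Server, Connection, SUBTREE

    server = Server('ldap://ldap.example.com')
    conn = Connection(server,
                      user='cn=admin,dc=example,dc=com',
                      password='secret',
                      auto_bind=True)

    # '+' requests the operational attributes, which is where back-monitor
    # keeps its counters (monitorCounter, monitorOpInitiated, ...).
    conn.search('cn=Monitor',
                '(objectClass=*)',
                search_scope=SUBTREE,
                attributes=['+'])
    for entry in conn.entries:
        print(entry)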

If I perform searches in isolation, throughput quickly ramps up to 20k/s and then just plateaus, while all system resources seem reasonably happy.

This happens no matter what distribution of clients I deploy (e.g. 5000 clients over 70 hosts or 100 clients over 10 hosts) - so I'm fairly confident that the test environment is more than capable of generating further traffic.

https://s3.eu-west-2.amazonaws.com/uninspired/mystery_bottleneck.png 

(.. this was thrown together in a very rough-and-ready fashion - it's quite possible that my units are off on some of the y-axes!)

I've performed some minor optimisations to try and resolve it (the number of available file handles was my initial hope for an easy fix..) but so far nothing's helped - I still see this capping of throughput before the key system resources even get slightly warm.

I had hoped that it was going to be as simple as increasing a concurrency variable within the config - but the one that does exist seems not to be valid for anything outside of legacy Solaris deployments?

If anyone has any suggestions as to where I could look for a potential bottleneck (either on the system or within my OpenLDAP configuration), it would be very much appreciated.


Thanks in advance



On Mon, Sep 4, 2017 at 7:47 AM, Michael Ströder <michael@stroeder.com> wrote:
Tim wrote:
> I've, so far, been making use of home grown python-ldap3 scripts to
> simulate the various kinds of interactions using many parallel synchronous
> requests - but as I scale this up, I'm increasingly aware that it is a very
> different ask to simulate simple synchronous interactions compared to a
> fully optimised multithreaded client with dedicated async/sync channels and
> associated strategies.

Most clients will just send those synchronous requests. So IMHO this is the right test
pattern and you should simply make your test client multi-threaded.
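
A minimal sketch of what I mean, assuming ldap3 and placeholder connection details (one connection per thread, each thread just looping on plain synchronous searches):

    import threading
    from ldap3 import Server, Connection

    THREADS = 20
    SEARCHES_PER_THREAD = 10000

    def worker():
        # One connection per thread - a synchronous ldap3 connection
        # should not be shared between threads.
        conn = Connection(Server('ldap://ldap.example.com'),
                          user='cn=loadtest,dc=example,dc=com',
                          password='secret',
                          auto_bind=True)
        for _ in range(SEARCHES_PER_THREAD):
            conn.search('dc=example,dc=com',
                        '(uid=user00042)',
                        attributes=['cn'])
        conn.unbind()

    threads = [threading.Thread(target=worker) for _ in range(THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()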

> I'm currently working with a dataset of in the region of 2,500,000 objects
> and looking to test throughput up to somewhere in the region of 15k/s
> searches alongside 1k/s modification/addition events - which is beyond what
> the current basic scripts are able to achieve.

Note that the ldap3 module for Python is written in pure Python - including the ASN.1
encoding/decoding. In contrast, the old Python 2.x https://python-ldap.org
module is a C wrapper around the OpenLDAP libs, so you might get better client
performance with it. Either way, you should spread your test clients over several
machines to really achieve the needed throughput.
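
For comparison, the equivalent search with python-ldap looks roughly like this (host and DNs are again placeholders); the ASN.1 encoding/decoding is then done inside the OpenLDAP C libs:

    import ldap

    conn = ldap.initialize('ldap://ldap.example.com')
    conn.simple_bind_s('cn=loadtest,dc=example,dc=com', 'secret')

    # search_s() blocks until the complete result set has been received.
    results = conn.search_s('dc=example,dc=com',
                            ldap.SCOPE_SUBTREE,
                            '(uid=user00042)',
                            ['cn', 'mail'])
    for dn, attrs in results:
        pass  # attrs maps attribute names to lists of values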

Ciao, Michael.

--
Tim
tim@yetanother.net