Re: write-scaling problems in LMDB

Luke Kenneth Casson Leighton wrote:
On Mon, Oct 20, 2014 at 1:00 PM, Howard Chu <hyc@symas.com> wrote:
Howard Chu wrote:

Luke Kenneth Casson Leighton wrote:


can i make the suggestion that, whilst i am aware that it is generally
not recommended for production environments to run more processes than
there are cores, you try running 128, 256 and even 512 processes all
hitting that 64-core system, and monitor its I/O usage (iostat) and
loadavg whilst doing so?

the hypothesis to test is that the performance, which should scale
reasonably linearly downwards as a ratio of the number of processes to
the number of cores, instead drops like a lead balloon.
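As a rough yardstick for "reasonably linearly", here is a minimal sketch of the ideal case. It assumes the scheduler divides CPU time evenly and that nothing else (locks, I/O, cache effects) gets in the way; `ideal_share` is a name made up for this illustration and is not part of any benchmark.

```python
def ideal_share(processes: int, cores: int = 64) -> float:
    """Fraction of one core each process gets under perfectly even scheduling.

    Assumption: no lock contention, no I/O stalls, no cache effects.
    """
    if processes <= 0 or cores <= 0:
        raise ValueError("processes and cores must be positive")
    return min(1.0, cores / processes)


# The process counts suggested above, on the 64-core machine.
for n in (64, 128, 256, 512):
    print(f"{n:4d} processes: {ideal_share(n):.3f} of a core each")
```

Measured per-process rates that fall well below these fractions would be the "lead balloon" case.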

Looks to me like the system was reasonably well behaved.

  and it looks like the writer rate is approximately halving with each
doubling from 64 onwards.

  ok, so that didn't show anything up... but wait... there's only one
writer, right?  the scenario where i am seeing difficulties is when
there are multiple writers and readers (actually, multiple writers and
readers to multiple envs simultaneously).

  so to duplicate that scenario, it would either be necessary to modify
the benchmark to do multiple writer threads (knowing that they are
going to have contention, but that's ok) or, to be closer to the
scenario where i have observed difficulties, to run the test several
times *simultaneously* on the same database.

  *thinks*.... actually in order to ensure that the reads and writes
are approximately balanced, it would likely be necessary to modify the
benchmark code to allow multiple writer threads and distribute the
workload amongst them whilst at the same time keeping the number of
reader threads the same as it was previously.

  then it would be possible to make a direct comparison (against the
figures you just sent), against the e.g. 32-threads case.  32 readers,
2 writers.  32 readers, 4 writers.  32 readers, 8 writers and so on.
keeping the number of threads (write plus read) below or equal to the
total number of cores avoids any unnecessary context-switching.

We can do that by running two instances of the benchmark program concurrently; one doing a read-only job with a fixed number of threads (32) and one doing a write-only job with the increasing number of threads.
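The run plan described here could be driven by something like the following sketch. It only works out the writer thread counts (doubling from 2 while readers plus writers stay within 64 cores) and starts two commands side by side. The actual benchmark command lines are not given in this thread, so `run_pair` takes them as arguments; the `noop` command is a stand-in so the sketch runs as-is, and both function names are invented for this example.

```python
import subprocess
import sys


def writer_counts(readers=32, cores=64, start=2):
    """Writer thread counts to test: double each time while readers + writers <= cores."""
    counts = []
    writers = start
    while readers + writers <= cores:
        counts.append(writers)
        writers *= 2
    return counts


def run_pair(read_cmd, write_cmd):
    """Start the read-only and write-only jobs concurrently and wait for both.

    Returns the two exit codes as (reader, writer).
    """
    reader = subprocess.Popen(read_cmd)
    writer = subprocess.Popen(write_cmd)
    return reader.wait(), writer.wait()


# Stand-in for the real benchmark invocations (an assumption, not the real tool).
noop = [sys.executable, "-c", "pass"]

for w in writer_counts():
    # A real run would substitute a 32-thread read job and a w-thread write job.
    print(f"32 readers, {w:2d} writers ->", run_pair(noop, noop))
```

With 32 readers on 64 cores this gives writer counts of 2, 4, 8, 16 and 32.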

  the hypothesis being tested is that the writers' performance overall
remains the same, as only one may perform writes at a time.

  i know it sounds silly to do that: it sounds so obvious that yeah it
really should not make any difference given that no matter how many
writers there are they will always do absolutely nothing (except one
of them), and the context switching when one finishes should also be
negligible, but i know there's something wrong and i'd like to help
find out what it is.

My experience from benchmarking OpenLDAP over the years is that mutexes scale only up to a point. When you have threads grabbing the same mutex from across socket boundaries, things go into the toilet. There's no fix for this; that's the nature of inter-socket communication.

This test machine has 4 physical sockets but 8 NUMA nodes; internally each "processor" in a socket is really a pair of 8-core CPUs on an MCM, which is why there are two NUMA nodes per physical socket.

Write throughput should degrade pretty noticeably as the number of writer threads goes up. When we get past 8 writer threads there's no way to keep them all in a single NUMA domain, so at that point we should see a sharp drop in throughput.
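The arithmetic behind the 8-thread threshold can be sketched as follows, using the topology described above (4 sockets, 2 NUMA nodes per socket, 8 cores per node). It is a best-case calculation: it assumes one writer thread per core, writers packed as tightly as possible, and ignores the cores taken by reader threads. The name `min_span` is invented for this illustration.

```python
import math

SOCKETS = 4            # physical sockets on the test machine
NODES_PER_SOCKET = 2   # each socket is an MCM holding two 8-core CPUs
CORES_PER_NODE = 8


def min_span(writers):
    """Fewest (numa_nodes, sockets) that can host `writers` threads, one per core.

    Best case only: assumes ideal pinning and no competing threads.
    """
    total_cores = SOCKETS * NODES_PER_SOCKET * CORES_PER_NODE
    if not 1 <= writers <= total_cores:
        raise ValueError(f"writers must be between 1 and {total_cores}")
    nodes = math.ceil(writers / CORES_PER_NODE)
    sockets = math.ceil(nodes / NODES_PER_SOCKET)
    return nodes, sockets


for w in (2, 4, 8, 16, 32):
    nodes, sockets = min_span(w)
    print(f"{w:2d} writers: at least {nodes} NUMA node(s), {sockets} socket(s)")
```

Under these assumptions, up to 8 writers fit in one NUMA node, 16 still fit in one socket, and 32 writers must span at least two sockets, so the writer mutex has to cross a socket boundary.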

  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/