
Re: Fwd: multiple sequential lmdb readers + spinning media = slow / thrashes?

On Thu, Feb 26, 2015 at 3:46 PM, Howard Chu <hyc@symas.com> wrote:
> Matthew Moskewicz wrote:
>> warnings: new to list, first post, lmdb noob.
>> https://github.com/BVLC/caffe/blob/master/src/caffe/util/db.cpp
>> however, if i *both*
>> 1) run  blockdev --setra 65536 --setfra 65536 /dev/sdwhatever
>> 2) modify lmdb to call posix_madvise(env->me_map, env->me_mapsize,
>> then i can get >1 reader to run without being IO limited.
> This is quite timing-dependent - if you start your multiple readers at
> exactly the same time and they run at exactly the same speed, then they will
> all be using the same cached pages and all of the readers can run at the
> full bandwidth of the disk. If they're staggered or not running in lockstep,
> then you'll only get partial performance.

thanks for the quick reply. to clarify: yes, this is indeed the case.
when/if the readers are reading 'near' each other (within cache size)
there is no issue, but over time they drift out of sync, and this is
the case i'm considering / when i'm having an issue. these are
long-running processes that loop over the entire 200GB lmdb many
times over days, at around 2 hours per epoch (iteration over all

when i say i can get >1 reader to be not IO limited with my changes, i
mean that things continue to work (not be IO limited) even as the
readers go out of sync. the processes happen to output information
sufficient to deduce when they have de-synced by more than the amount
of system memory in terms of the lmdb offset at which they are
reading. empirically: without my changes, for a particular 2 readers
case, the readers would reliably drop out of sync within a few hours
and slow down by at least ~2X (getting perhaps ~20MB/s bandwidth);
with the changes i've had 2 runs going to multiple days without issue.
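
to make the 'de-synced by more than system memory' check concrete, here is a minimal sketch in C. the names reader_gap and readers_desynced are made up for illustration (they are not from caffe or lmdb), and it assumes each reader reports a byte offset in [0, db_size) and wraps back to 0 at the end of each epoch. using total ram as the threshold is just the rule of thumb from above; the usable page cache is somewhat smaller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: distance between two read offsets in a db that the
 * readers loop over, so the gap is measured around the wrap point too. */
static uint64_t reader_gap(uint64_t off_a, uint64_t off_b, uint64_t db_size)
{
    assert(off_a < db_size && off_b < db_size);
    uint64_t d = off_a > off_b ? off_a - off_b : off_b - off_a;
    return d < db_size - d ? d : db_size - d;
}

/* Hypothetical helper: true once the readers are further apart than the
 * amount of memory that could hold cached pages (assumed to be ram_bytes). */
static bool readers_desynced(uint64_t off_a, uint64_t off_b,
                             uint64_t db_size, uint64_t ram_bytes)
{
    return reader_gap(off_a, off_b, db_size) > ram_bytes;
}
```

for a 200GB lmdb on a 32GB machine, readers at 10GB and 50GB count as de-synced, while readers at 190GB and 5GB do not, since they are only 15GB apart across the wrap.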

for my microbenchmarking i simulate the out-of-sync-ness and take care
to ensure i'm not reading cached areas, either by flushing the caches
or by just carefully choosing offsets into a 200GB lmdb on a machine
with only 32GB ram. i'd prefer to 'clear the cache' for all tests, but
that doesn't actually seem possible when there is a running process
that has the entire lmdb mmap()'d. that is, i don't know of any method
to make the kernel drop the clean cached mmap()'d pages out of memory.
but, caveats aside, i'm claiming that:

a) with the patch+readahead i get full read perf, even when the
readers are out of sync / streaming through well-separated (i.e. by
more than the size of system memory) parts of the lmdb.
b) without them i see much reduced read performance (presumably due to
seek thrashing), sufficient to cause the caffe processes in question to
slow down by > 2X.

>> for (2), see https://github.com/moskewcz/scratch/tree/lmdb_seq_read_opt
>> similarly, using a sequential read microbenchmark designed to model the
>> caffe reads from here:
>> https://github.com/moskewcz/boda/blob/master/src/lmdbif.cc
>> if i run one reader, i get 180MB/s bandwidth.
>> with two readers, but neither (1) nor (2) above, each gets ~30MB/s
>> bandwidth.
>> with (1) and (2) enabled, and two readers, each gets ~90MB/s bandwidth.
> The other point to note is that sequential reads in LMDB won't remain truly
> sequential (as seen by the storage device) after a few rounds of
> inserts/deletes/updates. Once you get any element of seek/random I/O in here
> your madvise will be useless.

yes, makes sense. i should have noted that, in this use model, the
lmdbs are in-order-write-once and then read-only thereafter -- they
are created and used in this manner specifically to allow for
sequential reads. i'd assume this is not actually reliable in general
due to the potential for filesystem-level fragmentation, but i guess
in practice it's okay. often, these lmdbs are being written to
spinners that are 'fresh' and don't have much filesystem level churn.