
Re: slow slapadd?

To: Oskar Pearson <oskar@deckle.co.za>
Subject: Re: slow slapadd?
From: Howard Chu <hyc@symas.com>
Date: Sun, 17 May 2009 23:56:35 -0700
Cc: Diego Figueroa <dfiguero@yorku.ca>, openldap-software@openldap.org
In-reply-to: <A43CC7E6-C417-4DA4-B5A5-9EEF2F4DCADE@deckle.co.za>
References: <OFE5EA9700.A6186731-ON852575B7.00621719-852575B7.0063CDA6@yorku.ca> <2C5E4A88E474B99E3142A701@STONEKING-LM.CORP.YAHOO.COM> <OFB3ADCF08.3B7B3B16-ON852575B7.006D2BAF-852575B7.006D5BEE@yorku.ca> <A43CC7E6-C417-4DA4-B5A5-9EEF2F4DCADE@deckle.co.za>
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; rv:1.9.1b5pre) Gecko/20090517 SeaMonkey/2.0a1pre Firefox/3.0.3

Oskar Pearson wrote:

Hi Diego

On 15 May 2009, at 20:54, Diego Figueroa wrote:


Thanks for your input Quanah,

I also just noticed that top is reporting 50-90% I/O waiting times.
I might have to look at my disks to further improve things.


You may be right, but that could be an over-simplification.

Random seeks will always create a performance slowdown on physical
disks. If you optimise the DB so that you reduce the number of random
seeks, you'll get dramatically faster performance.

Realise that if your db is, say, 200 MB, you could probably write the
whole file contiguously in 3-4 seconds on most server PCs. But if you
do 1 seek per object in your 500k item database on disks with a
reasonable seek time (say 6.5 ms), you'll be doing 500000 seeks * 6.5 ms
= 3250000 ms = 3250 seconds = 54 minutes.
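
The arithmetic above can be checked with a short sketch. The object count, seek time and database size are the illustrative figures from the paragraph; the 60 MB/s sequential throughput is an assumption chosen only to land in the quoted 3-4 second range, and the function names are made up for this example.

```python
# Back-of-the-envelope comparison of a seek-bound load vs. a sequential write.
# The figures are illustrative, not measurements.

def seek_bound_seconds(num_objects: int, seek_ms: float,
                       seeks_per_object: int = 1) -> float:
    """Total time spent seeking, in seconds."""
    return num_objects * seeks_per_object * seek_ms / 1000.0


def sequential_seconds(size_mb: float, throughput_mb_s: float) -> float:
    """Time to write size_mb contiguously at the given throughput."""
    return size_mb / throughput_mb_s


if __name__ == "__main__":
    seek_s = seek_bound_seconds(500_000, 6.5)
    seq_s = sequential_seconds(200, 60)  # 60 MB/s is an assumed throughput
    print(f"seek-bound: {seek_s:.0f} s (about {seek_s / 60:.0f} min)")
    print(f"sequential: {seq_s:.1f} s")
```

Raising seeks_per_object shows how quickly the estimate grows if each entry triggers more than one seek.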

http://www.oracle.com/technology/documentation/berkeley-db/db/ref/transapp/throughput.html
says that every write can involve the following operations:

	1 Disk seek to database file
	2 Database file read
	3 Disk seek to log file
	4 Log file write
	5 Flush log file information to disk
	6 Disk seek to update log file metadata (for example, inode information)
	7 Log metadata write
	8 Flush log file metadata to disk



So, what to do? Well, if you increase the cache settings, you'll see
fewer reads. If you assume each item above takes equal wall-clock time,
you could skip the first 3 items and cut the time by 37.5% (3 of 8
steps) for every one of the cache hits.
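
As a rough sketch of that estimate, the model below assumes, as the paragraph does, that all 8 steps cost the same and that a cache hit skips 3 of them. The function name and the hit rates are hypothetical; real step costs are not equal, so treat the output as a way to reason about the claim, not as a prediction.

```python
# Toy model: average cost of a write relative to the all-miss case,
# assuming 8 equal-cost steps and 3 steps skipped on every cache hit.

def relative_cost(hit_rate: float, total_steps: int = 8,
                  skipped_on_hit: int = 3) -> float:
    """Return the average per-write cost as a fraction of the no-cache cost."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return 1.0 - hit_rate * skipped_on_hit / total_steps


if __name__ == "__main__":
    for rate in (0.0, 0.5, 1.0):
        cost = relative_cost(rate)
        print(f"hit rate {rate:.0%}: cost {cost:.4f}, throughput x{1 / cost:.2f}")
```

Note that a 37.5% reduction in time per write corresponds to a 1.6x gain in throughput at a 100% hit rate, not a 1.375x gain.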

You could also put the log file on a separate disk.


That is standard practice, as recommended in the BDB docs.

Or you could
perhaps put the log file on a ramdisk for your build, and move it to a
stable disk after it completes. I'm assuming you don't have sufficient
ram to store the whole db on a ramdisk, which would be the ideal for
the build process.

Putting the logfile on a ramdisk or other volatile storage completely defeats
the purpose of the logfile...

You can also mount your filesystems with the noatime option (mount -o
noatime), which will help by removing step 6. Note you'll have to check
whether this breaks other things on your system.

You could also try fiddling with the DB_TXN_WRITE_NOSYNC and
DB_TXN_NOSYNC flags. I've not done that, and you'd have to be 100%
sure that once your db goes live, these flags are turned off, or you
risk disaster if your db server reboots. I wonder if it's possible for
slapadd to turn these on automatically for the load process (perhaps
it already does - I'm ignorant on that fact, unfortunately).

When using slapadd -q the transaction subsystem is disabled, so no synchronous
writes/flushes are performed by the main program. However, on some versions
a background thread may be spawned off to perform trickle syncs, which may
also be causing some seek traffic.

If you're feeling brave, and are building on a throwaway system (where
you can reinstall due to filesystem corruption), you could also use
something like hdparm under linux to change the disks so that they
always return writes as successful immediately, even if the data
hasn't been written to disk. I don't recommend this, but I've been
known to do it when testing on a dev system. I don't have any stats on
how much it'd help.

It wouldn't help at all since disks have such small on-board caches. Once the
drive cache fills, it's forced to wait for some queued I/O to complete anyway
before it can proceed.

Another thing: I read an article a while back where someone found that
InnoDB file fragmentation on MySQL dbs created a massive slowdown over
time with random small writes to the file. The solution was fairly
simple - move the files to a different directory, make a copy back
into the original directory, and start the db again running off the
copy. The new files will be written contiguously with very little
fragmentation. It's not possible to do this mid-stream in the load on
a new DB, but it may be a good practice once you have a very large
complete DB file that's been built over time.

Fragmentation is not an issue with back-bdb/hdb when creating new databases. I
don't think it's much of an issue on heavily used databases either, due to the
way BDB manages data.

The biggest factor is simply to configure a large enough BDB cache to prevent
internal pages of the Btrees from getting swapped out of the cache. The other
factor to consider is that BDB uses mmap'd files for its cache, by default. On
some OSes (like Solaris) the default behavior for mmap'd regions is to
aggressively sync them to the backing store. So whenever BDB touches a page in
its cache, it gets immediately written back to disk. On Linux the default
behavior is usually to hold the updates in the cache, and only flush them at a
later time. This allows much higher throughput on Linux since it will usually
be flushing a large contiguous block instead of randomly seeking to do a lot
of small writes. However, on BDB 4.7 it seems the default behavior on Linux is
also to do synchronous flushes of the cache. As such, one approach to getting
consistent performance is to configure the backend to use shared memory for
the BDB cache instead of mmap'd files. That way incidental page updates don't
sync to anything, and the BDB library has full control over when pages get
flushed back to disk.
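
For the cache size, a minimal sketch of rendering the Berkeley DB set_cachesize line for a DB_CONFIG file is shown below. The directive takes gigabytes, additional bytes and a cache-segment count; the helper name and the 512 MB figure are hypothetical, and the right size depends on your data. The shared memory side is not covered by this helper; to my knowledge it is selected with the shm_key directive in the back-bdb/hdb section of slapd.conf, so check slapd-bdb(5) for your version.

```python
# Render a DB_CONFIG "set_cachesize <gbytes> <bytes> <ncache>" line
# from a total cache size in bytes. The size used here is only an example.

GIB = 1024 ** 3


def set_cachesize_line(total_bytes: int, ncache: int = 1) -> str:
    """Split total_bytes into whole GiB plus remaining bytes."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    gbytes, remainder = divmod(total_bytes, GIB)
    return f"set_cachesize {gbytes} {remainder} {ncache}"


if __name__ == "__main__":
    # 512 MB is an assumed value, not a recommendation.
    print(set_cachesize_line(512 * 1024 * 1024))
```

The resulting line goes into the DB_CONFIG file in the database directory before the load is started.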


--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/

Follow-Ups:
- Re: slow slapadd?
  - From: Bill MacAllister <whm@stanford.edu>

References:
- slow slapadd?
  - From: Diego Figueroa <dfiguero@yorku.ca>
- Re: slow slapadd?
  - From: Quanah Gibson-Mount <quanah@zimbra.com>
- Re: slow slapadd?
  - From: Diego Figueroa <dfiguero@yorku.ca>
- Re: slow slapadd?
  - From: Oskar Pearson <oskar@deckle.co.za>
