
Re: ldapadd: hopelessly slow loading due to high disk iowait



At 11:55 AM 10/1/99 -0400, Kelley Hu wrote:
>Where can we read more about the defined attribute types and
>syntactical rules for LDAP, with concrete examples (anything
>besides the RFCs)?

Netscape has some decent guides in their devedge pages.
I also recommend the book "Understanding and Deploying LDAP
Directory Services" by Howes, Smith, and Good.

I also suggest reviewing the answers in this FAQ category:
  http://www.openldap.org/faq/index.cgi?file=219

You'll find links to a number of schema viewers as well as
a few examples.

>I must also admit that at a strictly information
>theoretical level, I don't understand the need for redundancy in
>the above example record.  Why does the cn value have to appear
>in the dn attribute?

Actually, the requirement is that the RDN component of the DN must
also be asserted as an attribute type/value in the entry.  It's an
X.500ism.

[I agree that requiring relationships between content of an object
and the object's name is counter to some of the principles of object
oriented design...]

>I'm troubled by the apparent ordering
>requirements for the *two* dc entries in the dn attribute.

DN components are ordered (little endian): the leftmost component
names the entry itself, and each component is relative to the
components to its right.
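
To make the ordering concrete, here is a minimal Python sketch (not
OpenLDAP code) that splits the example DN into its components and
lists the ancestor DNs.  The helper names split_dn and ancestors are
made up for illustration, and the split is naive: it assumes no
escaped or quoted commas inside values.

```python
def split_dn(dn):
    """Split a DN into its components, leftmost (most specific) first.

    Naive on purpose: assumes no escaped or quoted commas in values.
    """
    return [component.strip() for component in dn.split(",")]


def ancestors(dn):
    """Return the DNs of all ancestors, nearest parent first."""
    parts = split_dn(dn)
    return [",".join(parts[i:]) for i in range(1, len(parts))]


dn = "cn=1000/nl4/78654321,dc=NLM,dc=gov"
print(split_dn(dn))   # RDN first, then its parents
print(ancestors(dn))  # "dc=NLM,dc=gov" is the parent, "dc=gov" its parent
```

Swapping the two dc components would name a different parent
("dc=gov,dc=NLM"), which is why their order matters.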

>I'm not trying to quibble with LDAP conventions here, just to
>understand them better.  We certainly want to construct records
>that make sense in the LDAP universe.  But the underlying second-tier
>problem is to define records that get instantiated optimally in the
>underlying database mechanism.  This is why we were (naively, I
>suspect) trying to pare the records down, in the hope that we were
>not storing information that in fact we don't really need.
>
>> You also need, of course, an appropriate objectclass.
>
>So, if i understand correctly (and I'm not at all sure that I do 8^),
>one of our records could perhaps look like this:
>
>    dn: cn=1000/nl4/78654321,dc=NLM,dc=gov
>    cn: 1000/nl4/78654321
>    labeledURI: z39.50r://z3950.nlm.nih.gov/medline?78654321
>    labeledURN: 1000/nl4/78654321
>    objectclass: <domain>
>
>where I assume <domain> must be replaced with something appropriate,

Yes.  Also, you'd have to define the labeledURN attribute type.

>though I'm not sure *what*  8^)  Does that in effect define this
>record as belonging to a particular group of records that are
>searched together (does it map conceptually to a single relational
>database table, for example)?

No.  The objectclass attribute values define which schema rules
are to be applied to the entry.

>
>> >Can I trim this down further?
>>
>> No, you actually trimmed too much.
>
>But how about this new example?  ;^)

Here is my suggestion:
	Define a new structural objectclass, say nlmUniformResource.
	Require a CN attribute type and use it to store the URN.
	Allow labeledURI (or labeledURL) to store the associated URI
	(URL).

In OpenLDAP 1.2, I'd create a local.at.conf with:
	attribute labeledURI	ces

and a local.oc.conf with:
	objectclass nlmUniformResource
		requires cn
		allows labeledURI

and include both after the slapd.*.conf includes in slapd.conf.

Then an entry would be written in LDIF as:

dn: cn=1000/nl4/78654321,dc=NLM,dc=gov
cn: 1000/nl4/78654321
labeledURI: z39.50r://z3950.nlm.nih.gov/medline?78654321
objectclass: nlmUniformResource
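
As a rough illustration of what "applying schema rules" means for
such an entry, here is a Python sketch.  It is not how slapd is
implemented; the SCHEMA table and check_entry function are
hypothetical and only mirror the requires/allows lines above, plus
the rule that the RDN value must also appear in the entry.
Attribute names are compared case-insensitively, values exactly
(a simplification).

```python
# Hypothetical schema table mirroring the local.oc.conf sketch above.
SCHEMA = {
    "nlmuniformresource": {"requires": {"cn"}, "allows": {"labeleduri"}},
}


def check_entry(dn, attrs):
    """Return a list of problems; attrs maps attribute name -> list of values."""
    entry = {name.lower(): values for name, values in attrs.items()}
    problems = []
    classes = [c.lower() for c in entry.get("objectclass", [])]
    if not classes:
        problems.append("no objectclass")
    required, allowed = set(), {"objectclass"}
    for oc in classes:
        rule = SCHEMA.get(oc)
        if rule is None:
            problems.append("unknown objectclass: " + oc)
            continue
        required |= rule["requires"]
        allowed |= rule["requires"] | rule["allows"]
    for name in sorted(required - set(entry)):
        problems.append("missing required attribute: " + name)
    for name in sorted(set(entry) - allowed):
        problems.append("attribute not allowed: " + name)
    # The RDN (leftmost DN component) must also be a value in the entry.
    rdn = dn.split(",", 1)[0]
    rdn_type, _, rdn_value = rdn.partition("=")
    if rdn_value.strip() not in entry.get(rdn_type.strip().lower(), []):
        problems.append("RDN value not present in entry: " + rdn)
    return problems


entry = {
    "cn": ["1000/nl4/78654321"],
    "labeledURI": ["z39.50r://z3950.nlm.nih.gov/medline?78654321"],
    "objectclass": ["nlmUniformResource"],
}
print(check_entry("cn=1000/nl4/78654321,dc=NLM,dc=gov", entry))
```

An entry that carries sn, or that lacks an objectclass, would be
reported by this check.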


I would also suggest that you maintain (minimally) equality
indices for both cn and labeledURI.  This will allow speedy
forward and reverse mappings.
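
Conceptually, an equality index maps an attribute value to the IDs
of the entries holding that value, so a lookup does not have to scan
every entry.  The Python sketch below only illustrates that idea with
in-memory dictionaries; it says nothing about slapd's actual on-disk
index format, and the names build_eq_index and lookup are made up.

```python
# Toy entry store keyed by integer ID (illustrative data only).
entries = {
    1: {"cn": "1000/nl4/78654321",
        "labeledURI": "z39.50r://z3950.nlm.nih.gov/medline?78654321"},
    2: {"cn": "1000/nl4/78654322",
        "labeledURI": "z39.50r://z3950.nlm.nih.gov/medline?78654322"},
}


def build_eq_index(entries, attr):
    """Map each value of attr to the list of entry IDs that hold it."""
    index = {}
    for entry_id, entry in entries.items():
        index.setdefault(entry[attr], []).append(entry_id)
    return index


def lookup(index, value, want):
    """Return the 'want' attribute of every entry matching value."""
    return [entries[entry_id][want] for entry_id in index.get(value, [])]


cn_index = build_eq_index(entries, "cn")           # URN -> IDs
uri_index = build_eq_index(entries, "labeledURI")  # URI -> IDs

# Forward mapping (URN to URI) and reverse mapping (URI to URN).
print(lookup(cn_index, "1000/nl4/78654321", "labeledURI"))
print(lookup(uri_index, "z39.50r://z3950.nlm.nih.gov/medline?78654322", "cn"))
```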

>And how does gdbm hiding behind OpenLDAP actually instantiate such
>a record?

The on-disk id2entry format uses an integer ID (assigned in sequence)
as the key with a (pseudo) LDIF representation of the entry as the
value (pseudo because the ID is prefixed to the LDIF).

>Our preliminary timing studies have revealed some interesting and
>initially puzzling findings.
>
>To reiterate, our input file "ldap_list.txt" looks like this (in the
>studies to follow, we will address our failure to follow proper LDAP
>conventions in formatting records, discussed in earlier postings):
>
>   dn: dc=NLM, dc=gov
>   dc: NLM
> 
>   dn: cn=1000/nl4/78654321,dc=NLM,dc=gov
>   cn: 1000/nl4/78654321
>   sn: z39.50r://z3950.nlm.nih.gov/medline?78654321
>
>   dn: cn=1000/nl4/78654322,dc=NLM,dc=gov
>   cn: 1000/nl4/78654322
>   sn: z39.50r://z3950.nlm.nih.gov/medline?78654322
>
>   [... further records as above for a total of 1,000,000 ...]
>
>which we are loading into a database using the command:
>
>   time ./ldapadd -D "cn=root, dc=NLM, dc=gov" -wsecret < ldap_list.txt > /dev/null
>
>where the time command is being used to capture timing information for
>the ldapadd command.
>
>We took the advice of turning off immediate disk/memory synchronization
>(by setting "dbcacheNoWsync on" in slapd.conf) in order to reduce the
>large iowait values we were observing, and hence speed up database
>loading.
>
>However, although the iowait value initially started at a value near
>zero, it gradually crept up, and after 40 hours we had loaded only 530K
>of our million records and the iowait was back up to >95%, and the CPU
>time for the ldapadd process had dropped from its initial value of
>~0.5% to only ~0.05%!  We strongly suspect that there is an
>implementation flaw either in LDAP or the gdbm engine that is turning
>what should be a linear hashing problem into something that is binomial
>or cubic.  We are doing some timing studies with databases of size 100,
>1000, and 10,000 to estimate the degree of non-linearity, but the
>DEFINITIVE way of addressing this issue would be to rebuild LDAP and
>gdbm with profiling turned on, and see where in the code all this time
>is being spent.  This is going a bit beyond our immediate brief (to get
>a snapshot of current art) -- is anyone out there sufficiently
>challenged by these findings to do the profiling??  Any other ideas
>about what is happening here?

Well, there could be a number of reasons... (including bugs), but
I'd suspect you hit some limit in the file system implementation.

You might experiment with ldif2ldbm in this case (I generally don't
recommend ldif2ldbm).  However, since I assume you are programmatically
generating the LDIF, you can test fragments (using ldapadd) to ensure
correctness and then (after wiping the database files) use
ldif2ldbm.
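
For example, a generator along the following lines could emit both a
small test fragment and the full file.  This is only a Python sketch:
make_entry and make_ldif are made-up names, and the URN/URI patterns
and the nlmUniformResource objectclass are taken from the example
entry above.

```python
def make_entry(record_id, base="dc=NLM,dc=gov"):
    """Build the LDIF text for one record, ending with a newline."""
    urn = "1000/nl4/%d" % record_id
    lines = [
        "dn: cn=%s,%s" % (urn, base),
        "cn: " + urn,
        "labeledURI: z39.50r://z3950.nlm.nih.gov/medline?%d" % record_id,
        "objectclass: nlmUniformResource",
    ]
    return "\n".join(lines) + "\n"


def make_ldif(first_id, count):
    """Build LDIF for 'count' consecutive records, blank-line separated."""
    return "\n".join(make_entry(first_id + i) for i in range(count))


# A two-entry fragment, small enough to check by hand or with ldapadd.
fragment = make_ldif(78654321, 2)
print(fragment)
```

Once a fragment like this loads cleanly with ldapadd, the same
generator can write the full set of records for the bulk load.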

I suggest you also experiment with Sleepycat BerkeleyDB.  We
support both their hash and btree implementations, each of which
has different performance characteristics.

>It's important in your comments to help us differentiate effects that
>are arising from the LDAP engine as opposed to our (arbitrary) choice
>of gdbm as the underlying database engine.  Have other groups using
>database engines other than gdbm observed similar behavior?

I generally use BDB2 btree (your mileage may vary) and have not
observed the "slow down" you describe (then again, I don't
pay too much attention to load speeds).