
Re: ldapadd: hopelessly slow loading due to high disk iowait



Kurt (and OpenLDAP list members),

Thanks for a very lucid and helpful response.  I would like to follow
up with a few more questions.  Comments/questions intercalated below.
Please be patient, as we are truly trying to understand what we're
doing here so as to be able to represent LDAP fairly, and to ensure
that we are employing it in a way consistent with *correct* LDAP
usage.

To summarize our goals: we are doing timing and feature studies of a
number of different methods that have been proposed for resolving
unique identifiers (specifically, Uniform Resource Names, or URNs) to
URLs.  LDAP has been suggested as one avenue, so we have created a
small testbed using:

  Sun SPARC 2 under Solaris 2.6, with dual 200 MHz processors,
     256 MB RAM, two 2 GB internal drives, three 9 GB fast/wide SCSI
     external drives, 400 MB of shared /tmp and swap space split
     evenly between the two internal drives
  OpenLDAP version 1.2.7
  gdbm version 1.8
  test database information: 1 million records, each one consisting
     of a 45-character URL and a 17-character URN.

What follows is lengthy, but I believe the issues it raises are of
considerable practical importance in using LDAP, so I hope some of you
will persist to the end.

Thanks in advance for further enlightenment, and keep up the great
work with OpenLDAP!

Best Regards, Rick Rodgers

> From Kurt@openldap.org Thu Sep 30 12:23:13 1999
> To: Kelley Hu <khu@nlm.nih.gov>
> From: "Kurt D. Zeilenga" <Kurt@openldap.org>
> Subject: Re: what's the simplest ldif record openLdap can accept?
>
> At 12:09 PM 9/29/99 -0400, Kelley Hu wrote:
> >Dear openLdap users,
> >
> >
> >We are trying to evaluate openLdap as possible candidate for servers
> >for resolving Uniform Resource Names (URNs) to Uniform Resource
> >Locators(URLs).   In order to achieve the highest performance,  our
> >record should be trimmed down to bare minimum

> Actually, this is not generally true.  For highest performance,
> your data needs to be represented in a manner which is indexing
> and caching can be optimized for the search patterns most used.

Very nicely put!

> >yet comply with ldif format.
>
> s/comply with ldif format/comply with schema and other LDAP restriction/

You put your finger on the crux of our problem.  In a certain sense
we are perverting the originally intended use of LDAP (white-page-like
directory services) with this exercise (though your mention of
labeledURL and labeledURI below makes me realize that we are perhaps
not being as twisted as I had thought).  The challenge for us is to
employ LDAP to accept a URN and return one or more corresponding URLs,
in such a way that we are complying with correct LDAP conventions
while still building a database behind the LDAP engine that is
optimally speedy.  Our test version of OpenLDAP is currently using
gdbm as its database mechanism.  We wish to time both the creation of
a large database and retrievals from it.

> > What is the minimum per record the ldif required?
>
> What are the minimally requirements for a LDAP entry.
>
> It must have a DN, a structural objectclass, and attribute
> value to fulfill RDN requirements.
>
> >for example,  my one million records starts with:
> >
> >dn: dc=NLM, dc=gov
> >dc: NLM
>
> Though valid LDIF, the record does not represent a valid
> LDAP entry.  An LDAP entry should have a structural object
> class such as domain or organization.
>
> dn: dc=NLM,dc=gov
> dc: NLM
> objectclass: domain
>
> would represent a minimal LDAP entry.
>
>
> >Then, a million of records like:
> >
> >dn: cn=1000/nl4/78654321,dc=NLM,dc=gov
> >cn: 1000/nl4/78654321
> >sn: z39.50r://z3950.nlm.nih.gov/medline?78654321
>
> Same here, no objectclass.
>
> I don't recommend abusing sn for URIs, labeledURI (or the
> older labeledURL attribute type) would be a better choice.
>
> dn: cn=1000/nl4/78654321,dc=NLM,dc=gov
> cn: 1000/nl4/78654321
> labeledURI: z39.50r://z3950.nlm.nih.gov/medline?78654321

Where can we read more about the defined attribute types and
syntactical rules for LDAP, with concrete examples (anything
besides the RFCs)? I must also admit that at a strictly information
theoretical level, I don't understand the need for redundancy in
the above example record.  Why does the cn value have to appear
in the dn attribute?  I'm troubled by the apparent ordering
requirements for the *two* dc entries in the dn attribute.
I'm not trying to quibble with LDAP conventions here, just to
understand them better.  We certainly want to construct records
that make sense in the LDAP universe.  But the underlying second-tier
problem is to define records that get instantiated optimally in the
underlying database mechanism.  This is why we were (naively, I
suspect) trying to pare the records down, in the hope that we were
not storing information that in fact we don't really need.

> You also need, of course, an appropriate objectclass.

So, if I understand correctly (and I'm not at all sure that I do 8^),
one of our records could perhaps look like this:

    dn: cn=1000/nl4/78654321,dc=NLM,dc=gov
    cn: 1000/nl4/78654321
    labeledURI: z39.50r://z3950.nlm.nih.gov/medline?78654321
    labeledURN: 1000/nl4/78654321
    objectclass: <domain>

where I assume <domain> must be replaced with something appropriate,
though I'm not sure *what*  8^)  Does that in effect define this
record as belonging to a particular group of records that are
searched together (does it map conceptually to a single relational
database table, for example)?

> >Can I trim this down further?
>
> No, you actually trimmed too much.

But how about this new example?  ;^)
And how does gdbm hiding behind OpenLDAP actually instantiate such
a record?

> >All I need is a searchable name
> >(1000/nl4/78654324) / value
> >(z39.50r://z3950.nlm.nih.gov/medline?78654324) pairs.
>
> I recommend that you keep the URN and URI values in separate
> attributes.  This allows you to build useful equality indices.
>
> Kurt

Our preliminary timing studies have revealed some interesting and
initially puzzling findings.

To reiterate, our input file "ldap_list.txt" looks like this (in the
studies to follow, we will address our failure to follow proper LDAP
conventions in formatting records, discussed in earlier postings):

   dn: dc=NLM, dc=gov
   dc: NLM
 
   dn: cn=1000/nl4/78654321,dc=NLM,dc=gov
   cn: 1000/nl4/78654321
   sn: z39.50r://z3950.nlm.nih.gov/medline?78654321

   dn: cn=1000/nl4/78654322,dc=NLM,dc=gov
   cn: 1000/nl4/78654322
   sn: z39.50r://z3950.nlm.nih.gov/medline?78654322

   [... further records as above for a total of 1,000,000 ...]

which we are loading into a database using the command:

   time ./ldapadd -D "cn=root, dc=NLM, dc=gov" -wsecret < ldap_list.txt > /dev/null

where the time command is being used to capture timing information for
the ldapadd command.
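
For anyone who wants to reproduce the load, an input file of the shape
shown above could be generated with something like the following.  This
is only a sketch: the function names are made up for illustration, and
the assumption that identifiers simply count upward from 78654321 is
taken from the two sample records, not from our actual data.  The
records deliberately mirror our current (objectclass-less) format.

```python
def ldif_records(count, start=78654321):
    """Yield LDIF text blocks shaped like the records in ldap_list.txt.

    The first block is the base entry; the rest are URN/URL pairs.
    Sequential identifiers starting at `start` are an assumption.
    """
    yield "dn: dc=NLM, dc=gov\ndc: NLM\n"
    for i in range(count):
        ident = start + i
        urn = f"1000/nl4/{ident}"
        url = f"z39.50r://z3950.nlm.nih.gov/medline?{ident}"
        yield f"dn: cn={urn},dc=NLM,dc=gov\ncn: {urn}\nsn: {url}\n"


def write_ldif(path, count):
    """Write the base entry plus `count` records, blank-line separated."""
    with open(path, "w") as out:
        # Each block already ends in a newline, so joining with "\n"
        # leaves exactly one blank line between entries.
        out.write("\n".join(ldif_records(count)))


if __name__ == "__main__":
    # Small files (100, 1000, 10000 records) are handy for timing runs.
    write_ldif("ldap_list_100.txt", 100)
```

Generating the smaller files this way keeps the record format identical
across test sizes, so that only the record count varies between runs.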

We took the advice to turn off immediate disk/memory synchronization
(by setting "dbcacheNoWsync on" in slapd.conf) in order to reduce the
large iowait values we were observing, and hence speed up database
loading.

However, although the iowait value initially started near zero, it
gradually crept up, and after 40 hours we had loaded only 530K of our
million records, the iowait was back up to >95%, and the CPU time for
the ldapadd process had dropped from its initial value of ~0.5% to
only ~0.05%!  We strongly suspect that there is an implementation flaw
in either the LDAP or the gdbm engine that is turning what should be a
linear hashing problem into something that is quadratic or cubic.  We
are doing some timing studies with databases of size 100, 1000, and
10,000 to estimate the degree of non-linearity, but the DEFINITIVE way
of addressing this issue would be to rebuild LDAP and gdbm with
profiling turned on, and see where in the code all this time is being
spent.  This is going a bit beyond our immediate brief (to get a
snapshot of current art) -- is anyone out there sufficiently challenged
by these findings to do the profiling??  Any other ideas about what is
happening here?
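
To make "degree of non-linearity" concrete: if load time grows roughly
as n^k, then k is the slope of log(time) against log(n), and it can be
estimated from a handful of (size, seconds) measurements.  The sketch
below is one way this might be computed; the function name is ours, and
the sample timings are SYNTHETIC values generated from a known curve to
check the arithmetic, not measurements from our testbed.  The last
lines just restate the 530K-in-40-hours figure as an average rate.

```python
import math


def growth_exponent(samples):
    """Least-squares slope of log(seconds) versus log(record count).

    `samples` is a list of (record_count, seconds) pairs.  A result
    near 1 suggests linear loading, near 2 quadratic, near 3 cubic.
    """
    xs = [math.log(n) for n, _ in samples]
    ys = [math.log(t) for _, t in samples]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Synthetic check only: times generated from t = 0.002 * n**2,
# so the fitted exponent should come out as 2.
synthetic = [(n, 0.002 * n ** 2) for n in (100, 1000, 10000)]
print(round(growth_exponent(synthetic), 2))  # prints 2.0

# Average load rate implied by the figures reported above:
# 530,000 records in 40 hours is about 3.7 records per second.
average_rate = 530_000 / (40 * 3600)
print(round(average_rate, 2))  # prints 3.68
```

Three sizes give only a rough slope, and small databases may fit
entirely in cache, so the exponent seen at 100 to 10,000 records could
well understate what happens at a million.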

It's important in your comments to help us differentiate effects that
are arising from the LDAP engine as opposed to our (arbitrary) choice
of gdbm as the underlying database engine.  Have other groups using
database engines other than gdbm observed similar behavior?

We'll be sharing the final results of our studies with the community,
and we thank you in advance for helping us to make the study more
solid.  Thanks again -- we're looking forward to your further
illuminations...

 Rick Rodgers & Kelley Hu
 U.S. National Library of Medicine, Computer Science Branch
 Bethesda, MD
 (301) 435-3205      khu@nlm.nih.gov