
Re: HDB physical files separate from DIT



matthew sporleder wrote:
We were simply evaluating our options for an OpenLDAP redesign (read:
upgrades).  Here's a reply from one of our engineers:

----------
Thank you, Howard, for your reply. I would like to know how many BDB
files you used to achieve this great performance with 150 million
entries.

No more than usual - 1 id2entry file, 1 dn2id file, and an index file per indexed attribute.
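For illustration, here is roughly what that looks like on disk; the attribute choices and paths below are just an example, not a recommendation:

    # slapd.conf (fragment)
    database    hdb
    suffix      "dc=example,dc=com"
    directory   /var/lib/ldap
    index       objectClass     eq
    index       uid,mail        eq
    index       cn,sn           eq,sub

    # files in /var/lib/ldap afterwards:
    #   dn2id.bdb  id2entry.bdb
    #   objectClass.bdb  uid.bdb  mail.bdb  cn.bdb  sn.bdb
    #   plus the BDB environment and transaction log files

Even with 150 million entries, that is still the whole list of database files.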
My request is to collapse the DIT structure for our customers. I need
a single "ou=people" container and to be able to accommodate all
customer entries in this logical context. This container has to be
able to grow, to many, many millions of entries. Currently we are
thinking of a number somewhere between 20 and 50 million. However,
because of scale limitations in the past we had to split our people
container into 14 sub-OUs, named after cities.

There are no scale limitations in back-bdb/back-hdb. I don't know what you were using in the past, but our tests show that OpenLDAP's scaling is limited only by the available disk and memory space.
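(Most of "available memory" in practice means whatever you give the BDB environment cache. Purely as a sketch - the size here is illustrative, not a recommendation - that cache is set in DB_CONFIG in the database directory:)

    # DB_CONFIG
    # 2 GB cache, plus 0 bytes, in 1 region
    set_cachesize 2 0 1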


A commercial LDAP vendor is highlighting one of its features that
would allow us to have an arbitrary number of physical files belonging
to the same OU. We would be able to reduce the physical file size and
gain performance through shorter searches, smaller indices, and
whatever else benefits from small files. It would greatly reduce our
administrative cost/burden and avoid some costly moving of
people/entries in our environment.



Your commercial vendor is taking one of its architectural flaws (scaling limits) and trying to spin it as an asset. Fortunately for the folks on this list, the OpenLDAP project does not have any marketing department sitting around spinning lies, nor could any lies survive for long since the project and code are totally open.


Here's the truth - the data structures and algorithms used in OpenLDAP are mostly of logarithmic complexity. Some are linear, some are constant, but where scaling is an issue, it's all logarithmic. These structures and algorithms - B-trees, AVL trees, binary searches, etc. - all operate on a divide-and-conquer principle, which means every large problem is already, inherently, divided into sets of smaller problems. E.g., when all else is equal, a single binary search in one long list (length N) is just as fast as the split approach of picking the right one of two half-length lists (length N/2) and then searching it: log2(N) = 1 + log2(N/2). In fact all else is not equal; with a single structure you only do the setup once, so it's faster to keep everything together. This is all basic, obvious stuff.
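To put rough numbers on that (purely illustrative, assuming the 50 million entries mentioned above and idealized balanced structures):

    one ou=people container:                 log2(50,000,000)      ~ 26 comparisons per lookup
    14 sub-OUs, correct one known up front:  log2(50,000,000 / 14) ~ 22 comparisons
    14 sub-OUs, correct one not known:       14 x 22               ~ 300 comparisons,
                                             plus 14 setups instead of one

At best the split saves a few comparisons per lookup - work the single tree's first few comparisons were doing anyway - and at worst it multiplies the work.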

Also, a lot of our LDAP-dependent applications could be simplified.
There is no need for our business to know where a customer is coming
from. That information has no value to us, but we are unable to get
rid of it.

There's nothing in OpenLDAP requiring you to operate this way.
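(For concreteness, the collapsed layout being asked for is just every entry directly under one container - the names below are made up:)

    dn: uid=jdoe,ou=people,dc=example,dc=com
    objectClass: person
    objectClass: organizationalPerson
    objectClass: inetOrgPerson
    uid: jdoe
    cn: Jane Doe
    sn: Doe

instead of uid=jdoe,ou=<some city>,ou=people,dc=example,dc=com, and nothing in back-bdb/back-hdb cares how many entries sit under ou=people.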

---------

To expand a little bit, splitting the tree allowed us to keep database
files small and recoverable from another source, and we could split
the database across different disks for I/O gains - although more
intelligent indexing could probably help a lot in these respects.

Spreading databases across multiple disks is usually accomplished by striping or spanning (e.g., using logical volumes, RAID, or whatever) and really doesn't need to be addressed at the database level.
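(A sketch of what that looks like below the database - device names, filesystem, and mount point are made up for illustration:)

    # stripe two disks into one block device and put the BDB directory on it
    mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb1 /dev/sdc1
    mkfs -t ext3 /dev/md0
    mount /dev/md0 /var/lib/ldap

slapd and BDB just see one directory; the block layer spreads the I/O across both disks.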



Thanks,
_Matt

On 1/4/06, Howard Chu <hyc@symas.com> wrote:
matthew sporleder wrote:
I'm trying to figure out if I can decouple a database's logical layout
(DIT) from the specific files behind each 'database' definition,
and I'm not seeing any good tips in the Berkeley DB tuning docs.

For example:

I have ou=region1,dc=example,dc=com and ou=region2,dc=example,dc=com.
Right now the only option I see for separating these is to define them
in different 'database' sections.  I would, however, like to have
them both defined in one database, but allow the actual database files
(dn2id, etc.) to be split by size or by other criteria (usage stats,
whatever).

Am I missing something obvious in DB_CONFIG like "max_file_size"?

No, there's no such feature. Nor does it sound like it would be useful,
given what little you've described so far. Even if you allowed a
particular DB file to be split, all of the files would still occupy
space in the single BDB environment cache. In fact, since each DB handle
also consumes cache space, splitting files would consume more resources
than otherwise. Given that we've benchmarked a directory with 150
million entries consuming about a terabyte of disk space, using the
current back-bdb code, getting tens of thousands of operations per
second throughput, I don't see any particular reason to bother with
splitting the files. Perhaps if you explained what real problem you're
trying to solve, it might make a bit more sense.

--
 -- Howard Chu
 Chief Architect, Symas Corp.  http://www.symas.com
 Director, Highland Sun        http://highlandsun.com/hyc
 OpenLDAP Core Team            http://www.openldap.org/project/