[Date Prev][Date Next] [Chronological] [Thread] [Top]

Re: back-bdb IDL limitations




The proposed scheme can increase the size of IDLs for attribute indices (esp. substring indices). If neighboring indices in an IDL are spaced with large intervals, the proposed scheme can either 1) double the storage required to represent an ID in the IDL (because lo/hi are needed instead of a single ID), or 2) cause IDs of many irrelevant entries to be included in the IDL.

Rather coincidentally, I've also prepared patches to optimize indexing performance.

The first one is targeting to accelerate slapadd based on the observation that IDs are added in a monotonically ascending order during slapadd. Instead of writing to the index database upon addition of every single ID, slapadd can be made to buffer multiple IDs to make an ID cluster and writes the cluster as a single unit.

An IDL for an index hash will look like the following:

Header consisting of NOID and # of IDs in the IDL
ID cluster 1
ID cluster 2
ID cluster 3
...
ID cluster N

The header and ID clusters are keyed with the same hash value and they are sorted in an ascending order (NOID is treated as the smallest in the dup function). As a result, the number of writes to the index databases can be reduced by the factor of the cluster size. For example, if the cluster size is 1024, the number of writes to the index database can be reduced by up to 1024 and this significantly improves the performance of slapadd.

If the IDs in an IDL are rather contiguous, the range optimization you suggested can help. Otherwise, it won't be able to buy us much. I suspect this be the case for the subtree indices for general ldifs (by the term general, I mean not specially sorted for a naming attribute, for example, ou) and substring indices.

For non-tool mode addition (ldapadd), if the currently added ID is the greatest of all IDs in the IDL, it is added as a single ID stored as the end data for the index key. If the currently added ID is not the greatest, then it is added into the ID cluster it belongs (where lo <= ID <= hi holds). If the size of the cluster becomes larger than the max cluster size, it is split into half. To delete an ID from an IDL, the corresponding ID cluster need be located. The cluster will be read, modified, and written in order to remove the ID from the cluster.

The second one is targeting to accelerate the overall index database performance by using a bit-map to represent IDLs. The compaction achieved by using a bit map instead of using a pointer ID can result in the reduced number of data items for an index key in the index database. This in turn leads to the improved btree search performance. A bit map structure called the segment bit-map can be used for this purpose. A segment bit-map consists of a pointer and a bit map. For example, let's assume that a segment bit-map is 32 bits long of which 24 bits are assigned to the pointer and 8 bits are to the bit map.  The pointer represents the 8-bit aligned position of the bit map in the entire ID space while the bit map represents the indexing status for 8 contiguous indices whose position is determined by the pointer. In this example, the segment bit-map can cover indices for (2**24) * 8 entries (128M entries). As a concrete example, a segment bit-map having its pointer field of 2 and its bit map field of 0xA5 tells that the IDs 16, 18, 21, 23 are the indexed entries. We can have multiple of them to represent an entire IDL in the same way we had multiple IDs to represent an IDL. If we use 64 bit long segment bit-maps, then we can increase the coverage that a single segment bit-map can provide and can increase the number of indexed entries. For example if we assign 32 bits to the pointer field and 32 bits to the bit map field, a 64 bit segment bit-map can store indexing values for 32 contiguous entries and it can support up to 128G entries.

I think this bit-mapped approach provides more workable solution than the range ID solution for the case where IDs in an IDL are rather contiguous. A 32 bit segment bit map covers 8 entries having contiguous IDs without increasing storage requirement and with the ability of pinpointing the exact indexing status of the 8 entries in the contigous range. On the other hand, the range ID consumes double the storage, but it cannot pinpoint the indexing status of the indiviual entries in the range once it is scaled. A 64 bit segment bit map covers 32 entries having contigous IDs with the same amount of storage as in the range ID scheme, but it can provide exact indexing for those 32 entries unlike the range ID approach with scaling.

- Jong-Hyuk

------------------------
Jong Hyuk Choi
IBM Thomas J. Watson Research Center - Enterprise Linux Group
P. O. Box 218, Yorktown Heights, NY 10598
email: jongchoi@us.ibm.com
(phone) 914-945-3979    (fax) 914-945-4425   TL: 862-3979



Howard Chu <hyc@symas.com>
Sent by: owner-openldap-devel@OpenLDAP.org

03/03/2005 05:35 PM

       
        To:        openldap-devel@OpenLDAP.org
        cc:        
        Subject:        back-bdb IDL limitations



At the risk of seriously complicating the IDL code, I wonder if it's
time to investigate a more compact IDL representation - perhaps using
ranges exclusively, where each IDL is a list of ranges.
There would still be pathological cases where discontiguous IDs would
consume a lot of space, but they should be pretty rare.

IDL[0] would still hold the number of slots in use. Each slot is a lo/hi
pair. So we can express any contiguous range with a single pair, and
handle discontinuities with additional ranges. An IDL cursor would be a
(slot,value) pair instead of the single integer we currently use. A
scheme like this could dramatically reduce the size of IDLs overall.

Borrowing from some other data compression techniques - we can also
assign a scale factor if needed, to further expand the dynamic range of
the IDLs.

For example, assume we have an IDL that can only accomodate 4 slots, and
each slot starts with a scale factor of 1. The first slot carries a
range 1-5, then the data skips one, so we use a second slot for IDs
7-12, skip a few more, slot 3 covers 15-17, skip some more, and slot 4
fills up with the range 20-30. Now we want to insert the value 32, but
there are no more slots free. We bump up the scale factor to 2. This
means that skips of 2 or less are treated as contiguous. (Think of it as
rounding, coarsening the granularity, or shifting one bit.) So the first
slot covering 1-5 is now considered contiguous with the second slot at
7-12, so they're combined into one slot. The next slot remains
unchanged, and the next slot 20-30 is contiguous with 32, so it becomes
20-32.

So we started with this IDL:
  4,1   4 slots, scale factor 1
  1,5
  7,12
  15,17
  20,30

And ended with this IDL:
  3,2   3 slots, scale factor 2
  1,12
  15,17
  20,32

Obviously for the real implementation we would use more than 4 slots.
But I don't think a fixed size implementation like we have now is useful
for the sizes of databases we're seeing more and more. Nor is collapsing
it all into a single range very helpful for indexed lookups; the quality
of the indices really suffers. If we use a dynamic scaling approach like
I'm talking about, there will still be some losses in granularity but
the effectiveness of the indices will degrade much more slowly as the
database sizes grow.

--
 -- Howard Chu
 Chief Architect, Symas Corp.       Director, Highland Sun
 http://www.symas.com               http://highlandsun.com/hyc
 Symas: Premier OpenSource Development and Support


Attachment: smime.p7s
Description: S/MIME Cryptographic Signature