[Date Prev][Date Next] [Chronological] [Thread] [Top]

Re: large multivalued attributes



Kurt D. Zeilenga writes:
>At 12:09 PM 8/16/2006, Howard Chu wrote:
>>
>> This issue has been around for a long time, and I think we're going
>> to need to deal with it soon, as we see sites creating groups with
>> thousands of members.

Yup.  Our site currently has 6 with >20000 members, 7 more with >5000.

>> There's a couple different aspects that need to be addressed.
>>  1) finding an individual value in a large multivalued attribute is
>>     inefficient(all linear searches).
>>  2) updating a single indexed value basically requires regenerating
>>     the index for the entire attribute.
>>
>> The only obvious solution to 1) seems to be to always maintain
>> multivalued attributes in sorted order, so that we can use binary
>> search to locate values.

Or support a different data structure for attributes, so we can use
hashes.

>> This would make filter comparisons faster,
>> as well as address things like ITS#4600, and locating values to
>> modify in modify operations.
>
> I wonder, for small numbers of values, whether sorting would have
> any significant impact on performance. Probably in the noise.

In any case, it's a nice feature that e.g. phone numbers in an entry can
be returned in the preferred order.  So if it doesn't get cumbersome, I
would prefer to be able to turn a "don't reorder" feature on in
slapd.conf.


I may be misunderstanding what you say below, but anyway:

>> The most efficient fix for 2) requires us to use an index mechanism
>> that avoids collisions, so that we can safely remove index keys for
>> individual values without worrying about losing indexing for an
>> unrelated value. I don't see that actually happening, though.
>
> I doubt it actually possible (given subtyping issues).

Subtyping or no, that would more or less need the index key to be the
value itself or a reversible encoding of it, rather than a hash.

>> The next alternative would be to use an index mechanism that allows
>> duplicate keys, so that we can accumulate them when collisions
>> occur. Then removing a single instance for one value doesn't affect
>> the instances for other values.
>
> One alternative would be, after the number of values exceeds some
> limit, to leave old keys in place (just add keys for new values).

And maybe keep some sort of estimate of how much "historical data" is
left, and regenerate when it grows too high.

>> We could accomplish the latter goal by changing the index structure
>> in the DB. Currently we use a hash key and the data is a list of
>> entryIDs for entries that match the hash. We could change the list to
>> be entryID+ counter, so we can keep track of these collisions.
>>
>> Unfortunately keeping a counter like this would make slapindex a lot
>> harder, we'd somehow have to know not to increment the counter when
>> we're just re-indexing something that was already indexed.
>> (Or we could just empty out/truncate the entire index before
>> processing anything.)

"slapindex because slapd.conf 'index' statements were changed" would
know what to do if BDB maintained a list of currently indexed attributes
in the database directory.  Which would be nice to have anyway, since
it's a fairly common error to change slapd.conf and not reindex.

"slapindex to try to fix broken/inconsistent database" could truncate
and reindex.

-- 
Hallvard