[Date Prev][Date Next]
Re: equality indices & hashes
Or maybe go to a 5 byte key: 1 byte type, 4 byte hash.
We have more key types than 4 (present, equality, subinit,
subany, subfinal), and I rather not reduce the hash size
(if anything, I rather increase it to reduce collisions
further within each kind).
At 09:41 AM 10/15/2004, Howard Chu wrote:
>Hallvard B Furuseth wrote:
>>As far as I can tell, an equality index is an index of hashes of the
>>normalized values, not of the entire normalized values. Also, I think
>>an equality index can have hash collisions with substring indices.
>Yes. Pretty sure I raised this issue a long time ago, when 2.1 was still in Alpha. I've definitely seen these collisions occur.
>>If a value's equality hash collides with a substring hash which matches
>>a lot of values (e.g. the last 4 characters of an e-mail address), it
>>will be impossible or very slow to look up that value by an equality
>>filter. I expect the same can happen with presence indices, though I
>>haven't checked the code for that. It will be rare, but when or if it
>>happens, it will be experienced as a _very_ mysterious bug.
>The key for presence indexing is not generated by the frontend. In either back-bdb or back-ldbm the likelihood of a collision is vanishingly small. Nor is there any consequence if any collision with the presence key does occur.
>>So I suggest that 2 bits of the hash value are reserved to indicate
>>whether it is an equality, presence, substring or approx hash.
>Sounds like a good idea. Right now an indicator is fed into the hash input, and that obviously is still prone to collisions. Explicitly masking it onto the hash output would cure that issue. Also sounds like something that will have to wait for RE23 since it will cause an incompatible format change.
> -- Howard Chu
> Chief Architect, Symas Corp. Director, Highland Sun
> http://www.symas.com http://highlandsun.com/hyc
> Symas: Premier OpenSource Development and Support