[Date Prev][Date Next] [Chronological] [Thread] [Top]

Re: equality indices & hashes

Hallvard B Furuseth wrote:

As far as I can tell, an equality index is an index of hashes of the
normalized values, not of the entire normalized values. Also, I think
an equality index can have hash collisions with substring indices.

Yes. Pretty sure I raised this issue a long time ago, when 2.1 was still in Alpha. I've definitely seen these collisions occur.

If so:

If a value's equality hash collides with a substring hash which matches
a lot of values (e.g. the last 4 characters of an e-mail address), it
will be impossible or very slow to look up that value by an equality
filter. I expect the same can happen with presence indices, though I
haven't checked the code for that. It will be rare, but when or if it
happens, it will be experienced as a _very_ mysterious bug.

The key for presence indexing is not generated by the frontend. In either back-bdb or back-ldbm the likelihood of a collision is vanishingly small. Nor is there any consequence if any collision with the presence key does occur.

So I suggest that 2 bits of the hash value are reserved to indicate
whether it is an equality, presence, substring or approx hash.

Sounds like a good idea. Right now an indicator is fed into the hash input, and that obviously is still prone to collisions. Explicitly masking it onto the hash output would cure that issue. Also sounds like something that will have to wait for RE23 since it will cause an incompatible format change.

 -- Howard Chu
 Chief Architect, Symas Corp.       Director, Highland Sun
 http://www.symas.com               http://highlandsun.com/hyc
 Symas: Premier OpenSource Development and Support