
Re: LDAPprep: mapping of " " values




On 19-Nov-04, at 12:05 PM, Kurt D. Zeilenga wrote:

But won't match, as I believe was intended, "JohnEllsworth".

Actually, I think the average person searching a directory would be surprised at a concatenated match like that. It is actually quite common for people to do searches which are effectively (with my proposal)


  (cn=* J* Ell*)

Which they would probably expect to match "John Ellsworth" and "Jim Ellsworth" but not "Jell-O Corporation of America". But maybe I'm wrong. (Had they wanted the more inclusive match, they could always have left out the second space, so it may be reasonable to assume they put it there for some reason.)
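A rough sketch of the matching behaviour described above, in Python. It assumes (the message does not spell this out) that the value is case-folded, has runs of spaces collapsed, and is padded with a single space at each end before comparison, so that a leading space in a substring can match the start of the value. The names prep and matches_any are hypothetical, and only the "any" substrings of a filter are handled.

```python
import re


def prep(value):
    """Case-fold, collapse runs of spaces, and pad with one space at each end.

    This is an assumed simplification, not the preparation the draft defines.
    """
    collapsed = re.sub(r" +", " ", value.strip(" "))
    return " " + collapsed.casefold() + " "


def matches_any(value, any_parts):
    """Return True if every part occurs in order, without overlap.

    A space inside a part is matched literally, so " Ell" only matches
    where "ell" starts a word in the prepared value.
    """
    haystack = prep(value)
    pos = 0
    for part in any_parts:
        needle = re.sub(r" +", " ", part).casefold()
        found = haystack.find(needle, pos)
        if found < 0:
            return False
        pos = found + len(needle)
    return True
```

With the parts [" J", " Ell"] taken from the filter above, this accepts "John Ellsworth" and "Jim Ellsworth" and rejects both "Jell-O Corporation of America" and "JohnEllsworth".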

But won't match, as I believe was intended, "Angola&Mozambique".

Ah, well, there you have me. It wouldn't, and that was perhaps not a great example. Nor, as written, would it match "Angola, Mozambique, and other southern African countries". Matching is an inexact science :) And in any case, it is reasonable to expect that Mozambique is not going to show up as part of a word, so the word-break comparison is maybe not so useful.


It would be possible to define a different substring matching rule based, for example, on the Unicode Technical Report which suggests a word break algorithm. However, that would be a lot more work for implementors, and in any event I don't believe that report is normative. It would certainly not be a good idea to attempt to encode in a formal specification something like the behaviour of Perl's zero-length word-break assertions.

The modest suggestion I made struck me as a reasonable sort of compromise between utility and ease of specification/implementation. But it was just a little suggestion; take it for what it's worth.

What we need to do is treat spaces in substrings in
relationship to other substrings and the attribute
value.


Sure. That's more or less what I was getting at. It was just one way of working such a relationship.


I do think that the distinction between spaces (as per strprep, "not followed by a combining character") and U+0020 is important, although not world-shattering. There will be U+0020 U+0301 sequences post-strprep, and they really should not cause false matches. Treating them as the same character in a (substring) search but handling them differently during insignificant space removal is one of those things that comes back to bite you.
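To make the distinction concrete, here is a small Python sketch that separates a plain U+0020 from a U+0020 acting as the base of a combining mark. It assumes "combining character" can be approximated by the Unicode general categories Mn, Mc and Me, which may not be exactly the set the draft intends. The function names are hypothetical.

```python
import unicodedata


def is_combining(ch):
    """Approximate 'combining character' as any Unicode mark (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def true_space_indexes(text):
    """Indexes of U+0020 characters that are not followed by a combining mark."""
    result = []
    for i, ch in enumerate(text):
        if ch != " ":
            continue
        followed_by_mark = i + 1 < len(text) and is_combining(text[i + 1])
        if not followed_by_mark:
            result.append(i)
    return result
```

For the string "a", U+0020, U+0301, "b", U+0020, "c", only the second U+0020 (index 4) counts as a space in this sense; the first one carries the accent and would be left alone by both the search and the insignificant space handling.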

R.