Re: LDAPprep: mapping of " " values



Rats, sending email too late again. The example should have *started*
with the string U+0020 U+0301, rather than containing it as an interior
substring. It is a demonstration that a non-empty string can *start* with
a space (although not end with one).

Combining marks, in general, create some interesting edge cases for
substring matching, which is the only place where the decision to
use NFKC rather than NFKD makes a significant difference. The most
significant difference will come with substring matches in Korean;
form NFKC requires Hangul Jamo to be composed into Hangul syllables,
which means that the individual Jamo are not available for substring
matching. I don't speak Korean so I cannot tell if a native speaker
would consider it desirable to, for example, do a substring match
on an initial consonant rather than having to create an "or" of the
several hundred syllables which start with that consonant.
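
To make the Korean case concrete, here is a rough Python sketch using the
standard unicodedata module. The helper name contains() and the sample
word are my own choices for illustration and do not come from any draft;
the sketch only shows how the normalization form changes what a substring
test can see.

```python
import unicodedata

def contains(value, fragment, form):
    """Substring test after normalizing both operands to the same form."""
    norm_value = unicodedata.normalize(form, value)
    norm_fragment = unicodedata.normalize(form, fragment)
    return norm_fragment in norm_value

# Two precomposed Hangul syllables: U+D55C U+AE00.
word = "\ud55c\uae00"

# HANGUL CHOSEONG HIEUH, the initial consonant of the first syllable.
initial = "\u1112"

# Under NFKD the syllables are split into conjoining Jamo, so the
# initial consonant is visible to a substring match.
found_nfkd = contains(word, initial, "NFKD")

# Under NFKC the Jamo are composed into syllables, so the lone
# consonant no longer appears anywhere in the normalized value.
found_nfkc = contains(word, initial, "NFKC")

print(found_nfkd, found_nfkc)
```

With NFKC, matching on an initial consonant would need the "or" over all
syllables beginning with that consonant, as described above.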

On 16-Nov-04, at 2:01 AM, Rici Lake wrote:


On 16-Nov-04, at 1:38 AM, Steven Legg wrote:

Alternatively,
LDAPprep can just reduce consecutive whitespace to a single space in every
case and leave the syntaxes draft to nominate the circumstances under
which a leading or trailing space is to be removed.

This seems very sensible to me.

A value can only match (l= *) or (l=* ) if it is all whitespace.

Not quite true. According to the insignificant space deletion rule,
a space is only a candidate for deletion if it is not followed by
a combining mark. Consequently, the sequence U+0041 U+0020 U+0020 U+0301
will not be altered by ldapprep. (The string is LATIN CAPITAL LETTER A,
SPACE, SPACE, COMBINING ACUTE ACCENT.) One doesn't have to go out of one's
way to produce that sequence; the sequence U+0041 U+0020 U+00B4
(LATIN CAPITAL LETTER A, SPACE, ACUTE ACCENT) will be ldapprep'd into the
first sequence, as a result of the compatibility decomposition of U+00B4
into U+0020 U+0301. (That is, both strings render as "A ´" for anyone
with Unicode mail readers.)
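
The behaviour can be checked with a small Python sketch. This is only an
approximation written for illustration: it applies NFKC and then collapses
runs of spaces, treating a SPACE as collapsible only when it is not
followed by a combining mark (general category M). The function names are
hypothetical, and the other ldapprep steps (mapping, prohibited-character
checks, leading and trailing space handling) are left out.

```python
import unicodedata

def is_combining_mark(ch):
    """True for characters whose general category is Mn, Mc or Me."""
    return unicodedata.category(ch).startswith("M")

def collapse_spaces(s):
    """Reduce runs of collapsible spaces to a single space.

    A SPACE counts as collapsible only if the next character is not
    a combining mark; otherwise it is kept as the base of that mark.
    """
    out = []
    prev_was_space = False
    for i, ch in enumerate(s):
        followed_by_mark = i + 1 < len(s) and is_combining_mark(s[i + 1])
        is_space = ch == " " and not followed_by_mark
        if is_space and prev_was_space:
            continue  # drop the extra space in a run
        out.append(ch)
        prev_was_space = is_space
    return "".join(out)

def prep_sketch(s):
    """NFKC normalization followed by the simplified space collapsing."""
    return collapse_spaces(unicodedata.normalize("NFKC", s))

# U+0041 U+0020 U+00B4 becomes U+0041 U+0020 U+0020 U+0301, and the
# two spaces survive because the second one carries the accent.
with_accent = prep_sketch("A \u00b4")

# A lone ACUTE ACCENT yields a string that starts with a space.
leading = prep_sketch("\u00b4")

# Ordinary runs of spaces are still collapsed.
plain = prep_sketch("A   B")

print([f"U+{ord(c):04X}" for c in with_accent])
```

The second case is the one relevant to strings that start with a space:
the result is non-empty, begins with U+0020, and does not end with one.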


Consequently, even (l= * *) could match something, but I believe that
(l= * * ) is truly impossible.