
Unicode string profiles for LDAP



The LDAPv3 technical specification ambiguously defines matching
rules which act upon values of Unicode-based syntaxes such as
DirectoryString.  For example, the specification does not state
the direction in which characters should be folded when doing
case ignore matching.  Unlike IA5 matching, the direction matters
for Unicode strings.
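
To illustrate why the direction matters, here is a small Python sketch.
It uses Python's built-in str.upper() and str.lower() as stand-ins for
the two folding directions; this is only an assumption for illustration,
not a proposal for how LDAP should fold.

```python
def fold_upper(value: str) -> str:
    # Fold toward upper case (one possible direction).
    return value.upper()


def fold_lower(value: str) -> str:
    # Fold toward lower case (the other direction).
    return value.lower()


a, b = "Straße", "STRASSE"

# LATIN SMALL LETTER SHARP S upper-cases to "SS", so the two values
# compare equal when folded up but not when folded down.
upper_match = fold_upper(a) == fold_upper(b)
lower_match = fold_lower(a) == fold_lower(b)
```

With IA5 (ASCII) values the two directions always agree, which is why
the current text never had to say which one to use.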

draft-hoffman-stringprep-02.txt details three areas in which we need
to profile Unicode string handling: input mapping, normalization,
and output prohibitions.  We also need to address transliteration
issues.  In this post, I've provided some thoughts in this area.

I think the specifications of Directory String and other
character string syntaxes should continue to allow transfer
of the full repertoire of characters (including unassigned
characters) and that the burden of preparation be placed upon
the application (e.g., the server) which is doing the comparison.
This ensures that LDAP can continue to be updated to support
additions to the character sets with minimal impact to existing
implementations.  It also allows introduction of matching rules
which have conflicting preparation requirements.

Transliteration:  We currently don't mandate transliteration
support.  For implementations which do transliteration, we need to detail
how transliteration is to be done (or say it's implementation
specific).  I suggest we say that comparisons between different
CHOICEs are to be performed after both strings have been transliterated
to Unicode.  We then need to specify how each of the CHOICEs is to be
transliterated to Unicode.  And, as I think we should avoid per CHOICE
preparation requirements, we should state that comparisons between same
CHOICEs are to behave as if the strings had been transliterated to Unicode.
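
A rough sketch of "transliterate both sides to Unicode, then compare"
in Python follows.  The CHOICE-to-codec table and the function names
are my own assumptions for illustration.  teletexString is left out
because Python ships no T.61 codec, and that is the CHOICE whose
transliteration most needs to be specified.

```python
# Hypothetical table: DirectoryString CHOICE name -> Python codec.
# These pairings are assumptions made for this sketch only.
CHOICE_CODECS = {
    "printableString": "ascii",
    "bmpString": "utf-16-be",
    "universalString": "utf-32-be",
    "utf8String": "utf-8",
}


def to_unicode(choice: str, octets: bytes) -> str:
    # Transliterate the encoded value of one CHOICE to a Unicode string.
    return octets.decode(CHOICE_CODECS[choice])


def choices_equal(a: tuple, b: tuple) -> bool:
    # Compare two (choice, octets) pairs after transliteration.
    # Same-CHOICE comparisons take the same path, so they behave
    # as if they had been transliterated as well.
    return to_unicode(*a) == to_unicode(*b)
```

For example, choices_equal(("bmpString", "Kurt".encode("utf-16-be")),
("utf8String", b"Kurt")) compares the two values as Unicode strings
rather than as octets.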

Mapping of input:  select "invisible" characters should be
mapped out (removed) prior to normalization, including soft
hyphen and such.  For case insensitive matching rules, upper
case characters will be mapped to lower case.
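
As a sketch of the mapping step in Python: the set of characters
mapped out below is only a placeholder (soft hyphen plus zero width
space); the real list would have to come from the profile.  The
function and constant names are hypothetical.

```python
# Placeholder set of "invisible" characters to remove; the actual
# set is for the profile to define.
MAPPED_OUT = {
    "\u00ad",  # SOFT HYPHEN
    "\u200b",  # ZERO WIDTH SPACE
}


def map_input(value: str, case_ignore: bool = False) -> str:
    # Remove mapped-out characters, then fold upper case to lower
    # case when the matching rule is case insensitive.
    kept = "".join(ch for ch in value if ch not in MAPPED_OUT)
    return kept.lower() if case_ignore else kept
```

So map_input("Co\u00adOp", case_ignore=True) and map_input("coop",
case_ignore=True) yield the same string, while the case exact form
keeps "CoOp".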

Normalization: to Unicode Normalization Form KC (NFKC).
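
In Python this step is available directly from the standard library,
so a sketch is short.  The wrapper name is hypothetical, and the
result depends on the Unicode version bundled with the interpreter.

```python
import unicodedata


def normalize_kc(value: str) -> str:
    # Apply Unicode Normalization Form KC: compatibility
    # decomposition followed by canonical composition.
    return unicodedata.normalize("NFKC", value)


# A compatibility ligature and a combining sequence both normalize
# to the same strings as their plain spellings.
ligature_ok = normalize_kc("\ufb01le") == "file"
accent_ok = normalize_kc("e\u0301") == "\u00e9"
```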

Space: character string matching rules ignore "insignificant"
spaces.  We need to define which code points are considered spaces.
We likely should map all spaces to SPACE (U+0020) on input.
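
A possible shape for the space handling, in Python.  Two assumptions
here: that "space" means any character in general category Zs, and
that "insignificant" means leading, trailing, and repeated inner
spaces.  Both are exactly the kind of thing the spec needs to pin
down.

```python
import re
import unicodedata


def map_spaces(value: str) -> str:
    # Map every space separator (category Zs, an assumption) to SPACE.
    return "".join(
        " " if unicodedata.category(ch) == "Zs" else ch for ch in value
    )


def squeeze_spaces(value: str) -> str:
    # Collapse runs of SPACE and drop leading and trailing SPACE.
    return re.sub(r" {2,}", " ", map_spaces(value)).strip(" ")
```

With this, a value written with NO-BREAK SPACE or IDEOGRAPHIC SPACE
compares the same as one written with plain SPACE.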

Unassigned characters:  The output cannot contain any unassigned
characters.  Hence, an assertion involving any unassigned character
will be Undefined.  The spec will list the set of unassigned characters
which all implementations MUST recognize as such.  When new characters
are assigned in Unicode, this specification can then be updated.  This
will cause some comparisons which previously evaluated to Undefined to
evaluate to True or False.
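
A sketch of the Undefined behavior in Python, using None for
Undefined.  Note the difference from what I propose above: this
sketch asks the interpreter's Unicode tables (general category Cn)
which characters are unassigned, whereas the spec would carry a
fixed list.  The function names are hypothetical.

```python
import unicodedata
from typing import Optional


def has_unassigned(value: str) -> bool:
    # Category Cn covers code points with no assignment in the
    # Unicode version this interpreter was built with.
    return any(unicodedata.category(ch) == "Cn" for ch in value)


def equality_match(attr_value: str, assertion_value: str) -> Optional[bool]:
    # Return True or False, or None (Undefined) when either prepared
    # value contains an unassigned character.
    if has_unassigned(attr_value) or has_unassigned(assertion_value):
        return None
    return attr_value == assertion_value
```

When a later Unicode version assigns a code point, the same call
starts returning True or False instead of None, which is the
behavior change described above.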

Currently ORDERING would remain implementation specific.

IMO, the clarifications of attribute/assertion value handling in
comparisons should be done in 2252bis where the syntaxes and matching
rules are specified.  It may also be appropriate to include in 2252bis
(non-normative appendix?) a discussion of the design choices made as
the choices may not be obvious to the reader.

Kurt