Re: [ldapext] UTF-8 full support in LDIF / LDIF v2
On Jun 18, 2009, at 3:57 AM, Michael Ströder wrote:
Kurt Zeilenga wrote:
IDNA went through all of this. They found that they had to place
significant restrictions on Unicode domain components to ensure that a
domain name was well-formed Unicode text.
I'd like to learn more about the term "well-formed Unicode text". Do you
have a reference at hand? [NAMEPREP] and/or [STRINGPREP]?
Unfortunately, I don't have a good reference handy. The Unicode
community might actually use different terms. I tried to loosely
define the terms in a prior email. (I'm not a Unicode expert, just
someone who's been involved in Unicode issues (such as with IDNA) for a
number of years.)
To rephrase MY definitions:
"text" merely implies that the sequence of Unicode code points
represents a character. In my ldif example, there is a colon followed
by a combining code point. This is an example of a sequence which
doesn't represent "text".
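A minimal sketch of how such a sequence could be detected, assuming a hypothetical LDIFv2 line that carries raw Unicode after the colon. The function name and the sample lines are illustrative only and are not taken from the LDIF examples posted earlier.

```python
import unicodedata

def value_starts_with_combining_mark(ldif_line):
    """Return True if the first code point after 'attr:' is a combining mark.

    Assumption: the line is a simple 'attr:value' line with raw Unicode,
    not base64 ('::') or URL ('<') forms.
    """
    _attr, sep, value = ldif_line.partition(":")
    if not sep or not value:
        return False
    # General categories Mn, Mc and Me are the combining marks.
    return unicodedata.category(value[0]).startswith("M")

# Hypothetical lines: U+0301 COMBINING ACUTE ACCENT directly after the colon
# would attach to the colon itself when the line is rendered.
print(value_starts_with_combining_mark("cn:\u0301abc"))
print(value_starts_with_combining_mark("cn: abc"))
```

The check only looks at the code point adjacent to the colon; it does not try to decide whether the rest of the value is "text".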
"well-formed text" implies that not only is the sequence "text" but
that various other rules are met. For instance, the sequence will
result in proper directional display of bidirectional text. There are
some examples in the LDIF which show that introduction of line
wrapping can break the directional display of the value.
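As a rough sketch of the wrapping problem, the following folds a line the way LDIF does (continuation lines begin with a single space) and reports whether the first fold point lands inside a run of right-to-left characters. It assumes a hypothetical LDIFv2 that allows raw Unicode values; the function names and the Hebrew sample value are illustrative only.

```python
import unicodedata

def fold(line, width):
    """Fold one logical line; each continuation line starts with one space."""
    if width < 2:
        raise ValueError("width must be at least 2")
    parts = [line[:width]]
    rest = line[width:]
    while rest:
        parts.append(" " + rest[:width - 1])
        rest = rest[width - 1:]
    return parts

def unfold(parts):
    """Reverse fold(): drop the leading space of each continuation line."""
    return parts[0] + "".join(p[1:] for p in parts[1:])

def fold_splits_rtl_run(line, width):
    """True if the first fold point falls between two right-to-left characters."""
    if len(line) <= width:
        return False
    rtl = {"R", "AL"}  # bidi classes for Hebrew-like and Arabic letters
    return (unicodedata.bidirectional(line[width - 1]) in rtl
            and unicodedata.bidirectional(line[width]) in rtl)

line = "cn: \u05d0\u05d1\u05d2\u05d3"  # four Hebrew letters as the value
print(fold(line, 6))
print(fold_splits_rtl_run(line, 6))
```

Unfolding recovers the code point sequence exactly, so the data survives; what changes is how each physical line is displayed once the right-to-left run is split across two of them.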
I found
http://www.unicode.org/versions/Unicode5.1.0/#Conformance_Changes
which contains a replacement for the text in Unicode5.0 standard.
(Strange that one cannot simply download the recent version.)
You can download each chapter of the current version (each has a front
page detailing copying restrictions, etc.).
You have not suggested placing similar restrictions on LDIF but simply
removing the ASCII restriction.
Would it help to define similar restrictions?
First, I don't see how any of this helps in LDAP data interchange, the
primary purpose of LDIF.
Second, if one were to say that the resulting file has to be
Net-Unicode (which I think at least means the file is "text"), you run
into "data loss" problems due to unintended transformations.
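One way to picture such a transformation is Unicode normalization. This is a sketch under the assumption that the file is normalized to NFC on its way to becoming Net-Unicode; the message does not say which transformations are meant, and the function name and sample value are illustrative only.

```python
import unicodedata

def nfc_changes_octets(value):
    """Return True if NFC normalization alters the UTF-8 octets of the value."""
    normalized = unicodedata.normalize("NFC", value)
    return normalized.encode("utf-8") != value.encode("utf-8")

# 'e' followed by U+0301 COMBINING ACUTE ACCENT (decomposed form).
original = "cafe\u0301"
print(nfc_changes_octets(original))
print(original.encode("utf-8"))
print(unicodedata.normalize("NFC", original).encode("utf-8"))
```

Both forms display the same, but the octets differ, so a peer that compares the stored value octet by octet no longer sees the value that was exported.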
Stepping back a bit from the details of the interesting Unicode issues
posted here I wonder what the general strategy of the IETF regarding
these issues is?
Punt.
I remember discussions on the ietf-pkix mailing list
mentioning problems like these (e.g. when displaying subject names of
X.509 certs) without any real solution.
I think any system which takes (user) input, decodes it to a Unicode
code point sequence and displays it to the user is affected by issues
with BIDI, combining characters and duplicate Unicode code points.
Yes. The IETF tends to punt such issues to the user interface
development community. The IETF tends to restrict itself to design of
protocols, not design of user interfaces (though the IETF does try to
document user interface issues, especially those with security impact).
I think of LDIF as an alternative encoding of protocol data units,
used for out-of-band transfer of data between protocol peers. That is, I
punt the "user" as far as I can.
Others see LDIF as a user display format and user input format for
LDAP data. I argue that LDIFv1 didn't handle this well for ASCII and
that handling this for ASCII (without data loss) is hard. Solving it
for Unicode, well, that's very, very hard.
-- Kurt