
Re: [ldapext] UTF-8 full support in LDIF / LDIF v2




On Jun 18, 2009, at 3:57 AM, Michael Ströder wrote:

Kurt Zeilenga wrote:

IDNA went through all of this.  They found that they had to place
significant restrictions on Unicode domain components to ensure that a
domain name was well-formed Unicode text.

I'd like to learn more about the term "well-formed Unicode text". Do you
have a reference at hand? [NAMEPREP] and/or [STRINGPREP]?

Unfortunately, I don't have a good reference handy. The Unicode community might actually use different terms. I tried to loosely define the terms in a prior email. (I'm not a Unicode expert, just someone who has been involved in Unicode issues, such as with IDNA, for a number of years.)

To rephrase MY definitions:

"text" merely implies that the sequence of Unicode code points represents a sequence of characters. In my LDIF example, there is a colon followed by a combining code point. This is an example of a sequence which doesn't represent "text".
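
As a rough illustration of the colon-plus-combining-code-point case, the sketch below checks whether a value begins with a combining mark. The sample line and the helper name `starts_with_combining_mark` are hypothetical and are not taken from the LDIF example referred to above. The check relies on Python's standard `unicodedata` module and treats any code point in a Mark category (Mn, Mc, Me) as combining, which is only one possible reading of "doesn't represent text".

```python
import unicodedata

def starts_with_combining_mark(value: str) -> bool:
    """Return True if the first code point of value is a combining mark."""
    # General categories Mn, Mc and Me all start with "M".
    return bool(value) and unicodedata.category(value[0]).startswith("M")

# Hypothetical LDIF-like line: the value starts with U+0301 COMBINING ACUTE ACCENT,
# so the mark has no base character of its own and visually attaches to the colon.
line = "cn:\u0301abc"
attr, _, value = line.partition(":")

print(starts_with_combining_mark(value))   # True
print(starts_with_combining_mark("abc"))   # False
```

A check like this flags only the simplest problem; it says nothing about the further rules (such as bidirectional display) discussed next.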

"well-formed text" implies that not only is the sequence "text" but also that various other rules are met. For instance, the sequence will result in proper directional display of bidirectional text. There are some examples in the LDIF which show that the introduction of line wrapping can break the directional display of the value.



I found

http://www.unicode.org/versions/Unicode5.1.0/#Conformance_Changes

which contains a replacement for the text in Unicode5.0 standard.
(Strange that one cannot simply download the recent version.)

You can download each chapter of the current version (each has a front page detailing copying restrictions, etc.).


You have not suggested placing similar restrictions on LDIF but simply
removing the ASCII restriction.

Would it help to define similar restrictions?

First, I don't see how any of this helps in LDAP data interchange, the primary purpose of LDIF.

Second, if one were to say that the resulting file has to be Net-Unicode (which I think at least means the file is "text"), you run into "data loss" problems due to unintended transformations.
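
One way such an unintended transformation could arise is sketched below, under the assumption that conforming to Net-Unicode involves normalizing the file to NFC. The sample value is hypothetical. Normalization replaces "e" followed by U+0301 with the single code point U+00E9, so the octets that come out of the file differ from the octets that went in.

```python
import unicodedata

# Hypothetical attribute value stored in a directory: "e" + U+0301 COMBINING ACUTE ACCENT.
original = "Rene\u0301"

# Assumed transformation applied when the LDIF file is forced into normalized form.
normalized = unicodedata.normalize("NFC", original)

print(original.encode("utf-8"))    # b'Rene\xcc\x81'
print(normalized.encode("utf-8"))  # b'Ren\xc3\xa9'

# The two strings render alike but are different code point sequences,
# so a peer comparing octets would no longer see the original value.
print(original == normalized)      # False
```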

Stepping back a bit from the details of the interesting Unicode issues
posted here I wonder what the general strategy of the IETF regarding
these issues is?

Punt.

I remember discussions on the ietf-pkix mailing list
mentioning problems like these (e.g. when displaying subject names of
X.509 certs) without any real solution.

I think any system which takes (user) input, decodes it to a Unicode
code point sequence and displays it to the user is affected by issues
with BIDI, combining characters and duplicate Unicode points.

Yes. The IETF tends to punt such issues to the user interface development community. The IETF tends to restrict itself to the design of protocols, not the design of user interfaces (though the IETF does try to document user interface issues, especially those with security impact).

I think of LDIF as an alternative encoding of protocol data units, used for out-of-band transfer of data between protocol peers. That is, I punt the "user" as far as I can.

Others see LDIF as a user display format and user input format for LDAP data. I argue that LDIFv1 didn't handle this well for ASCII and that handling this for ASCII (without data loss) is hard. Solving it for Unicode, well, that's very, very hard.

-- Kurt
_______________________________________________
Ldapext mailing list
Ldapext@ietf.org
https://www.ietf.org/mailman/listinfo/ldapext