
Re: [ldapext] UTF-8 full support in LDIF / LDIF v2



Another issue...

Though LDIFv1 doesn't actually allow multi-byte UTF-8 anywhere (except in base64-encoded strings), RFC 2849 did say: "Implementations SHOULD NOT fold lines in the middle of a multi-byte UTF-8 character." (This was a leftover, I think, from when non-base64 encodings of UTF-8 were being considered for LDIFv1. That is, the IETF seems to have already considered this issue when it was developing LDIFv1 but chose to require base64 encoding of multi-byte UTF-8 characters.)

Anyway, if we remove the ASCII restriction and leave the above SHOULD as is, implementations would have to deal with folding in the middle of multi-byte UTF-8 sequences. That means the file as a whole might not be valid UTF-8, which would be problematic.

If we do remove the ASCII restriction (which I don't support), I think we need to install a couple of restrictions:

1) The resulting file MUST contain only valid UTF-8 (minimal-length encodings, a start octet must be followed by continuation octets, etc.) AND each sequence MUST encode a valid Unicode code point (U+0000 through U+D7FF, U+E000 through U+10FFFF, inclusive), excluding the NUL (U+0000) code point.

2) Lines longer than 76 octets MUST be folded. Folding of lines MUST NOT occur in the middle of a multi-byte UTF-8 character. Implementations need to take care that both of these requirements are met.

Both of these restrictions will significantly increase the complexity of LDIF encoders. Though the first is similar to the current requirement to ensure that values encoded as SAFE-STRING meet all of the SAFE-STRING requirements, those checks are octet by octet. Checking for valid UTF-8 and Unicode requires not only expanding each UTF-8 sequence but also checking the code point that sequence represents. And the second requires a bit more math, something experience has shown implementors tend to get wrong.

We'd also need text explaining that the format does not ensure the UTF-8 represents well-formed text, or even text, and hence may not be displayable or editable.
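To make the two restrictions concrete, here is a rough Python sketch of what an encoder would have to do. It is only an illustration, not proposed text: the names `check_octets` and `fold` are made up for this example, it relies on Python's strict UTF-8 decoder for the validity check, and it assumes that the single space starting each continuation line counts toward the 76-octet limit (the restriction as stated above does not settle that).

```python
LIMIT = 76  # maximum octets per physical line, from restriction 2


def check_octets(data: bytes) -> None:
    """Raise ValueError unless data is valid UTF-8 containing no NUL."""
    try:
        # Python's strict decoder rejects non-minimal (overlong) encodings,
        # misplaced or missing continuation octets, surrogates
        # (U+D800..U+DFFF), and anything above U+10FFFF.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(f"invalid UTF-8: {exc}") from exc
    if b"\x00" in data:
        raise ValueError("NUL (U+0000) is not permitted")


def fold(line: bytes, limit: int = LIMIT) -> list[bytes]:
    """Fold one logical line into physical lines of at most `limit` octets.

    Continuation lines begin with a single space. Assumption: that space
    counts toward the limit.
    """
    if limit < 5:
        # A 4-octet character plus the leading space must fit on one line.
        raise ValueError("limit too small to hold a 4-octet character")
    check_octets(line)

    out = []
    prefix = b""  # becomes b" " for every continuation line
    pos = 0
    while len(line) - pos > limit - len(prefix):
        cut = pos + limit - len(prefix)
        # Never cut in front of a continuation octet (10xxxxxx): back up
        # until the cut falls on a character boundary.
        while (line[cut] & 0xC0) == 0x80:
            cut -= 1
        out.append(prefix + line[pos:cut])
        pos = cut
        prefix = b" "
    out.append(prefix + line[pos:])
    return out
```

Note that the folding loop has to track two different budgets (the first line versus continuation lines) and has to back up by a variable number of octets. That is the "bit more math" I am worried about. An encoder that simply cuts every 76 octets gets ASCII-only input right and silently produces invalid UTF-8 on everything else.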
Even where the data is displayable, users must be careful not to rely on visual inspection. Also, users must be careful to avoid passing LDIF data through systems that might apply Unicode normalization or other transformations. Such systems will likely introduce unintended changes to the LDAP data represented by the original LDIF (and/or cause the data to be non-conforming LDIF).

For instance, when emailing an LDIF file, both the sending and receiving users must be careful that their MUAs (and MTAs) are not altering the LDIF data. Unfortunately, many MUAs and MTAs do alter text attachments (to conform to various conventions). While such alterations may be desirable for text attachments, they are not desirable for LDIF data (which is commonly sent as a text attachment).

By raising these points, I hope to show that simply removing the ASCII restriction will lead to problems. Even though some of them (such as invalid UTF-8) can be addressed by installing various restrictions, the simple fact that LDIF represents LDAP data, not text, will lead to various problems (such as unintended text file conversions, inability to use text processing programs, etc.).
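As a small illustration of the normalization and visual-inspection concerns, the Python snippet below shows one value whose octets change under normalization while its displayed form typically does not. The attribute value is invented for the example, and NFC is used only as one plausible transformation an intermediary might apply; nothing here says any particular MUA or MTA does exactly this.

```python
import unicodedata

# Hypothetical value: "e" followed by U+0301 COMBINING ACUTE ACCENT.
original = "cn: Rene\u0301"

# One transformation an intermediary might apply (an assumption for this
# example): NFC composes the pair into the single code point U+00E9.
normalized = unicodedata.normalize("NFC", original)

# The two strings usually render identically, but the octets differ, so the
# LDAP value represented by the LDIF has been changed in transit.
original_octets = original.encode("utf-8")
normalized_octets = normalized.encode("utf-8")
changed = original_octets != normalized_octets
```

Here the value shrinks from 10 octets to 9, and a user looking at the two files in an editor would most likely see no difference.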
-- Kurt
_______________________________________________
Ldapext mailing list
Ldapext@ietf.org
https://www.ietf.org/mailman/listinfo/ldapext