
Re: [ldapext] UTF-8 full support in LDIF / LDIF v2




On Jun 16, 2009, at 12:32 PM, Howard Chu wrote:

> Not to mention that to implement this properly will require complete schema knowledge at the time the LDIF is generated. (Otherwise, how do you distinguish a genuine octetString value, which cannot be safely represented in UTF-8, from a directoryString value...)

Well, one could scan the value to see whether the octets form a valid UTF-8 sequence of valid Unicode code points, just as today most implementations scan data for octets within SAFE-STRING. The significant difference is that the check is straightforward in LDIFv1, as it's an octet-by-octet check. But if we allow UTF-8 sequences of valid Unicode code points, each octet of the value must be checked to see that it's part of a valid UTF-8 sequence, and each UTF-8 sequence checked to see if it encodes a valid Unicode code point. And then wrapping becomes more complicated, etc.
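To make the difference concrete, here is a minimal sketch in Python of the two checks. The function names are made up for illustration, and the second check leans on Python's strict UTF-8 decoder, which rejects overlong forms, surrogates, and values above U+10FFFF, but does not reject noncharacters or unassigned code points. Whether that matches "valid Unicode code point" depends on the definition one settles on.

```python
# Sketch only: function names are illustrative, not from any LDIF library.

def is_safe_string(value: bytes) -> bool:
    """LDIFv1-style check against RFC 2849 SAFE-STRING: one test per octet."""
    if not value:
        return True
    # SAFE-INIT-CHAR: a SAFE-CHAR other than SPACE, colon, or less-than.
    if value[0] in b" :<":
        return False
    # SAFE-CHAR: any octet <= 0x7F except NUL, LF, and CR.
    return all(b <= 0x7F and b not in (0x00, 0x0A, 0x0D) for b in value)


def is_valid_utf8(value: bytes) -> bool:
    """Multi-octet check: every octet must belong to a well-formed sequence."""
    try:
        value.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    for sample in (b"funky", " leading space".encode(), "\u00f6".encode("utf-8"),
                   b"\xc3", b"\xc0\xaf"):
        print(sample, is_safe_string(sample), is_valid_utf8(sample))
```

Note that neither check addresses the schema question: an octetString value whose octets happen to form valid UTF-8 passes `is_valid_utf8` just as a directoryString value would.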

And even with all of that, LDIF would still not be well-formed Unicode text. And even if we solved that (by even more complex restrictions on what Unicode code point sequences can be represented as UTF-8 instead of base64-encoded UTF-8), we'd have the problem of unintended Unicode transformations in transporting LDIF. (We have this problem with LDIFv1, but it's generally limited to end-of-line characters. With UTF-8, data will be impacted. For instance, consider MUAs (or the like) that might convert (on send or receive) text to Net-Unicode.)
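As an illustration of the transformation problem, the following Python sketch shows how normalizing to NFC (which, as far as I recall, is the form Net-Unicode recommends) changes the octets of a value. The helper name is hypothetical, and an actual MUA may apply different or additional transformations.

```python
import unicodedata


def nfc_changes_octets(value: str) -> bool:
    """Return True if NFC normalization would alter the UTF-8 octets of value."""
    original = value.encode("utf-8")
    normalized = unicodedata.normalize("NFC", value).encode("utf-8")
    return original != normalized


if __name__ == "__main__":
    decomposed = "o\u0308"   # LATIN SMALL LETTER O + COMBINING DIAERESIS
    precomposed = "\u00f6"   # LATIN SMALL LETTER O WITH DIAERESIS
    print(decomposed.encode("utf-8"), nfc_changes_octets(decomposed))
    print(precomposed.encode("utf-8"), nfc_changes_octets(precomposed))
```

A directory that stored the decomposed form would get back a different octet string after such a hop, whereas a base64-encoded value would pass through untouched.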

I've expanded my UTF-8 LDIF with some more goofiness.

version: 2

dn: cn=funky
bom:
smiley-face:☺
# only SPACE is special
no-break-space:  
zero-width-space:â??
word-joiner:â? 
ideographic-space:ã??
zero-width-no-break-space:
# line separators and other such things
nel:Â?
ls:â?¨
ps:â?©
ff:
# these hyphens differ but may look the same
hyphen-minus:-
hyphen:‐
non-breaking-hyphen:‑
figure-dash:‒
en-dash:–
minus-sign:−
roman-uncia-sign:𐆒
# these differ but may look the same
o-diaeresis:ö
o-diaeresis-decomposed:ö
# ignorables
ignore:â? 
ignore:â?¡
ignore:­
# inside-out rule
inside-out:aÌ?Ì?̣̭
inside-out:�ึ�
# combining character
diaeresis:Ì?
# bidi
bidi:Ú?
bidi:Ù±ABÙ¹Ú?
bidi:Ù±37Ù¹Ú?
bidi:â?®ABCâ?¬
bidi:â?­Ù±Ù¹Ú?â?¬
# bidi wrapped
bidi:
 Ú?
bidi:Ù±A
 BÙ¹Ú?
bidi:Ù±3
 7Ù¹Ú?
bidi:â?®A
 BCâ?¬
bidi:â?­Ù±
 Ù¹Ú?â?¬
# private use
pu:î??î??î??
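
For the wrapped entries above, a writer has to fold between characters rather than between arbitrary octets. Below is a rough Python sketch of that, with hypothetical function names and an arbitrary default width. It only avoids splitting a UTF-8 sequence; it can still fold between a base character and a following combining mark, or inside a bidi run, which is part of why wrapping gets more complicated.

```python
def fold_utf8(line: str, width: int = 76) -> list[str]:
    """Fold an LDIF line so each physical line is at most `width` octets,
    breaking only between code points (never inside a UTF-8 sequence)."""
    pieces = []
    current = ""
    size = 0
    for ch in line:
        n = len(ch.encode("utf-8"))
        if current and size + n > width:
            pieces.append(current)
            current = " "  # a continuation line starts with a single SPACE
            size = 1
        current += ch
        size += n
    pieces.append(current)
    return pieces


def unfold(pieces: list[str]) -> str:
    """Join folded lines, discarding the one leading SPACE of each continuation."""
    return pieces[0] + "".join(p[1:] for p in pieces[1:])


if __name__ == "__main__":
    line = "bidi:\u0671AB\u0679\u0686"
    folded = fold_utf8(line, width=8)
    print(folded)
    print(unfold(folded) == line)
```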



-- Kurt
_______________________________________________
Ldapext mailing list
Ldapext@ietf.org
https://www.ietf.org/mailman/listinfo/ldapext