Sorry, I lost this:
Harald Tveit Alvestrand writes:
> I recommend basing the grammar on octets, and saying that you define
> it like this.
I agree, plus a mention that this is after the file has been converted to UTF-8 if it was encoded differently (as in my `what is a character?' point). Also: Lines must not be wrapped in the middle of a "multi-octet UTF-8 character" (or whatever is the proper phrase), so UTF-8 LDIF files can be printed/edited by a program which handles UTF-8.
I agree. If you want a grammar for UTF-8, here's one from ACAP:
UTF8-1 = %x80-BF
UTF8-2 = %xC0-DF UTF8-1
UTF8-3 = %xE0-EF 2UTF8-1
UTF8-4 = %xF0-F7 3UTF8-1
UTF8-5 = %xF8-FB 4UTF8-1
UTF8-6 = %xFC-FD 5UTF8-1
UTF8-CHAR = TEXT-UTF8-CHAR / CR / LF
SAFE-UTF8-CHAR = SAFE-CHAR / UTF8-2 / UTF8-3 / UTF8-4 / UTF8-5 / UTF8-6
(See the RFC for SAFE-CHAR; you will probably want to roll your own.)
You can then say that folding cannot occur inside a UTF8-CHAR.
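To make the folding rule concrete, here is a rough Python sketch of a folder that works on octets and never cuts inside a multi-octet character. It relies only on the UTF8-1 range above (%x80-BF) to recognise trailing octets. The names is_continuation and fold_octets, the default width of 76, and the convention that a continuation line begins with a single space are my own assumptions for illustration, not something taken from the draft.

```python
def is_continuation(octet: int) -> bool:
    """True for an octet in the UTF8-1 range %x80-BF (a trailing octet)."""
    return 0x80 <= octet <= 0xBF


def fold_octets(line: bytes, width: int = 76) -> list[bytes]:
    """Split `line` into pieces of at most `width` octets each.

    A cut is only made in front of the first octet of a character.
    Every piece after the first is prefixed with one space (assumed
    continuation marker), and that space counts toward `width`.
    """
    if width < 7:
        # A continuation line must hold the space plus a 6-octet character.
        raise ValueError("width must be at least 7")
    pieces = []
    start = 0
    limit = width  # the first piece carries no leading space
    while len(line) - start > limit:
        cut = start + limit
        # Back up while the cut would land on a trailing octet.
        while cut > start and is_continuation(line[cut]):
            cut -= 1
        if cut == start:
            # Malformed input (a long run of trailing octets): hard cut.
            cut = start + limit
        pieces.append(line[start:cut])
        start = cut
        limit = width - 1  # leave room for the leading space
    pieces.append(line[start:])
    return [pieces[0]] + [b" " + p for p in pieces[1:]]
```

Unfolding is then just a matter of dropping the first octet of every piece after the first and concatenating, and each folded piece remains valid UTF-8 on its own, so a UTF-8 aware editor or printer can display it.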
Harald
-- Harald Tveit Alvestrand, Maxware, Norway Harald.Alvestrand@maxware.no