UTF-8 support, take 2
Well, I have been digging deeper into the problem. I now know a little
bit more about T.61, Unicode and several ISO 8859 charsets. I also
understand libraries/libldap/charset.c a little better.
I am working towards the following scenario:
- The core in the LDAP servers (slapd and ldapd, but I cannot do any
testing of the latter) works in Unicode, UTF-8 most likely.
- These servers know enough of Unicode to do uppercasing for
case-insensitive searches and indexing, and reduction of
characters to their basic form (say, 'a' from 'a acute') for
approximate searching. I know this is not right for all
languages, but it is better than what we have now, and solving the
problem requires knowing the language (not just the charset) a string
or substring thereof is in, information not
currently available. Consider this a start.
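To make the uppercasing and reduction idea concrete, here is a minimal
sketch in C. The table contents and the names uc_reduce/uc_toupper are
my own illustration, covering only a handful of Latin-1 code points;
this is not a proposal for the actual tables:

```c
#include <stddef.h>

/* Illustrative reduction table: maps a few Unicode code points to
 * their basic form for approximate matching.  A real table would
 * cover far more of Unicode. */
struct reduce_entry { unsigned long from; unsigned long to; };

static const struct reduce_entry reduce_tab[] = {
	{ 0x00E1, 'a' },	/* a acute -> a */
	{ 0x00E9, 'e' },	/* e acute -> e */
	{ 0x00F1, 'n' },	/* n tilde -> n */
	{ 0x00C1, 'A' },	/* A acute -> A */
};

/* Reduce a code point to its basic form; identity if not in the table. */
unsigned long uc_reduce( unsigned long c )
{
	size_t i;
	for ( i = 0; i < sizeof(reduce_tab)/sizeof(reduce_tab[0]); i++ )
		if ( reduce_tab[i].from == c )
			return reduce_tab[i].to;
	return c;
}

/* Uppercase: ASCII plus the Latin-1 letter range as an example.
 * 0x00F7 is the division sign, not a letter.  A real implementation
 * would use full Unicode case tables. */
unsigned long uc_toupper( unsigned long c )
{
	if ( c >= 'a' && c <= 'z' )
		return c - 'a' + 'A';
	if ( c >= 0x00E0 && c <= 0x00FE && c != 0x00F7 )
		return c - 0x20;
	return c;
}
```

Note that both operations work per code point, which is exactly why
they are language-blind: the right reduction for 'a acute' differs
between, say, Spanish and some Nordic conventions.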
- The backends *may* use a different encoding for storage, and
help is provided for this; however, it is not recommended
unless there are serious reasons for it.
- The servers may listen on several ports, and a different
charset may be used on each.
- The default configuration is T.61 for V2 and UTF-8 for V3;
this can be overridden by the admin. The server
can determine without error which protocol version is being
used (or so the V3 RFCs claim). If the client binds
first, the bind specifies the version; if it doesn't bind, it is V3.
- The servers only translate attributes of non-binary syntax.
- Translation from charsets can work in several configurable
modes (see later).
- The library (libldap) can also do translation, but only
T.61 or UTF-8 are supported on the wire, with a local charset
on the API side. Alternatively, translation can be disabled,
in which case the API talks the same charset as the protocol,
whatever it is. There is the problem of binary attributes,
which cannot be properly solved at this level. If translation
is active, the API will provide a yet-to-be-decided method to
disable it for specific attributes.
- The code will minimize the performance hit of all this for
cases where the complete generality is not needed. In other
words, it will special-case for ASCII (IA5) and ISO 8859-1
(Latin1) where appropriate.
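The ASCII special case can be as cheap as a single byte scan, since a
string with no byte above 0x7F is identical in ASCII, Latin-1 and
UTF-8 and needs no translation at all. A sketch (the function name is
illustrative):

```c
#include <stddef.h>

/* Fast path: return nonzero if the buffer contains only bytes below
 * 0x80, i.e. plain ASCII, which is valid unchanged in Latin-1 and
 * UTF-8 alike. */
int str_is_ascii( const char *s, size_t len )
{
	size_t i;
	for ( i = 0; i < len; i++ )
		if ( (unsigned char) s[i] & 0x80 )
			return 0;
	return 1;
}
```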
- The supported charsets may be specified at compile time,
so that unneeded charsets do not unnecessarily increase
memory use. Another possible approach is to load tables
from disk as needed; it seems Mozilla does that. In any case,
Unicode is big, and it may be necessary to have a big chunk of it
accessible somehow.
As far as translation is concerned, there is the problem of characters
that exist in the source charset and don't exist in the target charset.
The current code (in charset.c) translates, unconditionally, those
characters to a form that is representable in the target charset.
However, this transformation is currently inconsistent (see code points
0xA6 and 0xA8 in T.61), non-reversible (e.g. the way '{' is dealt with),
difficult to extend to more code points in Unicode and not necessarily a
desired feature under all scenarios.
I propose to have two orthogonal, independently chosen options. The
first controls whether this latter kind of translation is done at
all. The second controls whether a "best-effort" translation is
acceptable or whether the operation should fail when a character code
that cannot be translated into the target charset is found on input. Note
that even if an attempt is made to translate non-representable codes to
some form, Unicode is so large that the possibility of finding something
that is not understood is significant. This may include the case of
malformed T.61 or UTF-8 sequences. Read-only scenarios may be liberal
in their translations, but when updates are possible, it may be more
convenient to be strict to minimize the chances of data corruption or
misinterpretation.
So, for the library, there are four options:
- Charset on wire, defaulted appropriately (this cannot be
determined until we know which protocol version we are talking)
- Charset on API, maybe defaulted from the locale
- Translate non-mappable chars
- Accept errors
Translation is the identity mapping if both charsets are equal.
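A sketch of how the four options and the two failure behaviors might
fit together. All names, the struct layout and the result codes are
hypothetical illustrations of the proposal above, not a proposed API:

```c
#include <stddef.h>

/* Hypothetical per-connection options, one field per item in the
 * list above. */
struct charset_opts {
	const char *wire_charset;	/* NULL: default from protocol version */
	const char *api_charset;	/* NULL: default from the locale */
	int translate_unmappable;	/* substitute a best-effort form */
	int accept_errors;		/* keep going past untranslatable input */
};

#define XLATE_OK	0
#define XLATE_SUBST	1	/* a best-effort form was substituted */
#define XLATE_FAIL	(-1)	/* caller should abort the operation */

/* Decision logic for one code point that has no mapping in the
 * target charset.  Real code would consult a table instead of the
 * '?' stand-in. */
int xlate_unmappable( const struct charset_opts *opt,
	unsigned long in, unsigned long *out )
{
	if ( opt->translate_unmappable ) {
		*out = '?';		/* stand-in best-effort form */
		return XLATE_SUBST;
	}
	if ( opt->accept_errors ) {
		*out = in;		/* pass through untouched */
		return XLATE_SUBST;
	}
	return XLATE_FAIL;	/* strict mode: fail the operation */
}
```

The strict combination (both flags off) is the one I would expect
update-capable deployments to choose, per the data-corruption argument
above.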
For each server port, there are similar options. The ldapd will talk
T.61 with the X.500 DAP servers. The backends are on their own on this.
They will have access to the routines, but that's all.
This, of course, won't be available immediately, but rather in the
window for OpenLDAP 2.0.
Comments, please.
Now, a request. I need some info on T.61. In particular:
- Clarification on the meaning of code points 0xA6 and 0xA8
- What are code points 0xD8 to 0xDB and 0xE5?
- What diacritics are represented by 0xC0, 0xC9 and 0xCC?
- The list of known digraphs (the {xy} forms)
- Any useful Web-accessible resource
Would some kind soul with access to the standard provide any help in
this direction? Please.
Thanks in advance,
Julio