
UTF-8 support, take 2



Well, I have been digging deeper into the problem.  I now know a little
bit more about T.61, Unicode and several 8859 charsets.  I also
understand a little better libraries/libldap/charset.c.

I am working towards the following scenario:

	- The core in the LDAP servers (slapd and ldapd, but I cannot
	  do any testing of the latter) works in Unicode, most likely
	  UTF-8.
	- These servers know enough of Unicode to do uppercasing for
	  case-insensitive searches and indexing, and reduction of
	  characters to their basic form (say 'a' from 'a acute') for
	  approximate searching.  I know this is not right for all
	  languages, but it is better than what we have now, and
	  solving the problem requires knowing the language (not just
	  the charset) a string or substring thereof is in,
	  information not currently available.  Consider this a start.
	- The backends *may* use a different encoding for storage, and
	  help is provided for this; however, it is not recommended
	  unless there are serious reasons for it.
	- The servers may listen on several ports, and a different
	  charset may be used on each.
	- The default configuration is T.61 for V2 and UTF-8 for V3;
	  however, this can be overridden by the admin.  The server
	  can determine what protocol version is being used without
	  error (or so is claimed in the V3 RFCs).  If the client
	  binds first, it specifies the version.  If it doesn't, it
	  is V3.
	- The servers only translate attributes of non-binary syntax.
	- Translation between charsets can work in several
	  configurable modes (see below).
	- The library (libldap) can also do translation, but only
	  between T.61 or UTF-8 on the wire and a local charset on
	  the API side.  Alternatively, translation can be disabled,
	  in which case the API talks the same charset as the
	  protocol, whatever it is.  There is the problem of binary
	  attributes, which cannot be properly solved at this level.
	  If translation is active, the API will provide a
	  yet-to-be-decided method to have it disabled for specific
	  attributes.
	- The code will minimize the performance hit of all this for
	  cases where the complete generality is not needed.  In other
	  words, it will special-case for ASCII (IA5) and ISO 8859-1
	  (Latin1) where appropriate.
	- During compilation, the supported charsets may be specified
	  so that unneeded charsets do not unnecessarily increase
	  memory usage.  Another possible approach is to load tables
	  from disk as needed.  It seems Mozilla does that.  Anyway,
	  Unicode is big and it may be necessary to have a big chunk
	  of it accessible somehow.
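The uppercasing and basic-form reduction described above can be sketched
like this in Python (an illustration only, not slapd code; the function
names are my own invention, and as noted, real case handling is
language-dependent):

```python
import unicodedata

def basic_form(s):
    """Reduce characters to their base form for approximate matching,
    e.g. 'a acute' -> 'a', by decomposing (NFD) and dropping the
    combining marks."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def normalize_for_search(s):
    """Uppercase the basic form for case-insensitive matching.
    A crude approximation of what a server could do for indexing."""
    return basic_form(s).upper()
```

A key could then be indexed once under its normalized form, so that
"Álvarez" and "alvarez" land in the same bucket.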

As far as translation is concerned, there is the problem of characters
that exist in the source charset but not in the target charset.  The
current code (in charset.c) unconditionally translates those characters
to a form that is representable in the target charset.  However, this
transformation is currently inconsistent (see code points 0xA6 and 0xA8
in T.61), non-reversible (e.g. the way '{' is dealt with), difficult to
extend to more code points in Unicode, and not necessarily a desired
feature under all scenarios.
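To see why such best-effort translation is non-reversible, consider a
toy fallback table (hypothetical, not the actual charset.c mappings):

```python
# Hypothetical best-effort fallbacks: several distinct source
# characters collapse onto the same target character, so the
# original can never be recovered from the output.
FALLBACK = {
    "\u00e1": "a",   # a acute  -> a
    "\u00e0": "a",   # a grave  -> a
    "\u0142": "l",   # l stroke -> l
}

def best_effort(ch):
    """Return a representable stand-in, or the character unchanged."""
    return FALLBACK.get(ch, ch)
```

Since both 'a acute' and 'a grave' come out as plain 'a', writing the
translated value back to the directory would silently corrupt the entry.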

I propose to have two orthogonal, independently-chosen options.  The
first will control whether this latter kind of translation is done at
all.  The second controls whether a "best-effort" translation is
acceptable or the operation should fail when a character code is found
on input that cannot be translated into the target charset.  Note that
even if an attempt is made to translate non-representable codes to some
form, Unicode is so large that the possibility of finding something
that is not understood is significant.  This may include the case of
malformed T.61 or UTF-8 sequences.  Read-only scenarios may be liberal
in their translations, but when updates are possible, it may be more
convenient to be strict, to minimize the chances of data corruption or
misinterpretation.
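How the two orthogonal options might combine can be sketched as follows
(Python with invented names, purely to show the semantics; the real
implementation would be C inside libldap):

```python
class TranslationError(Exception):
    """Raised in strict mode on input that cannot be translated."""

def translate(s, table, fallbacks, use_fallbacks=True, strict=False):
    """Translate s character by character.

    table         -- exact mappings into the target charset
    fallbacks     -- best-effort forms for non-representable characters
    use_fallbacks -- first option: attempt fallback forms at all
    strict        -- second option: fail rather than emit a stand-in
                     when no translation is found
    """
    out = []
    for ch in s:
        if ch in table:
            out.append(table[ch])
        elif use_fallbacks and ch in fallbacks:
            out.append(fallbacks[ch])
        elif strict:
            raise TranslationError(repr(ch))
        else:
            out.append("?")          # last-resort stand-in, liberal mode
    return "".join(out)
```

A read-only client might run with strict=False, while a server
accepting updates would set strict=True so unknown input is rejected
instead of stored in a mangled form.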

So, for the library, there are four options:

	- Charset on the wire, defaulted appropriately (this cannot be
	  determined until we know which protocol version we are
	  talking)
	- Charset on the API side, maybe defaulted from the locale
	- Translate non-mappable characters
	- Accept errors

Translation is the identity mapping if both charsets are equal.
For each server port, there are similar options.  The ldapd will talk
T.61 with the X.500 DAP servers. The backends are on their own on this.
They will have access to the routines, but that's all.
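Grouped per connection (or per server port), the four options might
look like this (a sketch with hypothetical names):

```python
from dataclasses import dataclass

@dataclass
class CharsetOptions:
    wire_charset: str               # defaulted once the version is known
    api_charset: str                # maybe taken from the locale
    translate_unmappable: bool = True
    accept_errors: bool = True

    def needs_translation(self):
        # Translation is the identity mapping when both charsets are
        # equal, so the whole machinery can be skipped.
        return self.wire_charset != self.api_charset
```

The needs_translation() check is where the ASCII/Latin-1 special-casing
mentioned earlier would hook in, avoiding any per-character work on the
common path.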

This, of course, won't be available immediately, but rather in the
window for OpenLDAP 2.0.

Comments, please.

Now, a request.  I need some info on T.61.  In particular:

	- Clarification on the meaning of code points 0xA6 and 0xA8
	- What are code points 0xD8 to 0xDB and 0xE5?
	- What diacritics are represented by 0xC0, 0xC9 and 0xCC?
	- The list of known digraphs (the {xy} forms)
	- Any useful Web-accessible resource

Would some kind soul with access to the standard provide any help in
this direction?  Please.

Thanks in advance,

Julio