
Re: UTF-8/T.61 API Proposal



At 04:14 PM 11/13/98 -0800, John Kristian wrote:
>How will client software know which attributes are character strings, and
>which aren't?

By client software, I assume you mean the implementation of the C API.
The implementation of the C API knows which attributes are character
strings by their ASN.1 encoding.

>Design for modular addition of a new charset (in this case, algorithms to
>convert between a new charset and UTF-8).  Conversion algorithms are very
>much a matter of local preference.

Yes, that's why I think it is important to provide a simple mechanism to
specify alternative character sets.  I am thinking of proposing a
"named translation" mechanism.

>And there is far too much conversion software
>in use to attempt to collect it all into one library.

I don't intend to reimplement conversion software.  We just need
to collect the necessary routines (utf-8<->8859) to support "local"
character set translation off the net.

---

Here is a bit more discussion on design/implementation of
string translation (in the context of ldap-c-api-01).

Many implementations (like OpenLDAP and Netscape) already provide
hooks in their BER string encode/decode functions to support
alternative character translations.

Assuming no alternative translations have been requested by the
application, the implementation is required to convert strings being
read off an LDAPv2 session from T.61 to UTF-8, and to do the inverse
when writing.  For LDAPv3, no translation is required as the strings
are already in UTF-8.

(all code is pseudocode)

Step 1. (no alternative string translations)
Implementing this just requires t61_to_utf8() and utf8_to_t61()
functions, plugged into the BER encode/decode string functions.

Step 2. (alternative string translations)
The BER encode/decode functions will each be extended to support
two translation hooks instead of one.
They will look something like:
	(*decode_wire_to_internal)()
	(*decode_internal_to_external)()
and
	(*encode_external_to_internal)()
	(*encode_internal_to_wire)()

The default settings for these will be:
	decode_wire_to_internal = t61_to_utf8;
	decode_internal_to_external = NULL;
and
	encode_external_to_internal = NULL;
	encode_internal_to_wire = utf8_to_t61;
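
To make the two-stage hook chain concrete, here is a minimal sketch in C.
Only the four hook names and the default settings come from the text
above.  The hook signature (xlate_fn), the struct, the helper functions
and the sketch_* entry points are assumptions made for illustration, and
the two converters are toy stand-ins that handle a single character
(u with diaeresis, taken here to be 0xC8 'u' in T.61); they are not a
real T.61 table.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical hook type: returns a newly malloc'd string, NULL on failure. */
typedef char *(*xlate_fn)(const char *in);

/* The four hooks from Step 2; NULL means "no translation at this stage". */
struct xlate_hooks {
    xlate_fn decode_wire_to_internal;
    xlate_fn decode_internal_to_external;
    xlate_fn encode_external_to_internal;
    xlate_fn encode_internal_to_wire;
};

static char *dup_string(const char *in)
{
    size_t n = strlen(in) + 1;
    char *out = malloc(n);
    if (out != NULL)
        memcpy(out, in, n);
    return out;
}

/* Copy `in`, replacing each occurrence of `from` (non-empty) with `to`. */
static char *replace_all(const char *in, const char *from, const char *to)
{
    size_t flen = strlen(from), tlen = strlen(to), hits = 0;
    const char *p;
    char *out, *q;

    for (p = in; (p = strstr(p, from)) != NULL; p += flen)
        hits++;
    out = malloc(strlen(in) - hits * flen + hits * tlen + 1);
    if (out == NULL)
        return NULL;
    q = out;
    while ((p = strstr(in, from)) != NULL) {
        memcpy(q, in, (size_t)(p - in));
        q += p - in;
        memcpy(q, to, tlen);
        q += tlen;
        in = p + flen;
    }
    strcpy(q, in);
    return out;
}

/* Toy stand-ins: translate only u-with-diaeresis, copy everything else. */
static char *t61_to_utf8(const char *in)
{
    return replace_all(in, "\xC8" "u", "\xC3\xBC");
}

static char *utf8_to_t61(const char *in)
{
    return replace_all(in, "\xC3\xBC", "\xC8" "u");
}

/* Run a string through two optional stages; the caller frees the result. */
static char *apply_two(xlate_fn first, xlate_fn second, const char *in)
{
    char *mid, *out;

    assert(in != NULL);
    mid = first ? first(in) : dup_string(in);
    if (mid == NULL || second == NULL)
        return mid;
    out = second(mid);
    free(mid);
    return out;
}

static char *sketch_decode_string(const struct xlate_hooks *h, const char *wire)
{
    return apply_two(h->decode_wire_to_internal,
                     h->decode_internal_to_external, wire);
}

static char *sketch_encode_string(const struct xlate_hooks *h, const char *app)
{
    return apply_two(h->encode_external_to_internal,
                     h->encode_internal_to_wire, app);
}

/* Default settings for an LDAPv2 session, as listed above. */
static const struct xlate_hooks v2_defaults = {
    t61_to_utf8, NULL, NULL, utf8_to_t61
};
```

With v2_defaults, a string decoded off the wire comes back as UTF-8 and
encoding that result yields the original wire bytes, since the external
stages are NULL and are skipped.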


Let's say we have a mechanism to specify an alternative encoding
by name... something like
	ldap_set_option(ld, LDAP_API_FEATURE_X_CHARSET, "T.61");

To implement this particular translation, the implementation could
set either (assuming the session is still v2):
	decode_wire_to_internal = NULL;	/* don't translate to UTF8 */
	encode_internal_to_wire = NULL;	/* don't translate from UTF8 */
or:
	decode_internal_to_external = utf8_to_t61;
	encode_external_to_internal = t61_to_utf8;

This latter case requires strings to be translated from t61->utf8->t61.
Is bypassing the UTF-8 translation good or bad?  It depends.
Our default will likely be to bypass, to ensure that an updated
application (which used the Umich API/LDAPv2) can be upgraded without
causing translation irregularities.
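
The difference between the two settings can be sketched in C by counting
converter calls on the decode path.  Everything except the four hook
names is an assumption for illustration: the xlate_fn signature, the
set_t61_* helpers and sketch_decode() are hypothetical, and the
converters are stand-ins that copy bytes unchanged and only bump a
counter, so the call count is the only meaningful result here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef char *(*xlate_fn)(const char *in);   /* hypothetical hook type */

struct xlate_hooks {
    xlate_fn decode_wire_to_internal;
    xlate_fn decode_internal_to_external;
    xlate_fn encode_external_to_internal;
    xlate_fn encode_internal_to_wire;
};

static int conversions;   /* number of converter calls so far */

static char *dup_string(const char *in)
{
    size_t n = strlen(in) + 1;
    char *out = malloc(n);
    if (out != NULL)
        memcpy(out, in, n);
    return out;
}

/* Stand-ins: a real converter would remap characters, these only count. */
static char *t61_to_utf8(const char *in) { conversions++; return dup_string(in); }
static char *utf8_to_t61(const char *in) { conversions++; return dup_string(in); }

/* First setting: bypass UTF-8 entirely on a v2 session. */
static void set_t61_bypass(struct xlate_hooks *h)
{
    h->decode_wire_to_internal = NULL;      /* don't translate to UTF8 */
    h->decode_internal_to_external = NULL;
    h->encode_external_to_internal = NULL;
    h->encode_internal_to_wire = NULL;      /* don't translate from UTF8 */
}

/* Second setting: keep the v2 defaults and add the external stage. */
static void set_t61_via_utf8(struct xlate_hooks *h)
{
    h->decode_wire_to_internal = t61_to_utf8;
    h->decode_internal_to_external = utf8_to_t61;
    h->encode_external_to_internal = t61_to_utf8;
    h->encode_internal_to_wire = utf8_to_t61;
}

/* Decode path: wire -> internal -> external, skipping NULL stages. */
static char *sketch_decode(const struct xlate_hooks *h, const char *wire)
{
    xlate_fn stage[2];
    char *cur, *next;
    int i;

    assert(h != NULL && wire != NULL);
    stage[0] = h->decode_wire_to_internal;
    stage[1] = h->decode_internal_to_external;
    cur = dup_string(wire);
    for (i = 0; i < 2 && cur != NULL; i++) {
        if (stage[i] == NULL)
            continue;
        next = stage[i](cur);
        free(cur);
        cur = next;
    }
    return cur;
}
```

With the bypass setting a decode performs no conversions at all; with
the second setting it performs two.  A real t61->utf8->t61 round trip is
where any translation irregularities would show up, which is the
argument for bypassing by default.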

I am not sure if other named translations should go through utf8
or not.  But I am sure we'll have plenty of debate here. 

Of course, "named translations" is only a simple interface to
translation.  I expect implementations to provide only a few named
translations, if any (as the burden is on them).  In fact, probably
just whatever the "local" character set is and "T.61".  These named
translations should cover most applications' needs.
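
One way an implementation might organize a small set of named
translations is a lookup table keyed by name.  The sketch below is an
assumption throughout: the struct, the table, find_named_translation()
and the choice of "ISO-8859-1" as the "local" character set are all
hypothetical.  The Latin-1 converters are included only so the example
is self-contained; an implementation would normally reuse existing
conversion routines.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef char *(*xlate_fn)(const char *in);   /* hypothetical hook type */

/* ISO-8859-1 -> UTF-8: bytes >= 0x80 become a two-byte sequence. */
static char *latin1_to_utf8(const char *in)
{
    const unsigned char *p = (const unsigned char *)in;
    char *out = malloc(2 * strlen(in) + 1), *q = out;

    if (out == NULL)
        return NULL;
    for (; *p; p++) {
        if (*p < 0x80) {
            *q++ = (char)*p;
        } else {
            *q++ = (char)(0xC0 | (*p >> 6));
            *q++ = (char)(0x80 | (*p & 0x3F));
        }
    }
    *q = '\0';
    return out;
}

/* UTF-8 -> ISO-8859-1: NULL if a character does not fit in Latin-1. */
static char *utf8_to_latin1(const char *in)
{
    const unsigned char *p = (const unsigned char *)in;
    char *out = malloc(strlen(in) + 1), *q = out;

    if (out == NULL)
        return NULL;
    while (*p) {
        if (*p < 0x80) {
            *q++ = (char)*p++;
        } else if ((*p == 0xC2 || *p == 0xC3) && (p[1] & 0xC0) == 0x80) {
            *q++ = (char)(((*p & 0x03) << 6) | (p[1] & 0x3F));
            p += 2;
        } else {
            free(out);              /* outside Latin-1, or malformed */
            return NULL;
        }
    }
    *q = '\0';
    return out;
}

struct named_translation {
    const char *name;
    xlate_fn to_external;     /* would become decode_internal_to_external */
    xlate_fn from_external;   /* would become encode_external_to_internal */
    int bypass_wire;          /* nonzero: also clear the wire hooks (v2) */
};

static const struct named_translation known[] = {
    { "T.61",       NULL,           NULL,           1 },
    { "ISO-8859-1", utf8_to_latin1, latin1_to_utf8, 0 },
};

/* Exact-match lookup; returns NULL for a name the table does not have. */
static const struct named_translation *find_named_translation(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof known / sizeof known[0]; i++)
        if (strcmp(known[i].name, name) == 0)
            return &known[i];
    return NULL;
}
```

A set-option handler could look the name up, fail cleanly on NULL, and
otherwise copy the entry's hooks into the session.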

For the few applications that need custom translations, I plan to
develop an API that will provide the application with complete
control over translation.  Implementing the API for this is actually
fairly straightforward.  The burden here is on the application
developer to 1) provide the translation routines and
2) use the API properly.  More on this later.

Kurt