Charset translations: Request for comments



Hi all,

I am working on the charset translation routines that OpenLDAP needs
in several areas.  So that the interface is understood in context, I
will first present those areas.

The most important area comes from having to support both v2 and
v3 in the same servers.  When a v2 connection is made to slapd or
ldapd, the standard requires that it speak a CCITT (ITU) charset
known as T.61, the one spoken by the X.500 DAP.  For v3, however,
Unicode in its UTF-8 encoding is mandated instead.  Since the
databases are in one format or the other, the servers have to
translate whenever the client's protocol version does not match the
database format.  Besides, ldapd may need to translate from/to T.61
when it is serving a v3 client.
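
As a rough illustration of that rule, here is a minimal sketch in C.
The helpers wire_charset and needs_translation are hypothetical names,
not part of OpenLDAP, and charsets are identified by plain strings only
to keep the example short.

```c
#include <string.h>

/* Hypothetical helper: the charset each protocol version speaks on the wire. */
static const char *wire_charset(int protocol_version)
{
    return protocol_version == 2 ? "T.61" : "UTF-8";
}

/* Translation is needed only when the storage and wire charsets differ. */
static int needs_translation(const char *storage_charset, int protocol_version)
{
    return strcmp(storage_charset, wire_charset(protocol_version)) != 0;
}
```

With a database kept in UTF-8, needs_translation("UTF-8", 2) is non-zero
and needs_translation("UTF-8", 3) is zero.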

We propose that in the future we keep our native databases in UTF-8
format, for two reasons:

	- It is the richest charset
	- It optimizes for the foreseen prevalence of v3 clients and
	  the relatively small population of X.500 DAP servers

This takes us to the second area of application, which comes from the
need to support database backends that are not in our native charset,
for any of a number of reasons.  For example, the database might be
maintained or exploited by local applications outside the domain
of OpenLDAP.  In this case, the backend has to do the translation
between the local charset and UTF-8/T.61.

The third area of application comes from user-interface concerns.
Since user environments, except for a few cases, support neither UTF-8
nor T.61, all clients of the LDAP protocol need to translate between
the local charset and UTF-8/T.61 so that the user can obtain a
meaningful presentation of the data and can modify it.  Too
frequently, the user's local charset is very limited and can only
represent a subset of the Unicode repertoire.  In many cases, the
subset is fully sufficient for the user: it is unlikely that she would
ever see a character outside her local charset or need to type
one.  In other cases, the user must be given a representation of
characters outside her charset and, if at all possible, a method to type
them.  An approach for this is presented in RFC1345 and implemented,
with some differences, in libraries/libldap/charset.c, which is otherwise
too limited in its design to be extended for our new needs.

It is extremely unfortunate that the client *has* to know what attributes
are to be translated.  For the time being, we provide no guidance on
this (but this annoys me so much that I *will* work out something).

Did I leave out anything?

All that said, I have decided to make my current code available so that
developers and others interested can get a glimpse of what I am doing and
perhaps offer comments that can enhance it or keep it from going too far
in the wrong direction before it is added to OpenLDAP.  To get the
current code, get your cvs client and do:

	cvs -d :pserver:anonymous@andromeda.stl.es:/home/OpenLDAP/cvsroot login

Password is 'charset'.

	cvs -d :pserver:anonymous@andromeda.stl.es:/home/OpenLDAP/cvsroot checkout translation

If you ran this from the directory that contains the OpenLDAP source
directory (in other words, one level up from the OpenLDAP source) and you
have already run configure, you may now:

	cd translation
	make

It might even compile.  The current arrangement is purely temporary; the
pieces will be distributed differently when the code is committed to the
OpenLDAP CVS repository.

Now I present some highlights from the public interface (which can be
found in ldap_charset.h).  A Charset is a struct that contains all the
info necessary to translate from/to UTF-8.  Charsets come in
several flavours depending on their complexity.  For instance, the memory
used to support the ISO-8859 'simple' charsets is small, while T.61 is a
'composed' charset with a medium footprint.  Shift-JIS and ISO-2022-JP
look like 'composed' charsets, but other charsets that support CJK
languages are in the 'double' category and have large memory
requirements.  I have implemented 'simple' and 'composed' charsets, at
least in part, and I am not sure about the need to support 'double'
charsets, since the resulting 'composed' code would correctly support
'double' charsets with little or no change.

But I really don't know.  I am very ignorant about CJK languages so this
part is going to be supported poorly for some time.
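
To make the 'simple' flavour concrete, here is a minimal sketch that
models such a charset as a 256-entry table of Unicode code points.  This
is an assumption about what 'simple' means, not the actual struct behind
Charset; the names SimpleCharset, init_latin9 and simple_from_unicode
are hypothetical.  ISO-8859-15 is used because it differs from
ISO-8859-1 in only eight positions.

```c
#include <stddef.h>

/*
 * Hypothetical model of a 'simple' charset: one Unicode code point per
 * byte value.  Not the struct used in ldap_charset.h.
 */
typedef struct {
    unsigned short to_unicode[256];
} SimpleCharset;

/* Build ISO-8859-15: identical to ISO-8859-1 except for eight positions. */
static void init_latin9(SimpleCharset *cs)
{
    static const struct { unsigned char byte; unsigned short ucs; } diff[] = {
        { 0xA4, 0x20AC }, { 0xA6, 0x0160 }, { 0xA8, 0x0161 }, { 0xB4, 0x017D },
        { 0xB8, 0x017E }, { 0xBC, 0x0152 }, { 0xBD, 0x0153 }, { 0xBE, 0x0178 }
    };
    size_t i;

    for (i = 0; i < 256; i++)
        cs->to_unicode[i] = (unsigned short)i;
    for (i = 0; i < sizeof diff / sizeof diff[0]; i++)
        cs->to_unicode[diff[i].byte] = diff[i].ucs;
}

/* Reverse lookup: the byte for a code point, or -1 if it is not in the charset. */
static int simple_from_unicode(const SimpleCharset *cs, unsigned long ucs)
{
    int b;

    for (b = 0; b < 256; b++)
        if (cs->to_unicode[b] == ucs)
            return b;
    return -1;
}
```

The table takes 512 bytes, which fits the small footprint mentioned for
the ISO-8859 family.  A linear reverse lookup is the simplest choice
here; a real implementation would probably want something faster.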

BTW, the discussion about TranslationMode below is tentative; none of
it is implemented yet (i.e. there are no digraphs yet, since I still need
to load and use them, and the pedantic checks for UTF-8 are not there yet
either).  Otherwise, the routines are functional.
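
As an illustration of what a pedantic UTF-8 check could involve, here is
a minimal sketch.  The function utf8_is_strict is a hypothetical name
and not part of ldap_charset.h.  It follows the current definition of
UTF-8 (at most four bytes, nothing above U+10FFFF), which is an
assumption on my part; the real routines may draw the line elsewhere.

```c
#include <stddef.h>

/*
 * Hypothetical pedantic check: returns 1 if buf[0..len) is well-formed
 * UTF-8 and 0 otherwise.  Rejects stray continuation bytes, truncated
 * sequences, overlong forms, surrogates and code points above U+10FFFF.
 */
static int utf8_is_strict(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        unsigned long cp, min;
        size_t need, k;

        if (c < 0x80) { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;               /* continuation or invalid lead byte */

        if (len - i <= need)
            return 0;                /* truncated sequence */
        for (k = 1; k <= need; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;            /* not a continuation byte */
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFFUL || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;                /* overlong, out of range, or surrogate */
        i += need + 1;
    }
    return 1;
}
```

A lenient mode would instead try to recover from some of these cases
rather than return 0 at the first one.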

There is no sample application code yet; I will create a working example
as soon as possible and put it there.

All that said, here we go:

/*
 * Charset is an opaque type.  All accesses to Charset happen through
 * pointers.  We provide a type stub here for applications to use.
 */
typedef struct Charset Charset;

/*
 * Used to create a Charset from a mapping from ftp.unicode.org if you
 * have them (we cannot legally distribute them).  Will return NULL
 * on error.  As a side-effect, adds the created Charset to the pool of
 * known charsets.
 */

LDAP_F Charset *
ldap_load_charset_from_unicode_mapping LDAP_P((char * name, char * filespec));

/*
 * This function creates a Charset from a WG15 charmap file.  These
 * files are maintained by Keld Simonsen.  A subset of these files is
 * distributed as part of the locale support on some systems, maybe in
 * /usr/share/i18n/charmaps.  Otherwise, the WG15 archive maintained
 * by Keld Simonsen can be found at:
 *
 *	ftp://dkuug.dk/i18n/WG15-collection
 *
 * This routine will return NULL on error.  As a side-effect, adds
 * the created Charset to the pool of known charsets.
 */

LDAP_F Charset *
ldap_load_charset_from_charmap_file LDAP_P((char * name, char * filespec));

/*
 * Returns the Charset, given its name or one of its aliases, from the
 * pool of known charsets.  Returns NULL on error.
 */

LDAP_F Charset *
ldap_get_charset_by_name LDAP_P((char * name));

/*
 * Translation routines take an argument of type TranslationMode.
 * It is a bit-mask and its meaning is presented below.
 */

typedef short TranslationMode;
/* Croak if the request could not be fulfilled exactly */
#define TRANSMODE_STRICT	0
/*
 * Generate and accept digraphs in the style of RFC1345 for characters
 * not available in the Charset.  For characters for which a digraph form
 * is not known or cannot be represented in the Charset a sequence using
 * UHHHH is used instead where HHHH stands for the hexadecimal form
 * of the Unicode code position.  Digraphs or UHHHH forms are enclosed
 * by delimiters begin_char and end_char as required by the Charset.
 * These are typically either <> or {}.  If this bit is not set, any
 * attempt to map a character whose mapping is not known will make
 * the translation fail.
 */
#define TRANSMODE_DIGRAPHS	0x1
/*
 * Accept malformed UTF-8 sequences by making an effort to interpret
 * them.  Otherwise be pedantic and signal an error if the slightest
 * error is found.
 */
#define TRANSMODE_UTF8_ERRORS	0x2

/*
 * Important Note:  kdz has expressed doubts about this interface that
 * is inspired by the interface of the translation routines used by
 * liblber and libldap/charset.c.  So this may change as soon as we
 * make up our minds about what is best.
 */

/*
 * Translate a UTF-8 stream into a character stream expressed in Charset
 * cs.  Returns non-zero on error.  On entry, the input is the area pointed
 * to by *bufp, with length *buflenp.  On return, *bufp points to the output
 * and *buflenp holds its length.  The output area is allocated by this
 * routine.  If free_input is non-zero, the input area will be freed.
 */

LDAP_F int
ldap_utf8_to_charset LDAP_P((Charset * cs, TranslationMode mode, char **bufp, unsigned long *buflenp, int free_input));

/*
 * Translate a character stream expressed in Charset cs into a UTF-8 stream.
 * Usage as above.
 */

LDAP_F int
ldap_charset_to_utf8 LDAP_P((Charset * cs, TranslationMode mode, char **bufp, unsigned long *buflenp, int free_input));

/*
 * Translate a character stream expressed in Charset fromcs into a character
 * stream expressed in Charset tocs.  NULL in either Charset argument stands
 * for UTF-8.  Usage is otherwise as above.
 */

LDAP_F int
ldap_translate_charsets LDAP_P((Charset * fromcs, Charset * tocs, TranslationMode mode, char **bufp, unsigned long *buflenp, int free_input));

/*
 * Translate a character stream expressed in Charset cs into a stream
 * suitable to be sent over the connection designated by ld.  Essentially
 * it translates into UTF-8 or T.61.
 */

LDAP_F int
ldap_encode LDAP_P((LDAP * ld, Charset * cs, TranslationMode mode, char **bufp, unsigned long *buflenp, int free_input));

/*
 * Translate a character stream received from the connection
 * designated by ld into a character stream expressed in Charset cs.
 * Essentially it translates from UTF-8 or T.61.
 */

LDAP_F int
ldap_decode LDAP_P((LDAP * ld, Charset * cs, TranslationMode mode, char **bufp, unsigned long *buflenp, int free_input));

/*
 * Generate a C file designated by filespec that defines a routine with
 * name charset_init_label (where label stands for the argument by that name).
 * When compiled and linked in, calling charset_init_label will make the
 * Charset designated by cs available for use by adding it to the pool of
 * known charsets.  Returns non-zero on error.
 */

LDAP_F int
ldap_dump_charset LDAP_P((Charset * cs, char * label, char * filespec));
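
To show how the in/out buffer convention and the mode bits could fit
together, here is a minimal self-contained sketch.  The function
demo_utf8_to_latin1 is a hypothetical stand-in for ldap_utf8_to_charset
with ISO-8859-1 as a fixed target; it is not the real implementation.
It assumes well-formed input of at most three bytes per character, uses
<> as delimiters, leaves the buffer untouched on error, and emits only
the UHHHH form since no RFC1345 digraph table is included.

```c
#include <stdio.h>
#include <stdlib.h>

typedef short TranslationMode;
#define TRANSMODE_STRICT   0
#define TRANSMODE_DIGRAPHS 0x1

/*
 * Hypothetical stand-in for ldap_utf8_to_charset() targeting ISO-8859-1.
 * Returns non-zero on error, leaving *bufp and *buflenp untouched.
 */
static int demo_utf8_to_latin1(TranslationMode mode, char **bufp,
                               unsigned long *buflenp, int free_input)
{
    const unsigned char *in = (const unsigned char *)*bufp;
    unsigned long inlen = *buflenp, i = 0, o = 0;
    /* Worst case is a 2-byte sequence becoming the 7 bytes of "<UHHHH>". */
    char *out = malloc(inlen * 4 + 1);

    if (out == NULL)
        return -1;
    while (i < inlen) {
        unsigned long cp = in[i];
        unsigned long extra = cp < 0x80 ? 0 : cp < 0xE0 ? 1 : 2;
        unsigned long k;

        /* 4-byte sequences are out of scope here; also guard truncation. */
        if (cp >= 0xF0 || i + extra >= inlen) { free(out); return -1; }
        if (extra > 0) {
            cp &= (extra == 1) ? 0x1F : 0x0F;
            for (k = 1; k <= extra; k++)
                cp = (cp << 6) | (in[i + k] & 0x3F);
        }
        i += extra + 1;

        if (cp <= 0xFF) {
            out[o++] = (char)cp;               /* representable in Latin-1 */
        } else if (mode & TRANSMODE_DIGRAPHS) {
            o += (unsigned long)sprintf(out + o, "<U%04lX>", cp);
        } else {
            free(out);                         /* strict: no exact mapping */
            return -1;
        }
    }
    if (free_input)
        free(*bufp);
    *bufp = out;        /* the caller now owns the newly allocated output */
    *buflenp = o;
    return 0;
}
```

Given the three bytes of the UTF-8 euro sign (U+20AC), this sketch fails
under TRANSMODE_STRICT and produces the seven bytes <U20AC> under
TRANSMODE_DIGRAPHS.  In both cases the caller keeps ownership of the
input unless free_input is non-zero and the call succeeds.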

Well, that was it.  Let me know what you think.

Julio