
Re: Charset handling in the LDAP C API (Was: VM/ESA patches)




"Kurt D. Zeilenga" wrote:

> Translation
> between application required charsets and the API required
> charset is left as an exercise for the developer.  (please see
> LDAPext mailing list archives for background on this requirement).

You mean that it has been decreed that enhancing the API to provide
other charsets is something to be discouraged?  Did I get that right?
Can you point me to the mailing-list archive?  I don't seem to be
able to find it.

BTW, this approach has the problem that translation happens too far
away from the wire to know the encoding, or am I missing something?

> They could be implemented using two pairs of translators per
> local charset.
>         ebcdic_to_t61()/t61_to_ebcdic()
>         ebcdic_to_utf8()/utf8_to_ebcdic()
> 
> or they could be implemented such that the t61 vs utf8 choice
> was specified implicitly with a session handle.
>         ebcdic_to_ldap(ld, ...)
>         ldap_to_ebcdic(ld, ...)
> 
> or they could be implemented such that the local charset was
> specified as an argument:
>         ldap_encode(ld, "ebcdic", ...)
>         ldap_decode(ld, "ebcdic", ...)

I essentially vote for the third method.  Or, as a matter of fact,
I vote for the third method assisted by lower level, non-LDAP-specific
routines.
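
To make the layering concrete, here is a minimal sketch, not a proposed
final API.  All names in it (cs_to_utf8_fn, ascii_to_utf8, cs_registry,
sketch_ldap_encode) are hypothetical, the session handle is left out, and
only a trivial US-ASCII converter is registered; the point is only that the
LDAP-facing call dispatches by charset name to generic routines.

```c
#include <stddef.h>
#include <string.h>

/* Lower level, not LDAP-specific: convert a NUL-terminated string in some
 * local charset to UTF-8.  Returns bytes written (excluding the NUL) or -1. */
typedef long (*cs_to_utf8_fn)(const char *in, char *out, size_t outsz);

/* US-ASCII is a strict subset of UTF-8, so this only validates and copies. */
long ascii_to_utf8(const char *in, char *out, size_t outsz)
{
    size_t n = strlen(in);
    if (n + 1 > outsz)
        return -1;                      /* output buffer too small */
    for (size_t i = 0; i < n; i++)
        if ((unsigned char)in[i] > 0x7F)
            return -1;                  /* byte outside the ASCII range */
    memcpy(out, in, n + 1);
    return (long)n;
}

/* Hypothetical registry; real entries would be loaded at run time. */
static const struct {
    const char   *name;
    cs_to_utf8_fn to_utf8;
} cs_registry[] = {
    { "US-ASCII", ascii_to_utf8 },
};

/* LDAP-facing wrapper in the spirit of ldap_encode(ld, "ebcdic", ...).
 * A real version would also take the session handle and consult it to
 * choose between T.61 and UTF-8 on the wire; this sketch always emits UTF-8. */
long sketch_ldap_encode(const char *charset, const char *in,
                        char *out, size_t outsz)
{
    for (size_t i = 0; i < sizeof cs_registry / sizeof cs_registry[0]; i++)
        if (strcmp(cs_registry[i].name, charset) == 0)
            return cs_registry[i].to_utf8(in, out, outsz);
    return -1;                          /* unknown charset */
}
```

Adding a charset then means adding a registry entry, without touching the
LDAP-facing signature.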

The rationale is that I consider the set of translations open and,
barring any application override, the required charset should be taken
from the locale (typically LC_CTYPE).  A CGI script, say, should be
able to set any charset it wants, but ldapsearch should use whatever
the user requested: if LC_CTYPE says ISO-8859-2, it is reasonable to
assume that this is what should be shown to the user, and not
something else.
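
A minimal sketch of that lookup, assuming a POSIX system that provides
nl_langinfo(CODESET).  The helper name pick_charset is hypothetical and
not part of any existing API; it only shows the "override first, locale
second" order.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: return the charset name a client should use.
 * An explicit application override wins; otherwise fall back to the
 * codeset of the LC_CTYPE locale taken from the environment. */
const char *pick_charset(const char *override)
{
    if (override != NULL && override[0] != '\0')
        return override;

    /* Adopt LC_CTYPE from the environment (LC_ALL, LC_CTYPE, LANG).
     * If this fails, the current locale is left unchanged. */
    (void)setlocale(LC_CTYPE, "");

    /* Codeset name of the current locale; the exact spelling of the
     * name is system-dependent. */
    return nl_langinfo(CODESET);
}
```

A CGI script would pass its own charset name as the override, while
ldapsearch would pass NULL and get whatever the user's locale says.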

> BTW, folks interested in designing or implementing a
> charset translation infrastructure for OpenLDAP are
> more than welcomed to do so.

I have started working on it.  I have a few questions:

	- Which component is responsible for presenting a uniform
	  interface to the backends?  It is foreseeable that some
	  backends will use a storage charset other than UTF-8/T.61,
	  especially when a DBMS sits behind the backend.  Whether
	  the backends themselves or backend.c are responsible, they
	  need to know which charsets are in use.  Should the
	  slapd.conf syntax be extended to specify this?  No matter
	  what, since we will be serving both T.61 and UTF-8, we
	  need to know which one our backends are using.
	- Should the clients default to translating or to
	  non-translating?  In either case, a command-line flag is
	  needed to ask for the opposite.

My current plan is to use the tables prepared by Keld Simonsen that are
available at:

	ftp://dkuug.dk/i18n/WG15-collection/charmaps

The goal is to load them at runtime (some people will have noticed that
these tables are already included as a part of glibc systems, typically
at /usr/share/i18n/charmaps).  A small subset of those charmaps would be
shipped with OpenLDAP for the majority of users not having glibc.  The full
charmap collection is some 10 MB and many of the charmaps are in the esoteric
category, so I think shipping the whole collection would be unwarranted.  I
don't know the exact distribution terms, but the collection is being
distributed by Red Hat and others, so there cannot be unsolvable problems.
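
As a rough sketch of the runtime loading, the function below parses one
mapping line.  It assumes the glibc-style layout "<UXXXX>  /xHH  NAME" and
single-byte charsets only; the tables in the WG15 collection may use
symbolic character names instead and would need a different pattern.  The
name parse_charmap_line is hypothetical.

```c
#include <stdio.h>

/* Parse one charmap mapping line of the assumed form
 *     <U00E9>     /xe9         LATIN SMALL LETTER E WITH ACUTE
 * On success store the UCS code point and the charset byte and return 1.
 * Return 0 for anything else (comments, keywords, multi-byte entries). */
int parse_charmap_line(const char *line, unsigned *ucs, unsigned *byte)
{
    unsigned u = 0, b = 0;
    int end = 0;

    /* %n records how many characters were consumed; it does not count
     * towards sscanf's return value. */
    if (sscanf(line, " <U%4x> /x%2x%n", &u, &b, &end) != 2)
        return 0;
    if (line[end] == '/')
        return 0;           /* a second /xHH follows: multi-byte entry */

    *ucs = u;
    *byte = b;
    return 1;
}
```

Each accepted line fills one slot of a 256-entry byte-to-UCS table; the
output parameters are left untouched when a line is rejected.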

The table for T.61 (and maybe others) needs some work, but I hope to get a
more or less general solution.  The common ASCII/ISO-8859-1 translations
may be special-cased.
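
The ISO-8859-1 special case needs no table at all, because each byte value
equals its UCS code point.  A minimal sketch of the direction towards
UTF-8, with a hypothetical function name:

```c
#include <stddef.h>

/* Convert len bytes of ISO-8859-1 to UTF-8.  Returns the number of bytes
 * written, or (size_t)-1 if the output buffer is too small.  The output
 * is not NUL-terminated. */
size_t latin1_to_utf8(const unsigned char *in, size_t len,
                      unsigned char *out, size_t outsz)
{
    size_t o = 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = in[i];

        if (c < 0x80) {
            /* ASCII range: one byte, unchanged. */
            if (o + 1 > outsz)
                return (size_t)-1;
            out[o++] = c;
        } else {
            /* U+0080..U+00FF: two bytes, 110xxxxx 10xxxxxx. */
            if (o + 2 > outsz)
                return (size_t)-1;
            out[o++] = (unsigned char)(0xC0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return o;
}
```

In the worst case the output is twice as long as the input, so a caller
can size the buffer at 2 * len.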

Julio