
Re: Charset handling in the LDAP C API (Was: VM/ESA patches)



Julio Sánchez Fernández wrote:
> "Kurt D. Zeilenga" wrote:
> > Translation
> > between application required charsets and the API required
> > charset is left as an exercise for the developer.  (please see
> > LDAPext mailing list archives for background on this requirement).
> 
> You mean that it has been decreed that enhancing the API to provide
> other charsets is something to be discouraged?

No, I mean that the implementation (library) is limited by the protocol
specification and the implementation's knowledge (i.e., no knowledge) of
the schema.  The library cannot assume all octet strings are translatable
(as was done in the U-Mich STR_TRANSLATION code).  It would be nice if
the wire protocol actually tagged character strings differently from
arbitrary octet strings, but it doesn't.

> Can you point me to the mailing-list archive?

Much of the discussion dates back to draft-01 of the C-API, which
required all strings to be UTF-8.  I pointed out that LDAPv2 required
T.61 and suggested that implementations provide string translations.  It
was pointed out that doing string translations within the implementation
is problematic due to lack of schema knowledge.  I, in turn, suggested
that the API spec be modified to require UTF-8 when using LDAPv3 and
T.61 (or ASCII) when using LDAPv2.  draft-02 was modified per this
suggestion.

Discussion of the string translation issues can be found on this list's
archives:
	http://www.openldap.org/lists/openldap-devel/

The API spec issues were also discussed on LDAPext.  Mark Wahl, I believe,
was maintaining an archive.  The IETF LDAPext page should have a reference,
but doesn't.  Does anyone have a reference handy?

> BTW, this approach has the problem that translation happens too far
> away from the wire to know the encoding or am I missing something?

At the wire you have an octet string.  You don't know if it's a
translatable string or not.

> > They could be implemented using two pairs of translators per
> > local charset.
> >         ebcdic_to_t61()/t61_to_ebcdic()
> >         ebcdic_to_utf8()/utf8_to_ebcdic()
> >
> > or they could be implemented such that the t61 vs utf8 choice
> > was specified implicitly by a session handle.
> >         ebcdic_to_ldap(ld, ...)
> >         ldap_to_ebcdic(ld, ...)
> >
> > or they could be implemented such that the local charset was
> > specified as an argument:
> >         ldap_encode(ld, "ebcdic", ...)
> >         ldap_decode(ld, "ebcdic", ...)
> 
> I essentially vote for the third method.  Or, as a matter of fact,
> I vote for the third method assisted by lower level, non-LDAP-specific
> routines.

Please note that all translators shown above must be explicitly called by
the application on a per-string basis and do NOT change the behavior of
the primary API calls.  The third method only uses the ld session to
determine whether the string should be encoded/decoded to/from UTF-8 or T.61
(by calling ldap_get_option(ld, LDAP_OPT_PROTOCOL_VERSION, &ver)).
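
To make the version-to-charset decision concrete, here is a minimal sketch.
The helper names wire_charset_for_version() and needs_translation() are
hypothetical and belong to no LDAP library.  The version argument is assumed
to be the value an application obtained from ldap_get_option() as described
above; the sketch takes it as a plain int so that it compiles without libldap.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: the charset the API requires for character
 * strings under a given LDAP protocol version. */
static const char *wire_charset_for_version(int version)
{
    switch (version) {
    case 3:  return "UTF-8";  /* LDAPv3 */
    case 2:  return "T.61";   /* LDAPv2 (ASCII is also acceptable) */
    default: return NULL;     /* unknown version: do not guess */
    }
}

/* Hypothetical helper: returns 1 if strings in local_charset must be
 * converted before use, 0 if they are already in the required charset,
 * and -1 if the protocol version is not recognized.
 * Charset names are compared exactly (case-sensitive) for simplicity. */
static int needs_translation(const char *local_charset, int version)
{
    const char *wire = wire_charset_for_version(version);

    assert(local_charset != NULL);
    if (wire == NULL)
        return -1;
    return strcmp(local_charset, wire) != 0;
}
```

An ldap_encode(ld, "ebcdic", ...) style routine could start with a check of
this kind and then dispatch to the matching translator.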

> The rationale is that I consider the set of translations open and, in fact,
> the required charset should be taken, barring any application override,
> from the locale (LC_CTYPE, typically). Say, a CGI script should be
> able to set any charset desired, but ldapsearch should use whatever
> the user requested and it is not unreasonable to assume that if LC_CTYPE
> is ISO-8859-2 then that is what should be dumped upon the user and
> not something else.

The ldap_search(3) call would always act exactly per spec, requiring UTF-8
for LDAPv3 and T.61 for LDAPv2.  The strings would be written to the wire
without any translation.

An application, ldapsearch(1) or a CGI app or whatever, would use an
auxiliary API (ldap_encode/ldap_decode) to convert strings before/after
using primary API calls.  That is, NO translation is done by any of
the primary API calls.
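
As an illustration of what such an application-side conversion step might
look like, here is a minimal sketch of one translator.  It assumes the local
charset is ISO-8859-1 and the session is LDAPv3 (so the target is UTF-8);
both are assumptions chosen because that mapping is simple.  The name
latin1_to_utf8() is hypothetical, and a T.61 translator would need its own
mapping table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: convert a NUL-terminated ISO-8859-1 string to
 * UTF-8.  Returns the number of bytes written (excluding the NUL), or
 * (size_t)-1 if the output buffer is too small. */
static size_t latin1_to_utf8(const char *in, char *out, size_t outsize)
{
    const unsigned char *p;
    size_t n = 0;

    assert(in != NULL && out != NULL);
    for (p = (const unsigned char *)in; *p != '\0'; p++) {
        if (*p < 0x80) {
            /* ASCII range: one byte, unchanged */
            if (n + 1 >= outsize)
                return (size_t)-1;
            out[n++] = (char)*p;
        } else {
            /* 0x80..0xFF: two-byte UTF-8 sequence */
            if (n + 2 >= outsize)
                return (size_t)-1;
            out[n++] = (char)(0xC0 | (*p >> 6));
            out[n++] = (char)(0x80 | (*p & 0x3F));
        }
    }
    if (n >= outsize)
        return (size_t)-1;
    out[n] = '\0';
    return n;
}
```

The application would call this on each string it knows to be text (a
filter value, say) and hand the result to the primary API call.  Binary
values would be passed through untouched.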

> > BTW, folks interested in designing or implementing a
> > charset translation infrastructure for OpenLDAP are
> > more than welcomed to do so.
> 
> I have started working on it.  I have a few questions:

Great!
 
>         - What component is responsible for presenting a uniform interface
>           to the backends?  It is foreseeable that some backends might
>           use a storage charset different from UTF-8/T.61, especially
>           when a DBMS is behind the backend.  Whether the backends
>           themselves are responsible for this or backend.c, they need
>           to know what charsets are being used.  Should the slapd.conf
>           syntax be extended to specify this?  It seems that no matter
>           what, since we will be serving both T.61 and UTF-8, we need
>           to know which one our backends are using.

This is really quite a different issue from the library API issue.

The server itself must produce/consume UTF-8 when talking LDAPv3 and
produce/consume T.61 (or ASCII) when talking LDAPv2.

Assuming that the server stores strings in one of the two representations,
translation is required for one or the other protocol.  The server must have
schema knowledge to know which attributes are character strings and hence
need to be translated.
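
A minimal sketch of the kind of schema lookup this implies follows.  The
table is hand-written and purely illustrative: a real server would derive
this information from its loaded schema.  The names attr_kind, known_attrs
and attr_is_translatable() are hypothetical.  Attribute names are matched
exactly here for simplicity, although LDAP attribute names are
case-insensitive.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record: whether an attribute holds a character string. */
struct attr_kind {
    const char *name;
    int is_string;
};

/* Illustrative stand-in for real schema knowledge. */
static const struct attr_kind known_attrs[] = {
    { "cn",              1 },  /* character string */
    { "sn",              1 },  /* character string */
    { "jpegPhoto",       0 },  /* binary image data */
    { "userCertificate", 0 },  /* binary certificate */
};

/* Returns 1 if values of attr may be charset-translated, 0 otherwise.
 * Unknown attributes are treated as opaque octets and left alone. */
static int attr_is_translatable(const char *attr)
{
    size_t i;

    assert(attr != NULL);
    for (i = 0; i < sizeof known_attrs / sizeof known_attrs[0]; i++) {
        if (strcmp(attr, known_attrs[i].name) == 0)
            return known_attrs[i].is_string;
    }
    return 0;
}
```

Defaulting unknown attributes to "do not translate" is a design choice of
this sketch: translating a binary value corrupts it, while leaving a string
untranslated merely preserves the current behavior.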

Note:
  I do not know of any server which actually does this translation; most
  never translate strings.  That is, if you write a string using LDAPv3 and
  read it with LDAPv2 (or vice versa), the two are always byte-for-byte equal.

Personally, I would like to decree that the frontend<->backend interface
always utilize UTF-8 encoded character strings.  The conversion to/from
T.61/ASCII (for LDAPv2) would be done by the frontend.  Translation to/from
non-UTF-8 character representations could be done by a plugin
(be_string_encode/decode) or in a backend-specific manner.

>         - Should the clients default as translating or non-translating?

This is a per-application issue.  We'll eventually need to determine what the
default should be for OpenLDAP distributed clients such as ldapmodify.  For
now, I'd rather focus on API issues.

Kurt