
Re: String conversion routines in LDAP SDK



At 01:02 PM 10/23/00 -0600, Dave Steck wrote:
>We've had requests for routines in the LDAP SDK to convert between Multibyte, Unicode, and UTF-8 strings.

UTF-8 is a multiple-byte representation of Unicode.
There are, of course, other multiple-byte representations
(e.g. UTF-16) as well as wide character representations
(e.g. UCS-4, UCS-2) of Unicode.

ISO C (barely) supports multiple-byte and wide characters
(namely, it provides encoding conversion routines), but doesn't
mandate any particular representation or character set.  The
behavior of these conversion routines depends on
the current locale.
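
To make the locale dependence concrete, here is a minimal sketch
using only standard ISO C calls (setlocale, mbstowcs).  The helper
name convert_in_locale is hypothetical, and which locale names are
installed varies by platform.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: run mbstowcs() under the named LC_CTYPE locale,
 * then restore the previous one.  Returns the number of wide characters
 * written, or (size_t)-1 if the locale is unavailable or the input is
 * not valid in that locale's encoding. */
size_t convert_in_locale(const char *locale, const char *mb,
                         wchar_t *out, size_t max)
{
    char saved[256];
    const char *cur = setlocale(LC_CTYPE, NULL);
    size_t n;

    if (cur == NULL || strlen(cur) >= sizeof saved)
        return (size_t)-1;
    strcpy(saved, cur);            /* setlocale's buffer may be reused */

    if (setlocale(LC_CTYPE, locale) == NULL)
        return (size_t)-1;         /* locale not installed */

    n = mbstowcs(out, mb, max);    /* meaning of mb depends on the locale */

    setlocale(LC_CTYPE, saved);    /* restore the caller's locale */
    return n;
}
```

Plain ASCII input converts the same way everywhere, but the same
non-ASCII bytes may come out as one wide character under a UTF-8
locale and as something else (or an error) under another.  That is
the portability problem in a nutshell.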

So, with that said, what do you mean by
>Our developers would like a cross-platform way of doing this. 

Doing what?
  Converting between encodings of Unicode?
  Translating between charsets?
  Both?

Conversion/Translation between UTF-8 and C multiple-byte/wide
characters generally requires both encoding conversion and charset
mapping.  If the C multiple-byte/wide characters are
some (detectable) encoding of Unicode, then there is no
translation, just conversion.  However, in general, charset
translation is required.  This requires a mechanism to detect
which charset is being used and routines to do the actual
translation.

I suggest that if any charset translation is required, the
caller be required to provide routines for doing such.  If
not provided, the library would assume the locale supports
Unicode and, in particular, that wide characters are UCS-4
or UCS-2 and multiple-byte characters are UTF-8.  I would
suggest that a routine to do such translation be provided
as an additional argument.
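
A rough sketch of what such an interface might look like.  All names
here (wc_to_ucs_fn, wcs_to_utf8, latin9_to_ucs) are hypothetical and
not part of any existing SDK.  It assumes UTF-8 sequences of at most
four bytes and does not combine UTF-16 surrogate pairs.  The sample
callback maps ISO 8859-15 values to Unicode purely as an illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical callback: map one wide character in the local charset
 * to a Unicode code point.  Returns 0 on success, -1 if unmappable. */
typedef int (*wc_to_ucs_fn)(wchar_t wc, unsigned long *ucs);

/* Encode one code point as UTF-8; returns byte count, 0 if too large. */
static size_t ucs_to_utf8(unsigned long cp, unsigned char *b)
{
    if (cp < 0x80) { b[0] = (unsigned char)cp; return 1; }
    if (cp < 0x800) {
        b[0] = (unsigned char)(0xC0 | (cp >> 6));
        b[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        b[0] = (unsigned char)(0xE0 | (cp >> 12));
        b[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        b[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {
        b[0] = (unsigned char)(0xF0 | (cp >> 18));
        b[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        b[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        b[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Convert a NUL-terminated wide string to UTF-8.  If xlate is NULL the
 * wide characters are assumed to already be UCS-4/UCS-2 values.
 * Returns bytes written (excluding the NUL), or (size_t)-1 with errno
 * set to EILSEQ (unmappable) or ERANGE (output buffer too small). */
size_t wcs_to_utf8(const wchar_t *in, char *out, size_t outlen,
                   wc_to_ucs_fn xlate)
{
    size_t used = 0;

    for (; *in != L'\0'; in++) {
        unsigned long cp = (unsigned long)*in;
        unsigned char tmp[4];
        size_t n, i;

        if (xlate != NULL && xlate(*in, &cp) != 0) {
            errno = EILSEQ;
            return (size_t)-1;
        }
        n = ucs_to_utf8(cp, tmp);
        if (n == 0) { errno = EILSEQ; return (size_t)-1; }
        if (used + n >= outlen) { errno = ERANGE; return (size_t)-1; }
        for (i = 0; i < n; i++)
            out[used++] = (char)tmp[i];
    }
    if (used >= outlen) { errno = ERANGE; return (size_t)-1; }
    out[used] = '\0';
    return used;
}

/* Illustrative caller-supplied routine: ISO 8859-15 to Unicode. */
int latin9_to_ucs(wchar_t wc, unsigned long *ucs)
{
    if (wc < 0 || wc > 0xFF) return -1;
    switch (wc) {
    case 0xA4: *ucs = 0x20AC; break;   /* euro sign */
    case 0xA6: *ucs = 0x0160; break;
    case 0xA8: *ucs = 0x0161; break;
    case 0xB4: *ucs = 0x017D; break;
    case 0xB8: *ucs = 0x017E; break;
    case 0xBC: *ucs = 0x0152; break;
    case 0xBD: *ucs = 0x0153; break;
    case 0xBE: *ucs = 0x0178; break;
    default:   *ucs = (unsigned long)wc; break;
    }
    return 0;
}
```

With a NULL callback, the value 0xA4 is treated as U+00A4 and encodes
to two bytes.  With latin9_to_ucs it is translated to U+20AC first and
encodes to three.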

As noted above, the implementation should support both
UCS-4 and UCS-2.  The selection of which can likely
depend upon sizeof(wchar_t).  If only UCS-2 is
supported and the UTF-8 character doesn't fit, ERANGE
or other suitable error should be indicated.
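
As a sketch of the sizeof(wchar_t) idea, assuming UTF-8 sequences of
at most four bytes (the function name utf8_to_wc is hypothetical):

```c
#include <errno.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: decode one UTF-8 sequence from s (at most n
 * bytes) into *out.  Returns the number of bytes consumed, or -1 with
 * errno set to:
 *   EILSEQ  malformed, truncated, or overlong sequence
 *   ERANGE  code point does not fit in wchar_t (i.e. UCS-2 only) */
int utf8_to_wc(const unsigned char *s, size_t n, wchar_t *out)
{
    /* smallest code point that legitimately needs each length */
    static const unsigned long min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    unsigned long cp;
    int len, i;

    if (n == 0) { errno = EILSEQ; return -1; }

    if (s[0] < 0x80)                { cp = s[0];        len = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; len = 4; }
    else { errno = EILSEQ; return -1; }

    if ((size_t)len > n) { errno = EILSEQ; return -1; }

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) { errno = EILSEQ; return -1; }
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    if (cp < min_cp[len] || cp > 0x10FFFF) { errno = EILSEQ; return -1; }

    /* A 16-bit wchar_t is treated as UCS-2: nothing above U+FFFF fits. */
    if (sizeof(wchar_t) < 4 && cp > 0xFFFF) { errno = ERANGE; return -1; }

    *out = (wchar_t)cp;
    return len;
}
```

On a platform with a 32-bit wchar_t the ERANGE branch is never taken.
On one with a 16-bit wchar_t any four-byte UTF-8 sequence triggers it.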

If the UTF-8 character is longer than MB_CUR_MAX, ERANGE
or other suitable error should be indicated.
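
A minimal sketch of that check, assuming the sequence length is taken
from the UTF-8 lead byte alone (continuation bytes and overlong forms
are not validated here; the function names are hypothetical):

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>   /* MB_CUR_MAX */

/* Length of the UTF-8 sequence introduced by lead, or 0 if lead is
 * not a valid lead byte (e.g. a continuation byte). */
static int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

/* Returns 0 if the sequence starting with lead fits within the current
 * locale's MB_CUR_MAX; otherwise -1 with errno set to ERANGE (too
 * long) or EILSEQ (not a lead byte). */
int utf8_check_mb_cur_max(unsigned char lead)
{
    int len = utf8_seq_len(lead);

    if (len == 0) { errno = EILSEQ; return -1; }
    if ((size_t)len > (size_t)MB_CUR_MAX) { errno = ERANGE; return -1; }
    return 0;
}
```

In the default "C" locale MB_CUR_MAX is 1, so any non-ASCII UTF-8
character would be rejected with ERANGE.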


>Would OpenLDAP be open to the idea of adding such functionality?

Adding some basic support for Unicode encoding conversion
sounds reasonable.

The only charset translation I wouldn't mind seeing supported
is T.61 <-> Unicode (UTF-8)... but then again I wouldn't mind
just letting LDAPv2 (which mandates T.61 not UTF-8) wither.