Re: UTF-8 support, take 2

Julio Sanchez Fernandez writes:

> What is not planned to be supported is having, say, ISO 8859-5 on wire
> and ISO 8859-1 at the API provided by the library.

Oh, right.  Seems we were saying the same thing in different ways.

>>  * T.61<->unicode (...),
>>  * local charset <-> unicode (...)
> These two are relatively easy in the forward direction.  The reverse
> direction requires sparse tables, however.  A good design of the
> translation tables is critical.  I tried hash tables but could not
> find efficient hash functions that don't make a mess of the tables
> with collisions when tried with real data extracted from the charset
> tables.

Maybe it's already solved.  Check <URL:http://www.unicode.org/> and
<URL:ftp://ftp.unicode.org/Public/> for a translation library.  And/or
ask the Unicode mailing list, unicode@unicode.org.
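
For illustration, one alternative to hashing for the reverse direction is a
table sorted by code point and searched by bisection, which has no collisions
to worry about.  The following is only a rough sketch under my own
assumptions: it is not taken from any existing translation library, the
function names are made up, and it borrows Python's built-in "iso8859_5"
codec as the forward table.

```python
import bisect

def build_reverse_table(charset):
    """Return (code point, byte) pairs sorted by code point (hypothetical helper)."""
    pairs = []
    for byte in range(256):
        try:
            char = bytes([byte]).decode(charset)
        except UnicodeDecodeError:
            continue  # byte is unassigned in this charset
        pairs.append((ord(char), byte))
    pairs.sort()
    return pairs

def unicode_to_byte(table, char):
    """Binary-search the sorted table; return None if char has no mapping."""
    cp = ord(char)
    i = bisect.bisect_left(table, (cp, 0))
    if i < len(table) and table[i][0] == cp:
        return table[i][1]
    return None

table = build_reverse_table("iso8859_5")
zhe = unicode_to_byte(table, "\u0416")     # CYRILLIC CAPITAL LETTER ZHE
missing = unicode_to_byte(table, "\u00e9") # e-acute is not in ISO 8859-5
```

A lookup costs O(log n) comparisons and the table holds only the code points
the charset actually defines, so sparseness is not a problem.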

>> Not really.  Most data will be translatable to latin-1, since that's
>> what most of those who put data in the directory can handle.
> For the time being, it is.  But then we are not alone in the planet.

Good point, but still - very often, most or all data will be
translatable to the user's charset, because most directory operations
are on local/national data.

>> We may want to specify in which cases translation is done in the client:
>>  * whether or not to translate attributes with DN syntax,
> Why? Can you explain?  In any case, DNs in V3 are UTF-8 by definition.

Since DNs are sometimes data and sometimes text.  They are data when
used as base DNs for further directory operations - then it's OK to get a
reversible auto-translation.  They are data when generating certificates
and such things - then there must be *no* translation.

They are text when displayed, e.g. when we do ldap_explode_dn and
display the RDN.  Then we'll often want approximate translation to the
local charset, as with other text data.  Client authors are often lazy
and assume - or know - that their clients only work with data which can be
translated to the local charset - then they'll want auto-translation of
DNs to and from the local charset and forget about charset issues.
Well, I suppose a reversible auto-translation is best in that case.

DNs are not the only "both data and text" type, of course.  Just the
most prominent one.
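
The three cases above (no translation, reversible translation, approximate
translation) could be sketched as a per-call mode.  This is only a guess at
what such an interface might look like: DnMode and translate_dn are
hypothetical names, not part of any LDAP library, and "reversible" is
approximated here by a strict encode that fails rather than lose information.

```python
from enum import Enum

class DnMode(Enum):
    NONE = "none"                # data, e.g. certificates: leave the bytes alone
    REVERSIBLE = "reversible"    # base DNs: translate only when nothing is lost
    APPROXIMATE = "approximate"  # display text: substitute what cannot be mapped

def translate_dn(utf8_dn, local_charset, mode):
    """Translate a UTF-8 encoded DN to the local charset according to mode."""
    if mode is DnMode.NONE:
        return utf8_dn
    text = utf8_dn.decode("utf-8")
    if mode is DnMode.REVERSIBLE:
        # Strict encoding raises UnicodeEncodeError instead of dropping data.
        return text.encode(local_charset, errors="strict")
    # Approximate: unmappable characters become "?".
    return text.encode(local_charset, errors="replace")

spanish = "cn=Jos\u00e9,c=ES".encode("utf-8")
russian = "cn=\u0416\u0443\u043a,c=RU".encode("utf-8")

exact = translate_dn(spanish, "latin-1", DnMode.REVERSIBLE)
rough = translate_dn(russian, "latin-1", DnMode.APPROXIMATE)
```

Calling translate_dn(russian, "latin-1", DnMode.REVERSIBLE) raises
UnicodeEncodeError, which is the point: the caller finds out that the DN
cannot be round-tripped instead of silently getting a damaged one.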

>>  * more generally: which attributes and/or syntaxes to translate,
> Yes.  According to the specs, syntaxes have defined representations
> and this *should* be the right method at the server.  The client is
> going to need to know about the schema somehow to do this.

I was thinking the opposite way: The client may want to tell the library
to (not) auto-translate certain attributes.  Maybe that would simply
mean overriding the syntax of some of the attributes (e.g. set an
attribute's syntax to "bin" to avoid auto-translation).
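
A rough sketch of that override idea, under assumptions of my own: the class
and method names are hypothetical, the default syntax table is a made-up
example rather than a real schema, and unknown attributes are treated as
"bin" so that nothing is translated by accident.

```python
# Example defaults only; a real client would take these from the schema.
DEFAULT_SYNTAX = {
    "cn": "text",
    "description": "text",
    "jpegPhoto": "bin",
    "userCertificate": "bin",
}

class TranslationConfig:
    """Per-attribute syntax table with client-side overrides (hypothetical)."""

    def __init__(self, defaults):
        # Attribute names are case-insensitive in LDAP, so normalize them.
        self._defaults = {name.lower(): syn for name, syn in defaults.items()}
        self._overrides = {}

    def set_syntax(self, attr, syntax):
        """Override the syntax of one attribute, e.g. to "bin"."""
        self._overrides[attr.lower()] = syntax

    def should_translate(self, attr):
        """True unless the effective syntax of attr is "bin"."""
        key = attr.lower()
        syntax = self._overrides.get(key, self._defaults.get(key, "bin"))
        return syntax != "bin"

config = TranslationConfig(DEFAULT_SYNTAX)
before = config.should_translate("description")
config.set_syntax("description", "bin")  # client opts out for this attribute
after = config.should_translate("description")
```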