Re: UTF-8 support, take 2



Sorry for the delay in answering, I have been in deep hack mode for
a related, but different, endeavour.

Hallvard B Furuseth wrote:
> 
> Some info and suggestions.  (I don't propose that you try to add all
> the suggestions at first - start with a smaller implementation and
> just keep the possible extensions in mind, and let's add the more
> fancy stuff later.)

Right.

> I'll be glad to test an UTF-8 ldapd a bit.

Good.

> BTW, try to use macros and #ifdefs for the "internal charset" choice.
> For servers with Non-European character sets, it may be a speedup to
> use another encoding like UCS-4 - at least for some features, like
> case-insensitive indexes.  That might also make it easy to fix the
> code to handle "local charset in the server" - we just #ifdef out
> the charset handling, and fix case-insensitive indexing.  Or
> something like that.

Well, let's see how it develops.

> >       - The library (libldap) can also do translation, but only
> >         T.61 or UTF-8 are supported on wire
> 
> Then I'll be adding "local charset on wire" support, if it's easy
> enough.

I did not make myself clear.  You have "local charset on wire" if you
set up a server that provides that charset (*this* is planned) and
disable translation on the client (or the client just doesn't support
it).  The client then works in the same charset as the one transmitted
on wire.

I understand this is what your legacy clients need.

What is not planned to be supported is having, say, ISO 8859-5 on wire
and ISO 8859-1 at the API provided by the library.  If you want
charset translation on the client side, you talk T.61 or UTF-8 on wire.

The rationale behind this is that you have a core working on Unicode,
and all translations will use that representation as an intermediate
step.  Character description tables are kept in terms of Unicode, and
that is the level at which "smarts" can be implemented.
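
To make the pivot idea concrete, here is a minimal sketch for
single-byte charsets.  Every name in it (ucs4_t, decode_fn, encode_fn,
translate_char, translate_via_unicode) is hypothetical and made up for
illustration; none of it is existing libldap code.  The point is only
that each charset needs one mapping to Unicode and one mapping from
Unicode, instead of one table per pair of charsets.

```c
#include <stddef.h>

/* Hypothetical types: each charset supplies one decoder and one
 * encoder, both expressed in terms of Unicode code points. */
typedef unsigned long ucs4_t;
typedef ucs4_t (*decode_fn)(unsigned char c);
typedef int (*encode_fn)(ucs4_t u);     /* returns -1 if unmappable */

/* ISO 8859-1 maps one-to-one onto the first 256 code points. */
ucs4_t latin1_to_unicode(unsigned char c) { return (ucs4_t)c; }
int unicode_to_latin1(ucs4_t u) { return u <= 0xFF ? (int)u : -1; }

/* ASCII only covers the first 128 code points. */
int unicode_to_ascii(ucs4_t u) { return u < 0x80 ? (int)u : -1; }

/* Translate one character, using Unicode as the intermediate step. */
int translate_char(unsigned char c, decode_fn decode, encode_fn encode)
{
    return encode(decode(c));
}

/* Translate n bytes from in to out.  Characters with no mapping in
 * the target charset become '?'.  Returns how many were lost. */
size_t translate_via_unicode(const unsigned char *in, size_t n,
                             unsigned char *out,
                             decode_fn decode, encode_fn encode)
{
    size_t i, lost = 0;

    for (i = 0; i < n; i++) {
        int c = translate_char(in[i], decode, encode);
        if (c < 0) {
            c = '?';
            lost++;
        }
        out[i] = (unsigned char)c;
    }
    return lost;
}
```

Multi-byte encodings such as UTF-8 or T.61 with its non-spacing
accents would need a decoder that consumes a variable number of bytes,
which this sketch leaves out.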

In that scenario, Unicode/UTF-8 is the favored representation and the
one that gets the least performance hit.  This makes sense since it
is expected to be the most used representation in the future.  T.61
is supported for backwards compatibility with V2 (and is needed in
ldapd).  Local charset on wire translation is supported on the server
side to cater for the needs of legacy clients during a possibly endless
transitional period, but new clients linked with the new library do
not need anything else.

Anyway, let's see how it develops.  It might end up being trivial to
go for the whole nine yards once the basic pieces are in place.

> You won't need to load in *all* of Unicode.  I think you'll need
> 
>  * upper<->lower and accented->ascii for quite a number of characters,

The requirements increase depending on the charsets supported.

The first four pages of Unicode (1024 positions) plus a few sparse
characters cover the complete ISO 8859 Latin set.  And that includes
all European languages (and a few others) written with the Latin
alphabet.  Adding three more pages adds support for Cyrillic character
sets plus Arabic and Hebrew.
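
As a rough illustration of what working in "pages" could mean in code,
here is a sketch of a two-level lower-casing table in which only the
pages actually needed are present.  The names (lower_pages, page0,
load_page0, ucs_tolower) are invented for this example, and only page 0
is filled in; a real table would be generated from the Unicode
character data.

```c
#include <stddef.h>

#define PAGE_SIZE 256           /* code points per page */
#define NUM_PAGES 256           /* pages in the 16-bit range */

/* One slot per page; NULL means "no table loaded for this page". */
static const unsigned short *lower_pages[NUM_PAGES];
static unsigned short page0[PAGE_SIZE];

/* Fill in page 0 (ASCII plus the ISO 8859-1 upper half). */
static void load_page0(void)
{
    unsigned int i;

    for (i = 0; i < PAGE_SIZE; i++)
        page0[i] = (unsigned short)i;           /* default: unchanged */
    for (i = 'A'; i <= 'Z'; i++)
        page0[i] = (unsigned short)(i + 0x20);
    for (i = 0xC0; i <= 0xDE; i++)
        if (i != 0xD7)                          /* multiplication sign */
            page0[i] = (unsigned short)(i + 0x20);
    lower_pages[0] = page0;
}

/* Lower-case a code point.  Anything on a page that is not loaded
 * is returned unchanged. */
unsigned long ucs_tolower(unsigned long cp)
{
    const unsigned short *page;

    if (lower_pages[0] == NULL)
        load_page0();
    if (cp >= (unsigned long)NUM_PAGES * PAGE_SIZE)
        return cp;
    page = lower_pages[cp >> 8];
    return page != NULL ? page[cp & 0xFF] : cp;
}
```

With this layout a Latin-only server pays for four pages of tables,
and each additional script costs one more page pointer and its table.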

Then come the CJK languages, which have huge requirements anyway.

>  * T.61<->unicode for the few characters in T.61 (that is, "few"
>    compared to unicode),
>  * local charset <-> unicode (usually for <200 characters).
>    (Of course, you'll have to load that from disk unless the local
>    charset is set at compile time.)

These two are relatively easy in the forward direction.  The reverse
direction requires sparse tables, however.  A good design of the
translation tables is critical.  I tried hash tables but could not
find efficient hash functions that don't make a mess of the tables
with collisions when tried with real data extracted from the charset
tables.  Not that the results are unusable, but I'd rather have
something that can be done with few multiplications and divisions,
since the hash function has to be evaluated many times, once per
char in some cases.
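
One alternative to hashing, given here only as a sketch and not as
something I have measured, is to keep the reverse table sorted by
Unicode value and search it by bisection.  That needs only comparisons
and a halving per step.  The names (rev_entry, rev_lookup,
unicode_to_latin2) are hypothetical, and the table is a five-entry
excerpt of ISO 8859-2, not the full charset.

```c
#include <stddef.h>

typedef unsigned short ucs2_t;

/* One reverse-mapping entry: Unicode code point -> charset byte. */
struct rev_entry {
    ucs2_t ucs;
    unsigned char byte;
};

/* Excerpt of ISO 8859-2, sorted by code point.  Illustration only:
 * the real table has an entry for every position above 0xA0. */
static const struct rev_entry latin2_rev[] = {
    { 0x0104, 0xA1 },   /* LATIN CAPITAL LETTER A WITH OGONEK */
    { 0x0105, 0xB1 },   /* LATIN SMALL LETTER A WITH OGONEK */
    { 0x0141, 0xA3 },   /* LATIN CAPITAL LETTER L WITH STROKE */
    { 0x0142, 0xB3 },   /* LATIN SMALL LETTER L WITH STROKE */
    { 0x02D8, 0xA2 }    /* BREVE */
};
#define LATIN2_REV_LEN (sizeof latin2_rev / sizeof latin2_rev[0])

/* Binary search over a sorted table; returns the byte or -1. */
int rev_lookup(const struct rev_entry *tab, size_t len, ucs2_t ucs)
{
    size_t lo = 0, hi = len;            /* search window is [lo, hi) */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tab[mid].ucs == ucs)
            return tab[mid].byte;
        if (tab[mid].ucs < ucs)
            lo = mid + 1;
        else
            hi = mid;
    }
    return -1;
}

/* Positions 0x00-0xA0 of ISO 8859-2 coincide with Unicode, so only
 * the rest needs the sparse table. */
int unicode_to_latin2(ucs2_t ucs)
{
    if (ucs <= 0xA0)
        return (int)ucs;
    return rev_lookup(latin2_rev, LATIN2_REV_LEN, ucs);
}
```

For a single-byte charset the table has at most 128 entries, so a
lookup takes at most eight comparisons.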

> Not really.  Most data will be translatable to latin-1, since that's
> what most of those who put data in the directory can handle.

For the time being, it is.  But then we are not alone on the planet.

> Note that allowing and translating malformed sequences can open
> security holes at times, in particular with UTF-8.  See Security
> Considerations in rfc2279.

I'll have a look at it.
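
As a note on what such a check might involve: the main trap is
"overlong" forms, where a character is encoded in more bytes than
necessary (for example C0 AF instead of 2F for "/") and so slips past
a filter that only looks for the short form.  The sketch below is a
strict decoder that rejects them.  The function name and interface are
made up for this example, and it only handles sequences of up to four
bytes; the five and six byte forms that RFC 2279 also defines are
simply rejected here.

```c
#include <stddef.h>

/* Decode the first UTF-8 sequence in s (at most n bytes).
 * Returns the code point, or -1 if the sequence is malformed.
 * If used is not NULL, it receives the number of bytes consumed. */
long utf8_decode_strict(const char *s, size_t n, size_t *used)
{
    /* Smallest code point that legitimately needs len bytes. */
    static const unsigned long min_value[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    const unsigned char *p = (const unsigned char *)s;
    unsigned long cp;
    size_t len, i;

    if (n == 0)
        return -1;
    if (p[0] < 0x80) {
        len = 1;
        cp = p[0];
    } else if ((p[0] & 0xE0) == 0xC0) {
        len = 2;
        cp = p[0] & 0x1F;
    } else if ((p[0] & 0xF0) == 0xE0) {
        len = 3;
        cp = p[0] & 0x0F;
    } else if ((p[0] & 0xF8) == 0xF0) {
        len = 4;
        cp = p[0] & 0x07;
    } else {
        return -1;      /* stray continuation byte or longer form */
    }

    if (n < len)
        return -1;      /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return -1;  /* not a continuation byte */
        cp = (cp << 6) | (unsigned long)(p[i] & 0x3F);
    }
    if (cp < min_value[len])
        return -1;      /* overlong encoding */

    if (used != NULL)
        *used = len;
    return (long)cp;
}
```

Rejecting the input outright, rather than translating it to the
"obvious" character, is what keeps later checks on the decoded data
meaningful.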

> We may want to specify in which cases translation is done in the client:
>  * whether or not to translate attributes with DN syntax,

Why? Can you explain?  In any case, DNs in V3 are UTF-8 by definition.

>  * more generally: which attributes and/or syntaxes to translate,

Yes.  According to the specs, syntaxes have defined representations
and this *should* be the right method at the server.  The client is
going to need to know about the schema somehow to do this.

In any case, the set of syntaxes known by the Umich and OpenLDAP
servers (or Netscape's for that matter) is very small.  For instance,
an attribute with syntax 'Directory String' is UTF-8, while another
with syntax 'IA5 String' is pure ASCII, but both have to be defined
as either "ces" or "cis".  So I think all syntaxes except
"bin" (i.e. "ces", "cis", "tel" and "dn") should be translated.
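
In code, the rule proposed above would amount to something like the
following sketch.  The function name is hypothetical, and treating an
unknown syntax name like "bin" (that is, leaving it untranslated) is my
assumption, not something decided in this thread.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if values of the given syntax should go through charset
 * translation, 0 if they must be passed through untouched. */
int syntax_needs_translation(const char *syntax)
{
    static const char *const textual[] = { "ces", "cis", "tel", "dn" };
    size_t i;

    for (i = 0; i < sizeof textual / sizeof textual[0]; i++)
        if (strcmp(syntax, textual[i]) == 0)
            return 1;
    return 0;   /* "bin" and anything unrecognized */
}
```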

> Not part of T.61, just an (almost) reversible translation which umich
> ldap invented.

I have found a lot of interesting information in a paper by Enrique
Silvestre Mora (the author of the translation code in charset.c).
The URL given in charset.c is no longer valid, but I found it in:

	ftp://ftp.uji.es/pub/unix/ldap/iso-t61.translation.tar.gz

It explains a lot of the apparently arbitrary decisions that happen
here and there in the code.  He claims everything is based on RFC
1345, so we come back to it.  I think that any additional
translations should be taken from it.

Regards,

Julio