Re: UTF-8 support, take 2



Some info and suggestions.  (I don't propose that you try to add all the
suggestions at first - start with a smaller implementation and just keep
the possible extensions in mind, and let's add the more fancy stuff
later.)


Julio Sanchez Fernandez writes:
> I am working towards the following scenario:
> 
> 	- The core in the LDAP servers (slapd and ldapd, but I cannot do
> 	  any testing of the latter) works in Unicode, UTF-8 most likely.

I'll be glad to test a UTF-8 ldapd a bit.

BTW, try to use macros and #ifdefs for the "internal charset" choice.
For servers with non-European character sets, it may be faster to use
another encoding like UCS-4 - at least for some features, like
case-insensitive indexes.  That might also make it easy to fix the code
to handle "local charset in the server" - we just #ifdef out the charset
handling, and fix case-insensitive indexing.  Or something like that.

> 	- These servers know enough of Unicode to do uppercasing for
> 	  case insensitive searches and indexing

Good!

>         and reduction of
> 	  characters to their basic form (say 'a' from 'a acute') for
> 	  approximate searching.

I'm in love.

>         I know this is not right for all
> 	  languages, but it is better than what we have now (...)

Absolutely right.  Currently, we have to store both ASCIIfied and true
versions of accented letters under c=NO in X.500, just so foreigners and
7-bit users can search properly.
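
A minimal sketch of the two reductions discussed above (uppercasing and
reduction to the basic letter), working on Unicode code points.  The
table is a tiny hand-written sample covering a few Latin-1 letters; a
real one would be generated from the Unicode data files.  The names
fold_entry, to_base_form, to_upper_latin1 and approx_key are
hypothetical:

```c
#include <stdint.h>
#include <stdlib.h>

struct fold_entry { uint32_t from; uint32_t to; };

/* Illustrative sample only, sorted by `from` so bsearch() works. */
static const struct fold_entry base_form[] = {
    { 0x00C0, 'A' },  /* A grave */
    { 0x00C1, 'A' },  /* A acute */
    { 0x00C9, 'E' },  /* E acute */
    { 0x00D1, 'N' },  /* N tilde */
    { 0x00E0, 'a' },  /* a grave */
    { 0x00E1, 'a' },  /* a acute */
    { 0x00E9, 'e' },  /* e acute */
    { 0x00F1, 'n' },  /* n tilde */
};

static int fold_cmp(const void *key, const void *elem)
{
    uint32_t k = *(const uint32_t *)key;
    uint32_t e = ((const struct fold_entry *)elem)->from;
    return (k > e) - (k < e);
}

/* Reduce an accented letter to its basic form; unknown code points
 * are returned unchanged. */
static uint32_t to_base_form(uint32_t cp)
{
    const struct fold_entry *e =
        bsearch(&cp, base_form, sizeof base_form / sizeof base_form[0],
                sizeof base_form[0], fold_cmp);
    return e ? e->to : cp;
}

/* Uppercase ASCII and Latin-1 letters only.  0xF7 is the division
 * sign, and 0xDF / 0xFF have no uppercase inside Latin-1. */
static uint32_t to_upper_latin1(uint32_t cp)
{
    if (cp >= 'a' && cp <= 'z')
        return cp - 0x20;
    if (cp >= 0xE0 && cp <= 0xFE && cp != 0xF7)
        return cp - 0x20;
    return cp;
}

/* Key for approximate, case-insensitive comparison. */
static uint32_t approx_key(uint32_t cp)
{
    return to_upper_latin1(to_base_form(cp));
}
```

Here approx_key(0x00E1) and approx_key('a') both give 'A', so "a acute"
and plain "a" compare equal in an approximate search, while a letter
missing from the sample table (such as 0x00F8) is only uppercased.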

> Consider this a start.

Try to code it so the matching algorithm can later be extended.  Later,
maybe someone will want to add something like Unicode "level-4 matching"
or whatever it's called.  That gives fairly accurate semantic matching,
like Swedish 'ö' == Norwegian 'ø', without matching too much else.
(That's hairy, don't spend time on implementing that just yet...)

> 	- The library (libldap) can also do translation, but only
> 	  T.61 or UTF-8 are supported on wire

Then I'll be adding "local charset on wire" support, if it's easy
enough.

> 	- During compilation, the supported charsets may be specified
> 	  so that unneeded charsets do not increase unnecessarily the
> 	  memory used.  Another possible approach is to load tables
> 	  from disk as needed.  It seems Mozilla does that.  Anyway,
> 	  Unicode is big and it may be necessary to have a big chunk of it
> 	  accessible somehow.

You won't need to load in *all* of Unicode.  I think you'll need

 * upper<->lower and accented->ascii for quite a number of characters,
 * T.61<->unicode for the few characters in T.61 (that is, "few"
   compared to unicode),
 * local charset <-> unicode (usually for <200 characters).
   (Of course, you'll have to load that from disk unless the local
   charset is set at compile time.)

> Note
> that even if an attempt is made to translate non-representable codes to
> some form, Unicode is so large that the possibility of finding something
> that is not understood is significant.

Not really.  Most data will be translatable to latin-1, since that's
what most of those who put data in the directory can handle.

> This may include the case of malformed T.61 or UTF-8 sequences.

Note that allowing and translating malformed sequences can open security
holes at times, in particular with UTF-8.  See Security Considerations
in rfc2279.
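
To illustrate the kind of check meant here, a minimal sketch of a strict
decoder that rejects malformed and "overlong" sequences (for example
"\xC0\xAF", which a lax decoder would turn into '/').  The function name
utf8_decode_strict is hypothetical, and as a simplifying assumption it
accepts only sequences of up to 4 bytes, which is stricter than the
6-byte forms rfc2279 defines:

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 character from s (at most n bytes available).
 * Returns the number of bytes consumed, or -1 if the sequence is
 * truncated, malformed, or not the shortest possible encoding. */
static int utf8_decode_strict(const unsigned char *s, size_t n,
                              uint32_t *out)
{
    /* Smallest code point that legitimately needs `len` bytes. */
    static const uint32_t min_value[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    uint32_t cp;
    int len, i;

    if (n == 0)
        return -1;
    if (s[0] < 0x80)                { cp = s[0];        len = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; len = 4; }
    else
        return -1;      /* stray continuation byte or 5/6-byte lead */

    if ((size_t)len > n)
        return -1;      /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return -1;  /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    if (cp < min_value[len])
        return -1;      /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return -1;      /* surrogate code point */
    if (cp > 0x10FFFF)
        return -1;
    *out = cp;
    return len;
}
```

The important design choice is to return an error rather than guess:
any filter that ran on the raw bytes before decoding never saw the '/'
hidden in "\xC0\xAF".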

> So, for the library, there are four options:  (...)
> 
> 	- Translate non-mappable chars

...to several possible translations:

* A similar character in some charset (local charset or ASCII)
* reversible translation: Local charset if possible,
  something like {hex} otherwise.
* as above, use a semi-readable reversible translation like
  {a'} for a-acute - and {hex} if there is no useful readable
  variant.  Could use the menonic table in rfc1345 for this.
  (Note that it defines some 3-character mnemonics, even
  though it claims to only define 2-character mnemonics.)
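
A minimal sketch of the second option above (local charset if possible,
{hex} otherwise), assuming Latin-1 as the local charset.  The exact
"{XXXX}" format and the names encode_reversible and decode_reversible
are assumptions made for illustration; they are not what charset.c
does.  A literal '{' is escaped as well, so that the translation stays
reversible.  The rfc1345 mnemonic variant is not covered:

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Write code points as Latin-1 bytes; anything above U+00FF, NUL, and
 * a literal '{' become "{HEX}".  Returns the length written (without
 * the terminating NUL), or -1 if `out` is too small. */
static int encode_reversible(const uint32_t *in, size_t n,
                             char *out, size_t outsz)
{
    size_t i, used = 0;

    for (i = 0; i < n; i++) {
        if (in[i] <= 0xFF && in[i] != '{' && in[i] != 0) {
            if (used + 1 >= outsz)
                return -1;
            out[used++] = (char)(unsigned char)in[i];
        } else {
            int w = snprintf(out + used, outsz - used, "{%04lX}",
                             (unsigned long)in[i]);
            if (w < 0 || (size_t)w >= outsz - used)
                return -1;
            used += (size_t)w;
        }
    }
    if (used >= outsz)
        return -1;
    out[used] = '\0';
    return (int)used;
}

/* Inverse of encode_reversible().  Returns the number of code points
 * written, or -1 on a malformed escape or if `out` is too small. */
static int decode_reversible(const char *in, uint32_t *out, size_t outmax)
{
    size_t used = 0;

    while (*in) {
        uint32_t cp = 0;
        if (*in == '{') {
            int digits = 0;
            in++;
            while (isxdigit((unsigned char)*in) && digits < 8) {
                int c = (unsigned char)*in++;
                cp = cp * 16 + (uint32_t)(isdigit(c) ? c - '0'
                                          : toupper(c) - 'A' + 10);
                digits++;
            }
            if (digits == 0 || *in != '}')
                return -1;      /* malformed escape */
            in++;
        } else {
            cp = (unsigned char)*in++;
        }
        if (used >= outmax)
            return -1;
        out[used++] = cp;
    }
    return (int)used;
}
```

For example, the code points 'a', U+00E1, U+03B1, '{' encode to the
Latin-1 bytes 'a', 0xE1 followed by "{03B1}{007B}", and decoding that
string gives back the original four code points.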

Keld Simonsen has written a library "chset" to translate to/from
character sets and to handle rfc1345 mnemonics.  I've temporarily put an
old version at <URL:http://www.uio.no/~hbf/chset23b.tar.gz>.  I don't
remember where the original was, ask Keld.Simonsen@dkuug.dk if you want
to check for a newer version.  



Other points:

We may want to specify in which cases translation is done in the client:
 * whether or not to translate attributes with DN syntax,
 * more generally: which attributes and/or syntaxes to translate,
 * we may want different options for different operations:
   For example, accept approximate translation in LDAP output and
   in input to search operations, but not in input to write operations.

We may want to add an option to turn errors on/off when an attribute has
incorrect encoding (in practice, an error if the client is wrong about
which charset is sent "over the wire" :-)

Sysadmins may want to change the displayed metacharacters ('{' and
'}') used for reversible translation, because '{' and '}' are sometimes
used as national characters.  I did this once, but the current charset.c
became such a mess that I removed it.  Just keep in mind that someone
might re-insert such code some day.


> Now, a request.  I need some info on T.61.

See "T.61-8bit" in rfc1345.

However, I suggest you just grab the stuff libraries/libldap/charset.c
supports; I doubt much else of T.61 is in use in real life.  Not enough
to be worth spending time on until the rest of all this works, anyway.

Note that we'll probably want "quipu T.61" or an option for it.  Most
T.61 data in X.500 is in quipu.  Quipu uses a somewhat incorrect
implementation of a subset of T.61.

> 	- Clarification on the meaning of code points 0xA6 and 0xA8

0xA6 = the *text* character '#'.
0xA8 = Currency sign, usually '$'.
       However, these are used as delimiters in some attributes,
       and delimiters should *not* be translated to/from #/$.

quipu doesn't translate 7-bit characters in text to T.61.

0xA6 and so on in quipu means these latin-1 characters instead of their
T.61 values.  (Some other characters that do not exist in T.61 also
become latin-1 in "quipu T.61".)

> 	- What are code points 0xD8 to 0xDB and 0xE5?

They aren't.

> 	- What diacritics are represented by 0xC0, 0xC9 and 0xCC?

C0: Unused.  Quipu translates it as "empty accent": "\xC0o" => "o".
C9: Umlaut - but obsolete, one should use 0xC8.  Quipu supports it.
CC: Non-spacing underline.  Can precede any T.61 character.
    If used with an accented letter, it should precede the diacritics
    byte.  I don't know who supports \xCC; quipu doesn't.
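
A minimal sketch of how a decoder might treat just these three bytes
(plus 0xC8), turning T.61 "mark byte(s), then letter" into Unicode
"letter, then combining mark(s)".  It covers only 7-bit base characters
and the bytes discussed here; every other 8-bit byte is reported as an
error.  The names t61_mark and t61_decode and the quipu_compat flag are
hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

#define T61_NOT_A_MARK (-1L)

/* Map a T.61 non-spacing byte to a Unicode combining character.
 * Returns 0 for "no visible mark" and T61_NOT_A_MARK otherwise. */
static long t61_mark(unsigned char b, int quipu_compat)
{
    switch (b) {
    case 0xC0:  /* unused in T.61; quipu treats it as an empty accent */
        return quipu_compat ? 0 : T61_NOT_A_MARK;
    case 0xC8:  /* umlaut / diaeresis */
    case 0xC9:  /* obsolete umlaut, treated like 0xC8 */
        return 0x0308;          /* COMBINING DIAERESIS */
    case 0xCC:  /* non-spacing underline */
        return 0x0332;          /* COMBINING LOW LINE */
    default:
        return T61_NOT_A_MARK;
    }
}

/* Decode n bytes of (a small subset of) T.61 into code points.
 * Returns the number of code points written, or -1 on error. */
static int t61_decode(const unsigned char *in, size_t n, int quipu_compat,
                      uint32_t *out, size_t outmax)
{
    size_t i = 0, used = 0;

    while (i < n) {
        uint32_t marks[2];
        size_t nmarks = 0, k;

        /* Prefix bytes: optional underline, then optional diacritic. */
        while (i < n && in[i] >= 0x80) {
            long m = t61_mark(in[i], quipu_compat);
            if (m == T61_NOT_A_MARK || nmarks == 2)
                return -1;
            if (m != 0)
                marks[nmarks++] = (uint32_t)m;
            i++;
        }
        if (i >= n)
            return -1;          /* mark byte with no base character */
        if (used + 1 + nmarks > outmax)
            return -1;
        out[used++] = in[i++];  /* 7-bit base character */
        for (k = 0; k < nmarks; k++)
            out[used++] = marks[k];   /* marks in the order read */
    }
    return (int)used;
}
```

With quipu_compat set, "\xC0o" decodes to plain 'o'; without it the same
input is rejected.  "\xC9u" and "\xC8u" both give 'u' followed by
U+0308.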

> 	- The list of known digraphs (the {xy} forms)

Not part of T.61, just an (almost) reversible translation which umich
ldap invented.

> 	- Any useful Web-accessible resource

rfc1345.

"quipu T.61" is in dsap/common/string.c in
<URL:ftp://ftp.uninett.no/pub/isode/isode-8.ps.Z>.

-- 
Hallvard