
Re: Multilanguage problem



Hi Oren,

On Sunday 26 December 2004 16:42, Oren Shochat wrote:
> Can't seem to add an inetOrgPerson instant to web service with any of the
> French chars  (Found in the Extended ASCII table): ?, ?, ?, ?, ?, ?, ?, ?,
> ?, ?,
>
> My client is written in c++ and uses Netscape free LDAP SDK for C
> programmers.
>
> My Server is OpenLDAP  win32. Whenever I try to add givenname (Unicode
> attribute) with French letters I get Error 21 (Invalid syntax - probably
> wrong ascii chars).

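French characters with accents are non-ASCII characters. LDAPv3 expects
string attribute values such as givenName in UTF-8, so you need to convert
them to UTF-8 before passing them to the SDK.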
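
As a rough illustration of that conversion, here is a minimal C++ sketch. It assumes the client currently holds its strings in ISO-8859-1 (Latin-1), which is only a guess about your setup; the function name latin1_to_utf8 is made up for this example and is not part of any SDK. If the strings come from another code page, a proper conversion table or library is needed instead.

```cpp
#include <string>

// Convert an ISO-8859-1 (Latin-1) byte string to UTF-8.
// Assumption: the input really is Latin-1, where each byte value equals
// the Unicode code point of the character it represents.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);  // worst case: every byte becomes two
    for (unsigned char c : in) {
        if (c < 0x80) {
            // 7-bit ASCII is copied unchanged.
            out.push_back(static_cast<char>(c));
        } else {
            // 0x80..0xff become a two-byte sequence: 110xxxxx 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

With this, the Latin-1 byte 0xe9 (e with acute accent) becomes the two bytes 0xc3 0xa9, and the converted string would be the value placed in the LDAPMod that is handed to the add call.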

Here are a few excerpts from the Unix UTF-8 man page:
PROPERTIES
       The UTF-8 encoding has the following nice properties:

       * UCS characters 0x00000000 to 0x0000007f (the classic US-
         ASCII  characters)  are  encoded simply as bytes 0x00 to
         0x7f (ASCII compatibility). This means  that  files  and
         strings  which  contain only 7-bit ASCII characters have
         the same encoding under both ASCII and UTF-8.

       * All UCS characters > 0x7f are encoded  as  a  multi-byte
         sequence  consisting  only of bytes in the range 0x80 to
         0xfd, so no ASCII byte can appear  as  part  of  another
         character  and  there  are no problems with e.g. '\0' or '/'.

       * The lexicographic sorting order of UCS-4 strings is preserved.

       * All  possible 2^31 UCS codes can be encoded using UTF-8.

       * The bytes 0xfe and 0xff are  never  used  in  the  UTF-8
         encoding.

       * The first byte of a multi-byte sequence which represents
         a single non-ASCII UCS character is always in the  range
         0xc0  to  0xfd  and  indicates  how long this multi-byte
         sequence is. All further bytes in a multi-byte  sequence
         are in the range 0x80 to 0xbf. This allows easy
         resynchronization and makes the encoding stateless and robust
         against missing bytes.

       * UTF-8  encoded  UCS  characters  may  be up to six bytes
         long, however the Unicode standard specifies no
         characters above 0x10ffff, so Unicode characters can only be
         up to four bytes long in UTF-8.
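
The lead-byte and continuation-byte properties above can be turned into a quick sanity check on a string before it is sent to the server. The following C++ sketch is only that, a sketch: the names utf8_sequence_length and looks_like_utf8 are invented for this example, and the check is purely structural (it does not reject overlong forms, surrogates, or code points above 0x10ffff).

```cpp
#include <cstddef>
#include <string>

// Length of the sequence announced by a first byte, following the ranges
// quoted above. Returns 0 for a continuation byte (0x80..0xbf) and for
// 0xfe/0xff, which never occur in UTF-8.
std::size_t utf8_sequence_length(unsigned char b) {
    if (b < 0x80) return 1;  // ASCII
    if (b < 0xC0) return 0;  // continuation byte, not a valid first byte
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF8) return 4;
    if (b < 0xFC) return 5;  // 5 and 6 byte forms are outside Unicode
    if (b < 0xFE) return 6;
    return 0;                // 0xfe and 0xff
}

// Structural check only: every first byte must be followed by the
// announced number of continuation bytes.
bool looks_like_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        std::size_t len =
            utf8_sequence_length(static_cast<unsigned char>(s[i]));
        if (len == 0 || i + len > s.size()) return false;
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;
        }
        i += len;
    }
    return true;
}
```

A Latin-1 string containing a bare 0xe9 byte fails this check, because 0xe9 announces a three-byte sequence that never follows. That is the kind of input that is likely to trigger the "Invalid syntax" error.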

ENCODING
       The following byte sequences are used to represent a
       character. The sequence to be used depends on the UCS code
       number of the character:

       0x00000000 - 0x0000007F:
           0xxxxxxx

       0x00000080 - 0x000007FF:
           110xxxxx 10xxxxxx

       0x00000800 - 0x0000FFFF:
           1110xxxx 10xxxxxx 10xxxxxx

       0x00010000 - 0x001FFFFF:
           11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

       0x00200000 - 0x03FFFFFF:
           111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

       0x04000000 - 0x7FFFFFFF:
           1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

       The xxx bit positions are filled  with  the  bits  of  the
       character  code  number in binary representation. Only the
       shortest possible multi-byte sequence which can  represent
       the code number of the character can be used.

       The  UCS  code values 0xd800-0xdfff (UTF-16 surrogates) as
       well as 0xfffe and 0xffff (UCS non-characters) should  not
       appear in conforming UTF-8 streams.
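
The table above translates fairly directly into code. Below is a minimal C++ sketch of an encoder for a single code point; the name encode_utf8 and the convention of returning an empty string on error are assumptions made for this example. It deliberately stops at the four-byte form, since Unicode defines nothing above 0x10ffff, and it refuses the values the man page says should not appear in a conforming stream.

```cpp
#include <cstdint>
#include <string>

// Encode one code point as UTF-8, choosing the shortest form from the table.
// Returns an empty string for surrogates, 0xfffe/0xffff, and values above
// 0x10ffff (an error convention chosen for this sketch).
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // UTF-16 surrogates
    if (cp == 0xFFFE || cp == 0xFFFF) return out;  // non-characters
    if (cp > 0x10FFFF) return out;                 // outside Unicode

    if (cp < 0x80) {
        // 0xxxxxxx
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        // 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        // 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

For the case in question, encode_utf8(0xe9) (e with acute accent) yields the two bytes 0xc3 0xa9.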

EXAMPLES
       The  Unicode  character  0xa9  =  1010 1001 (the copyright
       sign) is encoded in UTF-8 as
              11000010 10101001 = 0xc2 0xa9

       and character 0x2260 =  0010  0010  0110  0000  (the  "not
       equal" symbol) is encoded as:
              11100010 10001001 10100000 = 0xe2 0x89 0xa0


Peter

-- 
Peter Marschall
eMail: peter@adpm.de