
Re: [LDAP] Using foreign charsets / adding entries base64-encoded




Andreas Kotes wrote:

> Hi!
>
> When I add entries containing umlauts the server checks them in fine, and
> ldapsearch gets them fine, too. But when I try to look up the records with
> Netscape I get a "?" for each umlaut, rendering the result unusable.
>
> How can I tell the server to use latin1-charset, or how can I enter
> base64-encoded data, especially using the Net::LDAPapi module for Perl?
> (I think this should work the same way from C) ... I already had a look at
> draft-good-ldap-ldif-00.txt and this helped somehow, but I'm quite unsure
> if this would make any difference for Netscape ... ?
>
>    the Count
>
> --
> Andreas Kotes - mailto:count@linux.de - If you need any help, just ask!
>     -= "Free speech not only lives, it rocks!" --Oprah Winfrey -=-
> -= Commercial use of my email adress NOT allowed. PGP key available. =-

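Sure, if you store Latin-1 you get Latin-1 back, but as you have seen, the
problem is interoperability: a client that expects UTF-8, as Netscape
apparently does, cannot display raw Latin-1 bytes.

You should convert your data to UTF-8 before you load it with ldapadd (or
ldapmodify). Converting it to base64 alone wouldn't help you much, because
base64 only wraps the same bytes for transport and does not change the
character set.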

I produced the LDIF file with a Perl script and then piped it through the
following program:

=================================================================
/* Read Latin-1 (ISO-8859-1) characters from stdin, convert them
   to UTF-8, and write the converted characters to stdout.
   UTF-8 is defined by RFC 2044.
*/
#include <errno.h>
#include <stdio.h>

int
main (int argc, char** argv)
{
    register int c;
    while ((c = getchar()) != EOF) {
        if ((c & 0x80) == 0) {
            putchar (c);
        } else {
            putchar (0xC0 | (0x03 & (c >> 6)));
            putchar (0x80 | (0x3F & c));
        }
    }
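    if (ferror (stdin)) {
        /* ferror() only reports that an error occurred; it does not
           return an errno value, so errno is left as the failed read
           set it. */
        perror (argv[0]);
        return 1;
    }
    return 0;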
}

=================================================================
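As a small worked example of the same bit manipulation, here is a sketch
that converts a string in memory instead of a stream. The function name
latin1_to_utf8 is hypothetical and not part of the program above. For
"ü" (0xFC in Latin-1) the lead byte is 0xC0 | (0xFC >> 6) = 0xC3 and the
trail byte is 0x80 | (0xFC & 0x3F) = 0xBC, so "Müller" comes out as the
bytes 4D C3 BC 6C 6C 65 72.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from the original post): convert a
   NUL-terminated Latin-1 string to UTF-8 in a newly malloc'd buffer.
   Returns NULL if the allocation fails; the caller frees the result. */
char *
latin1_to_utf8 (const char *src)
{
    char *dst;
    size_t n = 0;

    assert (src != NULL);
    /* Worst case: every input byte expands to two output bytes. */
    dst = malloc (2 * strlen (src) + 1);
    if (dst == NULL)
        return NULL;

    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char) *src;
        if (c < 0x80) {
            dst[n++] = (char) c;                    /* ASCII: unchanged */
        } else {
            dst[n++] = (char) (0xC0 | (c >> 6));    /* lead byte  110000xx */
            dst[n++] = (char) (0x80 | (c & 0x3F));  /* trail byte 10xxxxxx */
        }
    }
    dst[n] = '\0';
    return dst;
}
```

Calling latin1_to_utf8 ("M\xFCller") should therefore yield the string
"M\xC3\xBCller".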

Just for the sake of completeness, here is the reverse conversion:
==================================================================
/* Read UTF-8 characters from stdin, convert them to Latin-1
   (ISO-8859-1), and write the converted characters to stdout.
   UTF-8 is defined by RFC 2044.
*/
#include <errno.h>
#include <stdio.h>

static char UTF8len[64]
/* A map from the most-significant 6 bits of the first byte
   to the total number of bytes in a UTF-8 character.
*/
= {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
   1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* erroneous */
   2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6};

int
main (int argc, char** argv)
{
    register int c;
    while ((c = getchar()) != EOF) {
        auto int len = UTF8len [(c >> 2) & 0x3F];
        register unsigned long u;
        switch (len) {
          case 6: u = c & 0x01; break;
          case 5: u = c & 0x03; break;
          case 4: u = c & 0x07; break;
          case 3: u = c & 0x0F; break;
          case 2: u = c & 0x1F; break;
          case 1: u = c & 0x7F; break;
          case 0: /* erroneous: c is a continuation byte without a lead
                     byte; absorb up to 4 further continuation bytes. */
            len = 5; u = c & 0x3F; break;
        }
        while (--len && (c = getchar()) != EOF) {
            if ((c & 0xC0) == 0x80) {
                u = (u << 6) | (c & 0x3F);
            } else { /* unexpected start of a new character */
                ungetc (c, stdin);
                break;
            }
        }
        if (c == EOF) break;
        if (u <= 0xFF) {
            putchar (u);
        } else { /* this character can't be represented in Latin-1 */
            putchar ('?'); /* a reasonable alternative is 0x1A (SUB) */
        }
    }
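    if (ferror (stdin)) {
        /* ferror() only reports that an error occurred; it does not
           return an errno value, so errno is left as the failed read
           set it. */
        perror (argv[0]);
        return 1;
    }
    return 0;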
}
==================================================================
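For comparison, here is a simplified in-memory sketch of the reverse
direction. The function name utf8_to_latin1 is hypothetical, and the
logic is deliberately stricter than the program above: it only accepts
the two lead bytes that can produce a Latin-1 character (0xC2 and 0xC3)
and replaces everything else, including stray continuation bytes, with
a single "?".

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from the original post): convert a
   NUL-terminated UTF-8 string to Latin-1 in a newly malloc'd buffer.
   Characters above U+00FF and malformed sequences become '?'.
   Returns NULL if the allocation fails; the caller frees the result. */
char *
utf8_to_latin1 (const char *src)
{
    char *dst;
    size_t n = 0;

    assert (src != NULL);
    /* The Latin-1 output is never longer than the UTF-8 input. */
    dst = malloc (strlen (src) + 1);
    if (dst == NULL)
        return NULL;

    while (*src != '\0') {
        unsigned char c = (unsigned char) *src++;
        unsigned char next = (unsigned char) *src;
        if (c < 0x80) {
            dst[n++] = (char) c;                       /* ASCII */
        } else if ((c == 0xC2 || c == 0xC3) && (next & 0xC0) == 0x80) {
            /* Two-byte sequence for U+0080..U+00FF. */
            dst[n++] = (char) (((c & 0x03) << 6) | (next & 0x3F));
            src++;
        } else {
            dst[n++] = '?';
            /* Skip the continuation bytes that belong to this sequence. */
            while (((unsigned char) *src & 0xC0) == 0x80)
                src++;
        }
    }
    dst[n] = '\0';
    return dst;
}
```

With this sketch, "M\xC3\xBCller" converts back to "M\xFCller", while a
euro sign (bytes E2 82 AC) collapses to one "?".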


For new implementations I plan to integrate the Unicode-String package into
my Perl installation, but I'm not finished with it yet... :-) :-)


Best regards
G. Baruzzi