Re: Problems with case folding of UTF-8

To: Stig@OpenLDAP.org
Subject: Re: Problems with case folding of UTF-8
From: Pierangelo Masarati <masarati@aero.polimi.it>
Date: Mon, 10 Dec 2001 19:10:35 +0100 (MET)
Cc: openldap-devel@OpenLDAP.org
In-reply-to: <20011210190248.A26736@itea.ntnu.no> from Stig Venaas at Dec 10, 2001 07:02:48 pm

> On Mon, Dec 10, 2001 at 10:05:57AM +0100, Pierangelo Masarati wrote:
> > Hi,
> > 
> > while dealing with DN normalization I had a serious problem.  I was 
> > dealing with Italian accents in Latin-1 ("e acute", "e grave" and so on),
> > and this is what happened:
> 
> I think the behavior is correct,

Sure, but what about "iconv" not accepting it?

> but I need to do some checking before
> I can give you a definite answer. Hopefully I'm able to do that in 8-9
> days. If you feel like it you can look at the Unicode tables yourself.
> Look at UnicodeData.txt in HEAD. The format is explained on the
> Unicode Consortium web site.

I'll try

> 
> > b) breaks the current DN normalization workaround in slapd because 
> > the resulting normalized DN is longer than the input one (a six-char 
> > '\c3\89' is turned into a seven-char 'E\cc\81' when there's an 
> > equivalent six-char representation)
> 
> Yes, there can be multiple equivalent representations, and the
> normalized representation is often not the shortest one, so we
> have to allow for the string to grow, which means that you need
> to do a new allocation for the normalized string. UTF8Normalize, or
> whatever I called it, does this.
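
To make the growth concrete, here is a small Python sketch (not slapd
code) that reproduces the six-char versus seven-char case above. It uses
the standard unicodedata module for the decomposition; the helper name
escape_non_ascii is made up for this example, and it only assumes the
escaping convention used in this mail (ASCII bytes as-is, other bytes
as \xx).

```python
import unicodedata

def escape_non_ascii(data: bytes) -> str:
    """Hypothetical helper: keep ASCII bytes, write other bytes as \\xx."""
    return "".join(chr(b) if b < 0x80 else "\\%02x" % b for b in data)

composed = "\u00c9"                                   # precomposed E acute
decomposed = unicodedata.normalize("NFD", composed)   # 'E' + U+0301

composed_escaped = escape_non_ascii(composed.encode("utf-8"))
decomposed_escaped = escape_non_ascii(decomposed.encode("utf-8"))

# Print each escaped form next to its length in characters.
print(composed_escaped, len(composed_escaped))
print(decomposed_escaped, len(decomposed_escaped))
```

The decomposed form is one byte longer in UTF-8 (three bytes instead of
two), which shows up as one extra character once escaped, so a buffer
sized for the input is not enough.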

Well, at present the issue is that most of the code calls dn_normalize
expecting in-place normalization. Of course the new dnNormalize routine
will make that obsolete, but in the meantime we need to deal with in-place
behavior.  I'm a bit scared of speeding up the replacement because
it might interfere with other developers if I commit too often; on the
other hand I need to commit very often, as you may notice, to keep
pace with the other developers: last weekend I got 250 messages, most
of which were commits by Howard :)

Seriously, the "multiple equivalent representations" again scare me a bit,
because unless our normalization routines are pretty robust, uniquely 
choosing the same representation regardless of the input, we won't end up
with a unique string (not even structural) representation of the DNs.

If at all possible, I'd prefer the short representation regardless
of the input form.
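
As a rough illustration of what "the same representation regardless of
the input" would look like, here is a Python sketch using the standard
Unicode normalization forms from the unicodedata module. It is only an
assumption that one of these forms is what our routines should produce;
the helper name canonical is made up for the example.

```python
import unicodedata

# The two equivalent spellings of "E acute" discussed in this thread.
precomposed = "\u00c9"     # single code point
combining = "E\u0301"      # 'E' followed by COMBINING ACUTE ACCENT

def canonical(s: str, form: str = "NFC") -> bytes:
    """Hypothetical helper: normalize to one fixed form, then encode as UTF-8."""
    return unicodedata.normalize(form, s).encode("utf-8")

# Collect the distinct results each form produces for the two inputs.
nfc_results = {canonical(s, "NFC") for s in (precomposed, combining)}
nfd_results = {canonical(s, "NFD") for s in (precomposed, combining)}

print(nfc_results)
print(nfd_results)
```

Each set ends up with a single element, so either form gives a unique
string as long as it is applied consistently; for this character the
composed form (NFC) is the short one. Note that even the composed form
can come out longer than the input for some characters that Unicode
excludes from composition, so it does not by itself guarantee that
in-place normalization is safe.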

Pierangelo.
