
Re: Problems with case folding of UTF-8



> On Mon, Dec 10, 2001 at 10:05:57AM +0100, Pierangelo Masarati wrote:
> > Hi,
> > 
> > while dealing with DN normalization I had a serious problem.  I was 
> > dealing with Italian accents in Latin-1 ("e acute", "e grave" and so on),
> > and this is what happened:
> 
> I think the behavior is correct,

sure, but what about "iconv" not accepting it?

> but I need to do some checking before
> I can give you a definite answer. Hopefully I'm able to do that in 8-9
> days. If you feel like it you can look at the Unicode tables yourself.
> Look at UnicodeData.txt in HEAD. The format is explained on the
> Unicode Consortium web site.

I'll try

> 
> > b) breaks the current DN normalization workaround in slapd because 
> > the resulting normalized DN is longer than the input one (a six-char 
> > '\c3\89' is turned into a seven-char 'E\cc\81' when there's an 
> > equivalent six-char representation)
> 
> Yes, there can be multiple equivalent representations, and the
> normalized representation is often not the shortest one, so we
> have to allow for the string to grow, which means that you need
> to do new allocation for the normalized string. UTF8Normalize or
> whatever I called it, does this.
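
To make the growth concrete, here is a minimal sketch in Python (not the
slapd C code). It assumes the normalization in question behaves like
Unicode canonical decomposition (NFD); escape_non_ascii is a hypothetical
helper that only mimics the '\xx' notation used above.

```python
import unicodedata

def escape_non_ascii(utf8: bytes) -> str:
    """Hypothetical helper: write each non-ASCII byte as \\xx, ASCII as-is."""
    return "".join(chr(b) if b < 0x80 else "\\%02x" % b for b in utf8)

# U+00C9 LATIN CAPITAL LETTER E WITH ACUTE, precomposed form
precomposed = "\u00c9"
# Canonical decomposition: 'E' followed by U+0301 COMBINING ACUTE ACCENT
decomposed = unicodedata.normalize("NFD", precomposed)

before = escape_non_ascii(precomposed.encode("utf-8"))
after = escape_non_ascii(decomposed.encode("utf-8"))

print(before, len(before))
print(after, len(after))
```

The decomposed string is one character longer in escaped form (and one
byte longer in raw UTF-8), so a buffer sized for the input cannot hold it.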

Well, at present the issue is that most of the code calls dn_normalize
expecting in-place normalization. Of course the new dnNormalize routine
will make that obsolete, but in the meantime we need to deal with in-place
behavior.  I'm a bit wary of speeding up the replacement because
it might interfere with other developers if I commit too often; on the
other hand I need to commit very often, as you may notice, to keep
pace with the other developers: last weekend I got 250 messages, most
of which were commits by Howard :)

Seriously, the "multiple equivalent representations" scare me a bit too,
because unless our normalization routines are robust enough to always
choose the same representation regardless of the input, we won't end up
with a unique string representation of the DNs (not even a structural one).

If at all possible, I'd prefer the short representation regardless
of the input form.
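
As an illustration of what "unique regardless of the input" would look
like, here is a small Python sketch using the standard Unicode
normalization forms. It assumes the composed form (NFC) is what I mean by
"short"; it says nothing about what the slapd routines currently do.

```python
import unicodedata

# Two canonically equivalent spellings of "E acute"
forms = ["\u00c9", "E\u0301"]

# Each normalization form maps both inputs to a single representation
nfc = {unicodedata.normalize("NFC", s) for s in forms}
nfd = {unicodedata.normalize("NFD", s) for s in forms}

for name, result in (("NFC", nfc), ("NFD", nfd)):
    for s in result:
        print(name, s.encode("utf-8").hex(), len(s.encode("utf-8")), "bytes")
```

Either form gives a unique result as long as it is applied consistently;
the composed one is the shorter of the two here. Note that for a few
characters even the composed form can be longer than the input, so this
alone may not make in-place normalization safe.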

Pierangelo.