Re: Problems with case folding of UTF-8
> On Mon, Dec 10, 2001 at 10:05:57AM +0100, Pierangelo Masarati wrote:
> > Hi,
> > while dealing with DN normalization I had a serious problem. I was
> > dealing with Italian accents in Latin-1 ("e acute", "e grave" and so on),
> > and this is what happened:
> I think the behavior is correct,
sure, but what about "iconv" not accepting it?
> but I need to do some checking before
> I can give you a definite answer. Hopefully I'll be able to do that in
> 8-9 days. If you feel like it, you can look at the Unicode tables
> yourself. Look at UnicodeData.txt in HEAD. The format is explained on
> the Unicode Consortium web site.
> > b) breaks the current DN normalization workaround in slapd because
> > the resulting normalized DN is longer than the input one (a six-char
> > '\c3\89' is turned into a seven-char 'E\cc\81' when there's an
> > equivalent six-char representation)
> Yes, there can be multiple equivalent representations, and the
> normalized representation is often not the shortest one, so we
> have to allow for the string to grow, which means that you need
> to allocate a new buffer for the normalized string. UTF8Normalize, or
> whatever I called it, does this.
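
To make the length problem above concrete, here is a minimal sketch in Python (not the slapd code, which is C). It assumes the normalization in question behaves like Unicode canonical decomposition (NFD); the helper escape_non_ascii is hypothetical and only mimics the backslash-hex notation used in this thread.

```python
import unicodedata


def escape_non_ascii(data: bytes) -> str:
    """Hypothetical helper: keep ASCII bytes as-is, write others as \\xx."""
    return "".join(chr(b) if b < 0x80 else "\\%02x" % b for b in data)


# U+00C9 LATIN CAPITAL LETTER E WITH ACUTE, the precomposed form.
composed = "\u00c9"
# Canonical decomposition: 'E' followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = unicodedata.normalize("NFD", composed)

composed_escaped = escape_non_ascii(composed.encode("utf-8"))
decomposed_escaped = escape_non_ascii(decomposed.encode("utf-8"))

# The escaped decomposed form is one character longer than the composed
# one, so it cannot be written back into the same buffer in place.
print(composed_escaped, len(composed_escaped))
print(decomposed_escaped, len(decomposed_escaped))
```

The growth is one byte in UTF-8 (two bytes become three) and one character in the escaped form (six become seven), which matches the counts quoted above.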
Well, at present the issue is that most of the code calls dn_normalize
expecting in-place normalization. Of course the new dnNormalize routine
will obsolete that, but in the meantime we need to deal with in-place
behavior. I'm a bit scared about speeding up the replacement because
it might interfere with other developers if I commit too often; on the
other hand I need to commit very often, as you may notice, to keep
pace with the other developers: last weekend I got 250 messages, most
of them commits by Howard :)
Seriously, the "multiple equivalent representations" again scare me a bit,
because unless our normalization routines are pretty robust, uniquely
choosing the same representation regardless of the input, we won't end up
with a unique string (not even structural) representation of the DNs.
If at all possible, I'd prefer the short representation regardless
of the input form.
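
As a rough illustration of the uniqueness concern, the sketch below (Python, not the slapd code) assumes that the "short representation" corresponds to Unicode canonical composition (NFC). The function name canonical is hypothetical. NFC is not guaranteed to be the shortest form for every string, but it does map all canonically equivalent inputs to a single output.

```python
import unicodedata


def canonical(value: str) -> str:
    """Hypothetical normalizer: map equivalent inputs to one composed form."""
    return unicodedata.normalize("NFC", value)


composed = "\u00c9"     # precomposed E with acute
decomposed = "E\u0301"  # 'E' plus combining acute accent

# The two inputs differ as raw strings...
inputs_differ = composed != decomposed

# ...but normalize to the same string, whatever form came in.
normalized = {canonical(form) for form in (composed, decomposed)}

print(inputs_differ, len(normalized))
print(len(canonical(decomposed).encode("utf-8")))
```

Whichever form is chosen, the point is that the choice has to be fixed and applied to every input, so that two equivalent DNs always compare equal after normalization.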