
Problems with case folding of UTF-8



Hi,

While dealing with DN normalization I ran into a serious problem.  I was
dealing with Italian accents in Latin-1 ("e acute", "e grave" and so on),
and this is what happened:

I used "iconv" to convert strings, and I got:

	e grave (U+00E8) -> '\c3\a8'

which, after normalization (case folding), turns into

	'\c3\88' -> E grave (U+00C8)

OK, everything works fine, DN normalization is great!
Then, to make it short:

	e acute (U+00E9) -> '\c3\a9'

which after normalization (case folding) should be:

	'\c3\89' -> E acute (U+00C9)

but I actually got

	'E\cc\81'

which is the decomposed version of "E acute", namely the "E" followed
by the combining acute accent (U+0301); but this:
a) annoys "iconv", which refuses to convert this back to Latin-1,
because of an "illegal input sequence at position XXX", and
b) breaks the current DN normalization workaround in slapd because
the resulting normalized DN is longer than the input one (a six-char
'\c3\89' is turned into a seven-char 'E\cc\81' when there's an
equivalent six-char representation)

I noticed that "a acute" and "e grave" are converted correctly in
the "six-char" form, while "e acute", "o acute" and "u acute" are not
(they're the main Italian non-ASCII letters); I'm afraid German, Spanish
and French need even more, so this could be a big problem.
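
For anyone who wants to reproduce the two forms outside slapd, here is a
minimal Python sketch. It is not the slapd code: "escaped" and
"fits_latin1" are helper names made up for this illustration, and
str.upper() merely stands in for the case folding step. It only shows
that the six-char and seven-char strings are the composed (NFC) and
decomposed (NFD) forms of the same letter, and that only the composed
one can go back to Latin-1.

```python
import unicodedata


def escaped(data: bytes) -> str:
    """Render bytes in the '\\c3\\a9' style: printable ASCII kept, the rest as \\hh."""
    return "".join(
        chr(b) if 0x20 <= b < 0x7F and b != 0x5C else "\\%02x" % b
        for b in data
    )


def fits_latin1(text: str) -> bool:
    """True if every character of text has a Latin-1 code point."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


# Byte 0xE9 in Latin-1 is U+00E9; upper() is an assumed stand-in for the folding.
lower = b"\xe9".decode("latin-1")
upper = lower.upper()

composed = unicodedata.normalize("NFC", upper)    # single code point U+00C9
decomposed = unicodedata.normalize("NFD", upper)  # 'E' followed by U+0301

print(escaped(lower.encode("utf-8")))       # \c3\a9
print(escaped(composed.encode("utf-8")))    # \c3\89
print(escaped(decomposed.encode("utf-8")))  # E\cc\81

print(fits_latin1(composed))    # True
print(fits_latin1(decomposed))  # False, U+0301 has no Latin-1 equivalent

# Recomposing the decomposed string gives back the short form.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == composed)   # True
```

If the folding step cannot be changed, recomposing its output in this
way before handing it to iconv would be one possible workaround, under
the assumption that every character involved has a precomposed form.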

Maybe I'm doing something wrong, so please correct me before I start
digging into the UTF-8 code to see what's going on :)

Pierangelo.