
Re: normalised UTF-8, should it be "decomposed", or "composed"?



On Tue, Feb 19, 2002 at 11:43:25AM +0100, John Hughes wrote:
> In ldap/libraries/liblunicode/ucstr.c we have around 203:
> 
>                 /* normalize ucs of length p - ucs */
>                 uccanondecomp( ucs, p - ucs, &ucsout, &ucsoutlen );
>                 ucsoutlen = uccanoncomp( ucsout, ucsoutlen );
> 
> Why convert to decomposed form then back to composed?  Wouldn't
> it be better to use decomposed form as the "normalised" form?

Good question. I believe it would improve performance; at least the
normalization itself would be quicker. There might be some issues
with indexes becoming larger (and then somewhat slower to access
and update), and since we do substring matching on bytes, not
characters, indexing and matching might also become slower because
there are more parts to index and match. Still, I think it might be
worthwhile. When this code was written, we had problems when the
normalized string was larger than the input string. I believe that
might be fixed now (it would need to be fixed anyway), but it is
much more critical if we skip the composition step, since most
non-ASCII strings would then become larger (unless the input strings
are already decomposed).
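To make the size concern concrete, here is a small sketch (in Python
rather than liblunicode's C, purely for brevity; the `unicodedata`
module is a stand-in for the uccanondecomp/uccanoncomp machinery)
showing that the decomposed (NFD) encoding of a typical non-ASCII
string is longer in UTF-8 than its composed (NFC) form:

```python
import unicodedata

s = "café"  # example string with one non-ASCII character

# Composed form: é stays a single code point (U+00E9, 2 bytes in UTF-8)
nfc = unicodedata.normalize("NFC", s)
# Decomposed form: é becomes 'e' + U+0301 COMBINING ACUTE ACCENT (1 + 2 bytes)
nfd = unicodedata.normalize("NFD", s)

print(len(nfc.encode("utf-8")))  # 5 bytes
print(len(nfd.encode("utf-8")))  # 6 bytes: the decomposed form is larger
```

So normalizing to decomposed form alone would grow most accented
strings, which is exactly the "normalized string larger than input"
case mentioned above.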

Whether we do canonical composition or not should not change
which strings are considered equal. Canonical composition is an
injective function on the set of canonically decomposed strings.
That is, if you have two canonically decomposed strings X and Y,
and you perform canonical composition obtaining CC(X), CC(Y), then
CC(X) == CC(Y) if and only if X == Y, where == is byte comparison.
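The injectivity claim can be checked on a small example (again a
Python sketch using `unicodedata`, with NFD playing the role of
canonical decomposition and NFC that of CC): equal decomposed
strings compose to equal results, and distinct decomposed strings
compose to distinct results.

```python
import unicodedata

def decomp(s):
    """Canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", s)

def cc(s):
    """Canonical composition (NFC) of an already-decomposed string."""
    return unicodedata.normalize("NFC", s)

x = "e\u0301"            # 'e' + COMBINING ACUTE, already decomposed
y = decomp("\u00e9")     # NFD of precomposed é: same decomposed string
z = "e\u0300"            # 'e' + COMBINING GRAVE: a different decomposed string

# X == Y implies CC(X) == CC(Y) ...
assert x == y and cc(x) == cc(y)
# ... and X != Z implies CC(X) != CC(Z): composition loses no distinctions
assert x != z and cc(x) != cc(z)
```

In other words, running uccanondecomp followed by uccanoncomp
versus stopping after uccanondecomp should yield the same equality
relation; only the byte length of the stored normalized form differs.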

I hope Ando and others have some opinions on this as well.

Stig