RE: normalised UTF-8, should it be "decomposed", or "composed"?



> -----Original Message-----
> From: owner-openldap-devel@OpenLDAP.org
> [mailto:owner-openldap-devel@OpenLDAP.org]On Behalf Of Stig Venaas

> On Tue, Feb 19, 2002 at 11:43:25AM +0100, John Hughes wrote:
> > In ldap/libraries/liblunicode/ucstr.c we have around 203:
> >
> >                 /* normalize ucs of length p - ucs */
> >                 uccanondecomp( ucs, p - ucs, &ucsout, &ucsoutlen );
> >                 ucsoutlen = uccanoncomp( ucsout, ucsoutlen );
> >
> > Why convert to decomposed form then back to composed?  Wouldn't
> > it be better to use decomposed form as the "normalised" form?

> Good question. I believe it would improve performance, at least the
> normalization would be quicker. There might be some issues with
> indexes being larger (and then somewhat slower to access/update
> perhaps), and since we do substring matching on bytes not characters,
> the indexing and matching job might become slower since there are
> more parts to index/match. Still, I think it might be worthwhile.
> When this code was written, we had problems when the normalized
> string was larger than the input string. I believe that might be
> fixed now (it would need to be fixed anyway), but it's much more
> critical if we don't do the composition, since most non-ASCII strings
> would become larger (unless the input strings are all decomposed).
>
> Whether we do canonical composition or not should not change
> which strings are considered equal. Canonical composition is an
> injective function on the set of canonically decomposed strings.
> That is, if you have two canonically decomposed strings X and Y,
> and you perform canonical composition obtaining CC(X), CC(Y), then
> CC(X) == CC(Y) if and only if X == Y, where == is byte comparison.
>
> I hope Ando and others have some opinions on this as well.

The Unicode Collation Algorithm http://www.unicode.org/unicode/reports/tr10/
specifies using the decomposed form. However, since we don't actually
implement that algorithm, it may not matter. Using the decomposed form
does allow for tailoring the properties of a search/compare operation,
if we wanted to go that route. For example, it would enable searches
that ignored accents, which is sometimes desirable. (Of course there is
no way to express this search customization in the LDAP protocol, and it
would require some changes to the index generation logic.) This, to me,
is the only advantage of using the decomposed form.
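
To illustrate the accent-insensitive comparison that the decomposed form
would make possible, here is a minimal sketch in C. It is not liblunicode
code: the function names are hypothetical, the inputs are assumed to be
arrays of code points that are already canonically decomposed, and only
U+0300..U+036F are treated as accents, where a real implementation would
consult the combining class tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplifying assumption: only the Combining Diacritical Marks block
 * (U+0300..U+036F) counts as an "accent" here. */
static int is_combining_mark(uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Return the first index at or after i that is not a combining mark. */
static size_t skip_marks(const uint32_t *s, size_t len, size_t i)
{
    while (i < len && is_combining_mark(s[i]))
        i++;
    return i;
}

/* Compare two canonically decomposed code point arrays, ignoring the
 * combining marks in both. Returns 1 if the remaining characters are
 * identical, 0 otherwise. */
int nfd_equal_ignoring_accents(const uint32_t *a, size_t alen,
                               const uint32_t *b, size_t blen)
{
    size_t i = 0, j = 0;

    assert(a != NULL && b != NULL);
    for (;;) {
        i = skip_marks(a, alen, i);
        j = skip_marks(b, blen, j);
        if (i == alen || j == blen)
            return i == alen && j == blen;  /* both must be exhausted */
        if (a[i] != b[j])
            return 0;
        i++;
        j++;
    }
}
```

With this, "cafe" followed by U+0301 compares equal to plain "cafe", while
the precomposed U+00E9 does not match a plain 'e', which is why the trick
depends on the input being decomposed.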

In the normalization document http://www.unicode.org/unicode/reports/tr15/
they note that most input will already be in NFC. I think it would be
best if we implemented the suggestion in Annex 8 of this document and
detected whether an input string is already in normalized form, which
would save the most time in the common case, since such strings could
skip the decompose/compose pass entirely.
If we adopt this approach, then we should stick with the composed form.
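
A much cruder pre-check than the one described in Annex 8, but one that
shows the idea, is sketched below. It assumes valid UTF-8 input and
relies on the property (worth verifying against the Unicode data files)
that no code point below U+0300 is a combining mark or is changed by NFC;
such code points encode to bytes below 0xCC. The function name is
hypothetical and not part of liblunicode.

```c
#include <assert.h>
#include <stddef.h>

/* Conservative fast path: returns 1 if the (valid) UTF-8 string contains
 * only code points below U+0300 and can therefore be taken as already
 * in NFC, or 0 if the question cannot be answered this cheaply. */
int utf8_is_trivially_nfc(const unsigned char *s, size_t len)
{
    size_t i;

    assert(s != NULL || len == 0);
    for (i = 0; i < len; i++) {
        /* A lead byte of 0xCC or above starts a code point at or
         * beyond U+0300; continuation bytes never exceed 0xBF. */
        if (s[i] >= 0xCC)
            return 0;
    }
    return 1;
}
```

A return value of 0 only means "don't know"; the caller would then fall
back to the existing uccanondecomp()/uccanoncomp() path. The full quick
check in the annex uses a per-character property lookup and would accept
many more strings than this.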

  -- Howard Chu
  Chief Architect, Symas Corp.       Director, Highland Sun
  http://www.symas.com               http://highlandsun.com/hyc
  Symas: Premier OpenSource Development and Support