
Re: normalised UTF-8, should it be "decomposed", or "composed"?



On Wed, Feb 20, 2002 at 05:55:48AM -0800, Howard Chu wrote:
> The Unicode Collation Algorithm http://www.unicode.org/unicode/reports/tr10/
> specifies using the decomposed form. However, since we don't actually
> implement that algorithm, it may not matter. Using the decomposed form
> does allow for tailoring the properties of a search/compare operation,
> if we wanted to go that route. For example, it would enable searches
> that ignored accents, which is sometimes desirable. (Of course there is
> no way to express this search customization in the LDAP protocol, and it
> would require some changes to the index generation logic.) This to me is
> the only advantage to using the decomposed form.

Right, I didn't think of that one.
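Just to make the idea concrete, an accent-insensitive fold could look
roughly like the sketch below: decompose to NFD, then drop the
combining marks before comparing or indexing. I'm using ICU's C API
here purely for illustration; slapd would of course use its own UTF-8
routines rather than UTF-16 and ICU.

    /* Rough sketch only, not real slapd code. */
    #include <unicode/uchar.h>
    #include <unicode/unorm2.h>
    #include <unicode/utf16.h>

    static int32_t strip_accents( const UChar *src, int32_t len,
        UChar *dst, int32_t cap )
    {
        UErrorCode err = U_ZERO_ERROR;
        const UNormalizer2 *nfd = unorm2_getNFDInstance( &err );
        UChar buf[256];     /* assume attribute values are short */
        int32_t n, i = 0, out = 0;

        n = unorm2_normalize( nfd, src, len, buf, 256, &err );
        if ( U_FAILURE( err )) return -1;

        while ( i < n ) {
            UChar32 c;
            U16_NEXT( buf, i, n, c );
            /* ccc != 0 is a crude "this is an accent" test */
            if ( u_getCombiningClass( c ) != 0 ) continue;
            if ( out + U16_LENGTH( c ) <= cap )
                U16_APPEND_UNSAFE( dst, out, c );
        }
        return out;
    }

An index built over such folded values would let a search for
"Munchen" also match "München".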

> In the normalization document http://www.unicode.org/unicode/reports/tr15/
> they note that most input will already be in NFC format. I think it

Yes, probably.

> would be best if we implemented the suggestion in Annex 8 of this
> document and detect whether an input string is already in the valid
> normalized format, which would save the most time in the general case.
> If we adopt this approach, then we should stick with the composed form.

Assuming the input already is NFC, this helps a lot; if the percentage
of NFC input is high enough, it is worth it. As I understand it you
still have to verify the canonical ordering of the combining marks,
but checking the ordering, and even fixing it in place, requires no
memory allocations, so it should still help a lot.
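Something like the following, in other words, where the quick check
handles the common case without touching the allocator. Again the ICU
calls are just stand-ins for whatever we actually implement:

    #include <stdlib.h>
    #include <unicode/unorm2.h>

    /* Returns 0 if *out aliases src (already NFC), 1 if *out was
     * freshly allocated, -1 on error. */
    static int nfc_if_needed( const UChar *src, int32_t len,
        UChar **out, int32_t *outlen )
    {
        UErrorCode err = U_ZERO_ERROR;
        const UNormalizer2 *nfc = unorm2_getNFCInstance( &err );

        /* Common case: quick check says "yes", nothing to do.
         * ("maybe" falls through to the full normalization,
         * which is the safe answer.) */
        if ( unorm2_quickCheck( nfc, src, len, &err ) == UNORM_YES
            && U_SUCCESS( err )) {
            *out = (UChar *)src;
            *outlen = len;
            return 0;
        }

        /* Rare case: preflight for the size, then normalize. */
        err = U_ZERO_ERROR;
        *outlen = unorm2_normalize( nfc, src, len, NULL, 0, &err );
        if ( err != U_BUFFER_OVERFLOW_ERROR ) return -1;
        err = U_ZERO_ERROR;
        *out = malloc( *outlen * sizeof(UChar) );
        if ( *out == NULL ) return -1;
        unorm2_normalize( nfc, src, len, *out, *outlen, &err );
        return U_FAILURE( err ) ? -1 : 1;
    }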

In parts of the code we could try to "remember" whether a string is
already NFC; that would make it even faster. I guess it might be a bit
of an overkill to store such metadata in the database, though. Or
would it?
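In memory it would only cost a bit or two next to each value,
something like this (names completely made up, just to show the
shape):

    struct norm_berval {
        struct berval nb_val;       /* the UTF-8 value itself */
        unsigned nb_nfc_checked:1;  /* quick check already run */
        unsigned nb_is_nfc:1;       /* ...and it said "yes" */
    };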

Stig