Re: normalised UTF-8, should it be "decomposed", or "composed"?

To: Howard Chu <hyc@highlandsun.com>
Subject: Re: normalised UTF-8, should it be "decomposed", or "composed"?
From: Stig Venaas <Stig@OpenLDAP.org>
Date: Wed, 20 Feb 2002 15:27:24 +0100
Cc: John Hughes <john@Calva.COM>, "'OpenLDAP DEVEL'" <openldap-devel@OpenLDAP.org>
Content-disposition: inline
In-reply-to: <NMEFLNHODBAOPDKNNJALEEPICFAA.hyc@highlandsun.com>; from hyc@highlandsun.com on Wed, Feb 20, 2002 at 05:55:48AM -0800
References: <20020220133934.A10991@itea.ntnu.no> <NMEFLNHODBAOPDKNNJALEEPICFAA.hyc@highlandsun.com>
User-agent: Mutt/1.2.5i

On Wed, Feb 20, 2002 at 05:55:48AM -0800, Howard Chu wrote:
> The Unicode Collation Algorithm http://www.unicode.org/unicode/reports/tr10/
> specifies using the decomposed form. However, since we don't actually
> implement that algorithm, it may not matter. Using the decomposed form
> does allow for tailoring the properties of a search/compare operation,
> if we wanted to go that route. For example, it would enable searches
> that ignored accents, which is sometimes desirable. (Of course there is
> no way to express this search customization in the LDAP protocol, and it
> would require some changes to the index generation logic.) This to me is
> the only advantage to using the decomposed form.

Right, I didn't think of that one.
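
To make the accent-ignoring idea concrete, here is a rough sketch in Python (not the C used in OpenLDAP, and not anything that exists in the code today). The helper name strip_accents is made up for illustration; it decomposes to NFD and then drops the combining marks:

```python
import unicodedata

def strip_accents(s: str) -> str:
    """Hypothetical helper: decompose to NFD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", s)
    # Combining marks have a non-zero canonical combining class.
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)

def accent_insensitive_equal(a: str, b: str) -> bool:
    """Compare two strings while ignoring accents (case is still significant)."""
    return strip_accents(a) == strip_accents(b)
```

With the decomposed form the base letter and the accent are separate code points, which is why the filter is a one-liner; with composed input the decomposition step has to happen first.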

> In the normalization document http://www.unicode.org/unicode/reports/tr15/
> they note that most input will already be in NFC format. I think it

Yes, probably.

> would be best if we implemented the suggestion in Annex 8 of this
> document and detect whether an input string is already in the valid
> normalized format, which would save the most time in the general case.
> If we adopt this approach, then we should stick with the composed form.

If the input is already NFC, this helps a lot, and if the percentage
of NFC input is high enough, it is worth doing. As I understand it, you
will still have to check the ordering, but checking the ordering, and
even correcting it, won't require any memory allocations, so it might
still help a lot.
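
A minimal sketch of that fast path, in Python rather than C, assuming Python 3.8 or later for unicodedata.is_normalized. The helper name to_nfc is made up, and the library's quick check is only a stand-in for whatever detection we would write ourselves:

```python
import unicodedata

def to_nfc(s: str) -> str:
    """Hypothetical helper: return s in NFC, normalizing only when needed."""
    # Check first: input that is already NFC is returned untouched,
    # so no new string is built in the common case.
    if unicodedata.is_normalized("NFC", s):
        return s
    # Slow path: decompose, reorder combining marks, and recompose.
    return unicodedata.normalize("NFC", s)
```

For input that is already NFC the function hands back the very same object, so the only cost is the check itself.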

In parts of the code, we could try to "remember" whether a string is
already NFC, which would make it even faster. I guess it might be a bit
of overkill to store such metadata in the database, though. What do you
think?
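
Roughly what I mean by "remembering", again as a Python sketch with made-up names (TaggedString, ensure_nfc); the flag lives only in memory here, not in the database:

```python
import unicodedata
from dataclasses import dataclass

@dataclass
class TaggedString:
    """Hypothetical wrapper: a string plus a flag saying it is known to be NFC."""
    value: str
    known_nfc: bool = False

def ensure_nfc(t: TaggedString) -> str:
    """Normalize at most once; later calls just return the stored value."""
    if not t.known_nfc:
        t.value = unicodedata.normalize("NFC", t.value)
        t.known_nfc = True
    return t.value
```

Any code that modifies the value would have to clear the flag again, which is part of why persisting it might not be worth the trouble.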

Stig
