Re: normalised UTF-8, should it be "decomposed", or "composed"?

On Wed, Feb 20, 2002 at 06:23:56AM -0800, Howard Chu wrote:
> Thinking about this more, it might make sense to add this behavior onto
> the existing approxMatch stuff. Currently the approx code strips any
> 8 bit characters from the input strings. To make it slightly more general,
> we could first decompose the strings using compatibility mapping (NFKD).
> It looks like the liblunicode currently doesn't handle compatibility
> decompositions though.

Yes, I agree. I had some plans for this myself, but never got that far.
I don't think I have time to add NFKD now (I need to check how much work
it would be), but what we can (and should) easily do right away is to
skip the composition step in approximate match (leaving us with NFD)
and then strip 8-bit characters. I'll look into this very soon.
Before releasing 2.1 we should try to finish things that affect indexes
so that people don't need to recreate them later. Optimizations like
checking for normalized forms can easily be done between minor versions.
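
To illustrate the idea of decomposing and then stripping, here is a
minimal sketch in Python. It is not the liblunicode C code: it relies on
Python's standard unicodedata module, and the function name approx_key
and its compat flag are hypothetical, chosen only for this example.
Dropping code points above 127 is assumed to be equivalent to stripping
every byte with the high bit set from the UTF-8 encoding.

```python
import unicodedata


def approx_key(s: str, compat: bool = False) -> str:
    """Hypothetical helper: reduce a string to an ASCII-only key.

    With compat=False the string is decomposed canonically (NFD), which
    is what you get by skipping the composition step. With compat=True
    it is decomposed with compatibility mappings (NFKD).
    """
    form = "NFKD" if compat else "NFD"
    decomposed = unicodedata.normalize(form, s)
    # Keep only 7-bit characters. Combining marks and any other
    # non-ASCII code points left after decomposition are dropped.
    return "".join(ch for ch in decomposed if ord(ch) < 128)


if __name__ == "__main__":
    print(approx_key("Ångström"))
    print(approx_key("\ufb01le"))               # LATIN SMALL LIGATURE FI + "le"
    print(approx_key("\ufb01le", compat=True))
```

With NFD alone, accented letters keep their base letter, but characters
that only have a compatibility decomposition (such as the "fi" ligature)
are stripped entirely. NFKD would keep those as well, which is the more
general behavior suggested in the quoted message.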