RE: normalised UTF-8, should it be "decomposed", or "composed"?



> -----Original Message-----
> From: Stig Venaas [mailto:Stig@OpenLDAP.org]

> On Wed, Feb 20, 2002 at 06:23:56AM -0800, Howard Chu wrote:
> > Thinking about this more, it might make sense to add this behavior onto
> > the existing approxMatch stuff. Currently the approx code strips any
> > 8 bit characters from the input strings. To make it slightly more general,
> > we could first decompose the strings using compatibility mapping (NFKD).
> > It looks like the liblunicode currently doesn't handle compatibility
> > decompositions though.
>
> Yes, I agree. I had some plans on this myself, but never got that far.
> I don't have time to add NFKD now I think (need to check how much work
> it would be), but what we easily can (and should do) right away, is to
> simply skip the composition part in approximate match (leaving us with
> NFD) and then strip 8-bit characters. I'll look into this very soon.
> Before releasing 2.1 we should try to finish things that affect indexes
> so that people don't need to recreate them later. Optimizations like
> checking for normalized forms can easily be done between minor versions.
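
To make the NFD-then-strip idea concrete, here is a rough sketch in Python
rather than C, just to show the shape of it. The name approx_key is made up
for illustration and is not part of liblunicode or slapd. It also assumes
that "8-bit characters" means anything outside ASCII, which at the code point
level is the same as dropping every UTF-8 byte with the high bit set.

```python
import unicodedata


def approx_key(s: str, form: str = "NFD") -> str:
    """Decompose s, then drop every non-ASCII code point.

    Hypothetical helper: with form="NFD" this mirrors "skip the composition
    step, then strip 8-bit characters"; form="NFKD" shows what the
    compatibility mapping would add on top of that.
    """
    decomposed = unicodedata.normalize(form, s)
    # Combining marks and other non-ASCII code points are discarded, so an
    # accented letter collapses to its ASCII base letter.
    return "".join(ch for ch in decomposed if ord(ch) < 0x80)


if __name__ == "__main__":
    print(approx_key("\u00e9"))          # precomposed e-acute
    print(approx_key("e\u0301"))         # e followed by combining acute
    print(approx_key("\ufb01", "NFKD"))  # "fi" ligature, compatibility form
```

With NFD alone, a compatibility character such as the "fi" ligature (U+FB01)
is not decomposed and so gets stripped entirely. With NFKD it becomes the two
ASCII letters "fi", which is the extra generality the compatibility mapping
would buy.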

One more thing - slapd always normalizes the asserted value
before performing a match. Both caseExactMatch and caseIgnoreMatch
currently use UTF8normcmp, which normalizes both of its input strings.
We should have a function for this case, where one input is already
normalized, to avoid that additional overhead.
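
Roughly what I have in mind, sketched in Python only to show the difference
in work done. Both function names are hypothetical and neither is the actual
UTF8normcmp signature. The normalization form is a parameter because this
thread has not settled on one, and the sketch covers only the exact-match
case (no case folding).

```python
import unicodedata


def normcmp_both(a: str, b: str, form: str = "NFC") -> bool:
    """Stand-in for the current behaviour: normalize both inputs."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


def normcmp_one(raw: str, normalized: str, form: str = "NFC") -> bool:
    """Proposed variant: only `raw` is normalized.

    The caller must guarantee that `normalized` is already in `form`;
    that precondition is what saves the second normalization pass.
    """
    return unicodedata.normalize(form, raw) == normalized


if __name__ == "__main__":
    # The already-normalized input is normalized once, up front.
    asserted = unicodedata.normalize("NFC", "e\u0301")
    for stored in ("\u00e9", "e\u0301", "e"):
        print(repr(stored), normcmp_one(stored, asserted))
```

The two functions agree whenever the precondition holds. The saving is one
normalization per comparison, which adds up when a single asserted value is
matched against many stored values.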

  -- Howard Chu
  Chief Architect, Symas Corp.       Director, Highland Sun
  http://www.symas.com               http://highlandsun.com/hyc
  Symas: Premier OpenSource Development and Support