
RE: normalised UTF-8, should it be "decomposed", or "composed"?

To: "Stig Venaas" <Stig@OpenLDAP.org>
Subject: RE: normalised UTF-8, should it be "decomposed", or "composed"?
From: "Howard Chu" <hyc@highlandsun.com>
Date: Wed, 20 Feb 2002 16:30:43 -0800
Cc: "John Hughes" <john@Calva.COM>, "'OpenLDAP DEVEL'" <openldap-devel@OpenLDAP.org>
Importance: Normal
In-reply-to: <20020220153950.C10991@itea.ntnu.no>

> -----Original Message-----
> From: Stig Venaas [mailto:Stig@OpenLDAP.org]

> On Wed, Feb 20, 2002 at 06:23:56AM -0800, Howard Chu wrote:
> > Thinking about this more, it might make sense to add this behavior onto
> > the existing approxMatch stuff. Currently the approx code strips any
> > 8 bit characters from the input strings. To make it slightly more general,
> > we could first decompose the strings using compatibility mapping (NFKD).
> > It looks like the liblunicode currently doesn't handle compatibility
> > decompositions though.
>
> Yes, I agree. I had some plans on this myself, but never got that far.
> I don't have time to add NFKD now I think (need to check how much work
> it would be), but what we easily can (and should do) right away, is to
> simply skip the composition part in approximate match (leaving us with
> NFD) and then strip 8-bit characters. I'll look into this very soon.
> Before releasing 2.1 we should try to finish things that affect indexes
> so that people don't need to recreate them later. Optimizations like
> checking for normalized forms can easily be done between minor versions.
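
To make the difference between the two proposals concrete, here is a
rough sketch in Python rather than slapd's C. It uses the standard
unicodedata module in place of liblunicode, and the name approx_key is
made up for illustration. Dropping code points above 127 stands in for
stripping 8-bit bytes from the UTF-8 encoding.

```python
import unicodedata

def approx_key(value: str, form: str = "NFD") -> str:
    """Decompose `value`, then drop everything outside 7-bit ASCII.

    Hypothetical helper, not slapd code. `form` is "NFD" (canonical
    decomposition only) or "NFKD" (compatibility decomposition).
    """
    decomposed = unicodedata.normalize(form, value)
    # Combining marks and other non-ASCII code points are discarded,
    # so only the ASCII base letters survive.
    return "".join(ch for ch in decomposed if ord(ch) < 128)

# Canonical decomposition keeps the base letters of accented characters.
print(approx_key("\u00c5ngstr\u00f6m"))      # Angstrom
# U+FB01 (the "fi" ligature) has only a compatibility decomposition,
# so NFD loses it while NFKD keeps "fi".
print(approx_key("\ufb01le", "NFD"))         # le
print(approx_key("\ufb01le", "NFKD"))        # file
```

The ligature case is the kind of input where NFD followed by stripping
still loses information that NFKD would keep.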

One more thing - slapd always normalizes the asserted value
before performing a match. Both caseExactMatch and caseIgnoreMatch
currently use UTF8normcmp, which normalizes both of its input strings,
so the asserted value gets normalized twice. We should have a variant
for this case, where one input is already normalized, to avoid that
additional overhead.
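
Roughly what I have in mind, sketched in Python instead of C. All
names here are hypothetical, and NFC plus str.casefold is only a
stand-in for whatever normalization slapd really applies. The one
assumption that matters is that the asserted value was normalized with
the same settings that the comparison uses.

```python
import unicodedata

def normalize(value: str, ignore_case: bool = False) -> str:
    # Stand-in normalization (assumption): optional case folding, then NFC.
    if ignore_case:
        value = value.casefold()
    return unicodedata.normalize("NFC", value)

def cmp_both(a: str, b: str, ignore_case: bool = False) -> int:
    """Normalize both inputs on every call, then compare."""
    x, y = normalize(a, ignore_case), normalize(b, ignore_case)
    return (x > y) - (x < y)

def cmp_prenormalized(stored: str, asserted_norm: str,
                      ignore_case: bool = False) -> int:
    """Normalize only the stored value; the asserted value is trusted
    to be normalized already, with the same ignore_case setting."""
    x = normalize(stored, ignore_case)
    return (x > asserted_norm) - (x < asserted_norm)

# The asserted value is normalized once, up front ...
asserted = normalize("Cafe\u0301", ignore_case=True)
# ... and then reused against every candidate value.
for stored in ("CAF\u00c9", "cafe\u0301", "coffee"):
    print(repr(stored), cmp_prenormalized(stored, asserted, ignore_case=True))
```

With one asserted value tested against many stored values, this halves
the number of normalization calls per comparison.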

  -- Howard Chu
  Chief Architect, Symas Corp.       Director, Highland Sun
  http://www.symas.com               http://highlandsun.com/hyc
  Symas: Premier OpenSource Development and Support
