
RE: normalised UTF-8, should it be "decomposed", or "composed"?



> -----Original Message-----
> From: owner-openldap-devel@OpenLDAP.org
> [mailto:owner-openldap-devel@OpenLDAP.org]On Behalf Of Howard Chu

> The Unicode Collation Algorithm 
> http://www.unicode.org/unicode/reports/tr10/
> specifies using the decomposed form. However, since we don't actually
> implement that algorithm, it may not matter. Using the decomposed form
> does allow for tailoring the properties of a search/compare operation,
> if we wanted to go that route. For example, it would enable searches
> that ignored accents, which is sometimes desirable. (Of course there is
> no way to express this search customization in the LDAP protocol, and it
> would require some changes to the index generation logic.) This to me is
> the only advantage to using the decomposed form.

Thinking about this more, it might make sense to add this behavior to
the existing approxMatch code. Currently the approx code strips any
8-bit characters from the input strings. To make it slightly more general,
we could first decompose the strings using compatibility mapping (NFKD),
so that stripping removes only the accents and leaves the base letters.
It looks like liblunicode currently doesn't handle compatibility
decompositions, though.
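
To illustrate the difference, here is a rough sketch in Python rather than
in the slapd C code. It is not the approxMatch implementation; the function
names strip_8bit and fold_accents are made up for this example, and the
standard unicodedata module stands in for what liblunicode would need to
provide. It compares plain stripping of non-ASCII characters with
"decompose with NFKD first, then drop the combining marks":

```python
import unicodedata


def strip_8bit(s: str) -> str:
    """Drop every non-ASCII character.

    Hypothetical stand-in for stripping 8-bit characters from the input;
    it works on code points here rather than on raw UTF-8 bytes.
    """
    return "".join(ch for ch in s if ord(ch) < 128)


def fold_accents(s: str) -> str:
    """Decompose with NFKD, then drop the combining marks.

    Accented letters reduce to their base letters, and compatibility
    characters such as ligatures expand to their plain equivalents.
    """
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    for name in ("Jos\u00e9", "\ufb01le"):
        print(repr(name), repr(strip_8bit(name)), repr(fold_accents(name)))
```

With plain stripping the precomposed character is lost entirely, while the
decomposed form keeps the base letter, which is what an accent-insensitive
approximate match would want to compare on.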

  -- Howard Chu
  Chief Architect, Symas Corp.       Director, Highland Sun
  http://www.symas.com               http://highlandsun.com/hyc
  Symas: Premier OpenSource Development and Support