
Re: [ldapext] draft-pauzies-ldap-schema-nonascii-mr-00.txt



Alexandre PAUZIES writes:

> you're right, the name of those matching rules isn't clear.  I'm
> French, so my work is focused on Latin characters, but maybe this
> could be useful for other kinds of characters, I don't know; that's
> why I chose to ignore all non-ASCII characters instead of only the
> Latin ones.
> 
> Do you think this could work only on Latin characters?

Sounds like this message got lost, so I'm reposting.
I've appended a few new points at the end (after the ===='s).

From: Hallvard B Furuseth <h.b.furuseth@usit.uio.no>
Date: Wed, 19 May 2004 10:14:21 +0200

Alexandre PAUZIES writes:
> http://www.ietf.org/internet-drafts/draft-pauzies-ldap-schema-nonascii-mr-00.txt

The draft says:
> When using those rules, non-ASCII characters such as letters with
> accents are converted (when UTF-8 compatibility conversion is
> possible RFC 2044 [RFC2044]) to ASCII characters (same letter without
> accent) before the match.

First, "same letter without accent" is not a good rule, and I don't know
of any standard tables one can use to implement it anyway.  When I asked
the Unicode mailinglist (unicode@unicode.org) about a similar problem,
the recommendations were:

- use the NFKD decompositions from the UCD, then see if the first
  character is an ASCII character, and if so, remove diacritics in the
  03xx block (that have a "Mn" general category and a non-zero combining
  class).

- or produce the NFD normalisation of the text, and remove all
  characters with a non-zero combining class.

Unlike NFD, NFKD would for example convert the trademark symbol to "TM"
and superscript 2 to "2".
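
For concreteness, here is a rough Python sketch of both recommendations,
using the standard unicodedata module (the function names are mine, and
this is only an illustration, not a proposed specification):

  import unicodedata

  def fold_nfd(s):
      # Second recommendation: NFD-normalise the text, then drop every
      # character with a non-zero combining class.
      return "".join(c for c in unicodedata.normalize("NFD", s)
                     if unicodedata.combining(c) == 0)

  def fold_nfkd(s):
      # First recommendation, applied per character: take the NFKD
      # decomposition, and when it starts with an ASCII character, keep
      # it minus the combining marks (category Mn, non-zero class).
      out = []
      for ch in s:
          decomp = unicodedata.normalize("NFKD", ch)
          if ord(decomp[0]) < 128:
              out.append("".join(c for c in decomp
                                 if not (unicodedata.category(c) == "Mn"
                                         and unicodedata.combining(c))))
          else:
              out.append(ch)    # no ASCII fallback known; keep as-is
      return "".join(out)

  print(fold_nfd("café"))        # -> "cafe"
  print(fold_nfkd("2² ™"))       # -> "22 TM" (NFD would leave these alone)
  print(fold_nfkd("Bjørn"))      # -> "Bjørn"; ø has no decomposition at all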

However, many people will find a rule like the one above close to what
they need, yet still useless or poor for their particular purpose.  For
example, an important special case I'd want in our server is to
translate "ø" to "o", both for foreigners who have only been told the
plain-ASCII form of a name and so that Norwegian "ø" would match
Swedish "ö".

Other examples (from D. Starner):

  Digraphs can be treated as titlecase, as all capitals, or intelligently.

  00FE - "th"
  00DE - "TH"
  00F0 - "dh" ("th"?)
  OOD0 - "DH" ("TH"?)
  0108 - "CH" (Esperanto)
  0109 - "ch"
  011C, 011D - "GH", "gh" (E-o)
  0124, 0125 - "HH", "hh" (")
  0134, 0135 - "JH", "jh" (")
  015C, 015D - "SH", "sh" (")
  017F - "s"

  Depending on your goals, 015F & 0161 could be "sh", 0163 "ts",
  017D "zh", etc.

  0195 - "hw"
  01A3 - "gh"(?)
  01BF - "w"
  01C0 - "|" ("c"?)
  01C1 - "||"? ("x"?)
  01C3 - "!" ("q"?)
  0223 - "w" ("ou"? "8"?)

  I omitted most capitals and those that can be found by decomposition
  or name stripping, as well as a bunch I don't know anything about.

My impression from the Unicode mailing list is that there are a lot of
such special cases not covered by Unicode, and that the usual solution
is to amend the Unicode character mappings with private mappings as
needed.  I think no comprehensive list of such special cases exists, and
what such a list should contain would in any case depend on, e.g., which
languages/cultures/geographical areas it applies to.
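
As an illustration of what such private amendments could look like in
practice, a server might simply layer a small site-specific table on top
of the default folding.  Continuing the Python sketch above (the table
contents are just examples taken from this discussion, not a proposal):

  # Hypothetical site-specific fallbacks, applied before the default rule.
  PRIVATE_MAP = {
      "ø": "o",  "Ø": "O",     # no Unicode decomposition to fall back on
      "æ": "ae", "Æ": "AE",
      "þ": "th", "Þ": "TH",
      "ð": "dh", "Ð": "DH",
  }

  def fold_with_private_map(s):
      s = "".join(PRIVATE_MAP.get(c, c) for c in s)
      return fold_nfkd(s)      # then the default rule from the sketch above

  print(fold_with_private_map("Bjørn"))   # -> "Bjorn"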

Kenneth Whistler said: You could search the Unicode email archives for
"fallback".  Much of that discussion will be about fallback display of
glyphs, but there have also been discussions about fallback conversion
of characters.  There might be further examples or pointers somewhere in
there.  (I did that, but didn't find much that related to my purpose at
the time.  I don't remember whether the matches would relate to your
more general rules.  URL <http://www.unicode.org/mail-arch/>.  Note the
user name and password at that page.)


Anyway, I suggest that the draft should not specify exactly how these
rules match, but just give a default rule such as the one suggested
above and allow private amendments.  How much implementations may differ
from the default would depend on the intended purpose of these matching
rules.

So what is the intended purpose of these rules?

For example,

- Are they only intended to be useful for Latin scripts, or could people
  who use other scripts add similar rules?

- Another suggestion on the Unicode mailing list was: for Korean
  syllables (U+AC00 - U+D7A3), you can use 'Hangul Syllable Short
  Names', which can be derived algorithmically with small tables (see
  the sketch after this list).

- What is the best trade-off between getting successful matches of
  strings intended to be equivalent, and not getting too many matches
  due to loss of semantics?  For example, how about non-letters?  Would
  it be useful to let some or all punctuation match each other, or to
  treat it all as space?  That way "J. Doe" would match "J Doe" (also
  shown in the sketch below).
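
Two of these are easy to experiment with.  Continuing the earlier Python
sketch (again only an illustration, with made-up function names):
unicodedata already knows the algorithmically derived Hangul syllable
names, and folding punctuation to space is a one-line filter.

  import unicodedata

  def hangul_short_name(ch):
      # For a precomposed Hangul syllable (U+AC00..U+D7A3), return the
      # romanised name that Unicode derives algorithmically, e.g. "GANG"
      # for U+AC15; other characters are returned unchanged.
      if "\uac00" <= ch <= "\ud7a3":
          return unicodedata.name(ch).removeprefix("HANGUL SYLLABLE ")
      return ch

  def punctuation_to_space(s):
      # Treat all punctuation (general categories P*) as space, so that
      # "J. Doe" matches "J Doe" once insignificant spaces are folded.
      return "".join(" " if unicodedata.category(c).startswith("P") else c
                     for c in s)

  print(hangul_short_name("\uac15"))      # -> "GANG"
  print(punctuation_to_space("J. Doe"))   # -> "J  Doe"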

Once the intent is clarified, I suggest you take this to the Unicode
mailing list for advice.

================================================================

A few other notes:

- I'm not sure what the point of the ordering rule is.  Even though I'd
  like ö to match o, they should be sorted as different characters
  (because ö usually means ø here).  caseIgnoreOrderingMatch does not
  give the desired result either, but then we are just swapping one
  wrong rule for another.

- FYI, caseIgnoreMatch & co are about to be updated to do some string
  preparation, see <draft-ietf-ldapbis-strprep-03.txt> and
  <draft-ietf-ldapbis-syntaxes-07.txt> from the LDAPbis (LDAP revision)
  working group.

-- 
Hallvard

_______________________________________________
Ldapext mailing list
Ldapext@ietf.org
https://www1.ietf.org/mailman/listinfo/ldapext