
RE: RE : Add flag to UTF8normalize and pals to allow accent stripping



Um... you shouldn't be changing UnicodeData.txt, though; that's an official
part of the Unicode standard. (Speaking of which, we should probably import
the 3.1.0 file into our source tree soon.) Anyway, I think this issue has
already been addressed in one of the DerivedProperties files; we just need
to add some routines to crunch them into binary tables and use them.

  -- Howard Chu
  Chief Architect, Symas Corp.       Director, Highland Sun
  http://www.symas.com               http://highlandsun.com/hyc
  Symas: Premier OpenSource Development and Support

> -----Original Message-----
> From: owner-openldap-devel@OpenLDAP.org
> [mailto:owner-openldap-devel@OpenLDAP.org]On Behalf Of Stig Venaas
> Sent: Tuesday, February 26, 2002 5:17 AM
> To: jean-frederic clere
> Cc: John Hughes; 'OpenLDAP DEVEL'
> Subject: Re: RE : Add flag to UTF8normalize and pals to allow accent
> stripping
>
>
> I've found a better solution to this problem (IMO). Rather than
> changing the code, it can be done by only changing the Unicode
> tables.
>
> The simple solution, which works in most cases, is as follows.
> Say you want e and é to match. You can then edit the UnicodeData.txt
> file and replace the two lines
>
> 00C9;LATIN CAPITAL LETTER E WITH ACUTE;Lu;0;L;0045 0301;;;;N;LATIN CAPITAL LETTER E ACUTE;;;00E9;
> 00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;LATIN SMALL LETTER E ACUTE;;00C9;;00C9
>
> with
>
> 00C9;LATIN CAPITAL LETTER E WITH ACUTE;Lu;0;L;0045;;;;N;LATIN CAPITAL LETTER E ACUTE;;;00E9;
> 00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065;;;;N;LATIN SMALL LETTER E ACUTE;;00C9;;00C9
>
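The edit above only touches the decomposition field (the sixth semicolon-separated field) of each record. As a rough sketch of the same change done by script rather than by hand: the helper name `strip_accent_mapping` is made up for illustration, and the choice to leave tagged compatibility mappings alone is an assumption, not something stated in the message.

```python
def strip_accent_mapping(record: str) -> str:
    """Reduce a canonical decomposition such as '0045 0301' to its base '0045'."""
    fields = record.rstrip("\n").split(";")
    # Field 5 of a UnicodeData.txt record holds the decomposition mapping.
    if len(fields) > 5:
        mapping = fields[5]
        # Leave empty mappings and tagged compatibility mappings
        # (for example '<noBreak> 0020') untouched; that is an assumption.
        if mapping and not mapping.startswith("<"):
            fields[5] = mapping.split()[0]
    return ";".join(fields)


original = ("00C9;LATIN CAPITAL LETTER E WITH ACUTE;Lu;0;L;0045 0301;;;;N;"
            "LATIN CAPITAL LETTER E ACUTE;;;00E9;")
edited = strip_accent_mapping(original)
print(edited)
```

Applied only to the records you care about, this reproduces the hand edit. Running it over the whole file would rewrite every canonical decomposition, which goes well beyond the two-line change described here.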
> Next you do something like:
>
> cd /usr/local/src/openldap/libraries/liblunicode
> ./ucgendat -o /usr/local/openldap/share/openldap/ucdata -x \
>     CompositionExclusions.txt UnicodeData.txt
>
> If you edit UnicodeData.txt before building, you can just build as usual.
> The only problem with this approach is that if a string is passed to slapd
> in decomposed form (in most cases it won't be), the match will fail. So I
> suggest you experiment with this. The .dat files you create can be reused
> with new versions later.
>
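To see why decomposed input is the weak spot, it may help to look at the two forms side by side. The snippet below uses Python's standard `unicodedata` module, so it reflects Python's own Unicode tables rather than anything slapd does; it is only meant to show what a "decomposed" é looks like on the wire.

```python
import unicodedata

composed = "\u00e9"  # é as a single code point
decomposed = unicodedata.normalize("NFD", composed)  # e followed by a combining acute

print([f"{ord(c):04X}" for c in composed])    # ['00E9']
print([f"{ord(c):04X}" for c in decomposed])  # ['0065', '0301']
```

With the edited table, U+00E9 maps to U+0065 alone, but a string that already arrives as 0065 0301 still carries the combining acute, so it would not compare equal to a plain e.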
> The best approach would be to alter comp.dat; that would also work if
> the data is passed to slapd in decomposed form. This is more difficult
> since it's a binary file. But again, you won't have to do this each time
> you compile a new version. The format is quite simple. If you pipe it
> through "od -x", you will see for instance:
>
>
> 0001040 0045 0000 0300 0000 00c9 0000 0002 0000
> 0001060 0045 0000 0301 0000 00ca 0000 0002 0000
>
> which means that 00c9 is composed of 2 characters, 0045 and 0301 (the
> entry starts at 00c9 0002 on the first line and continues with 0045 0301
> on the second). What needs to be done is to replace 00c9 with 0045. This
> can be done with hexl-mode in emacs or other tools. This is a bit tedious,
> but the comp.dat you create can be reused when you install new versions.
>
> Stig
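
For anyone who would rather script the comp.dat change than use a hex editor, here is a minimal sketch that works on the bytes shown in the dump above, not on a real file. It assumes 32-bit little-endian values and an entry layout of (composite, count, first, second), both inferred from the od output and the description; the file header is not handled, and the function name `patch_composite` is hypothetical.

```python
import struct


def patch_composite(data: bytes, composite: int, base: int, mark: int) -> bytes:
    """Replace `composite` with `base` in every (composite, 2, base, mark) run."""
    count = len(data) // 4
    values = list(struct.unpack(f"<{count}I", data[:count * 4]))
    # Scan at every 32-bit offset, since the entry alignment is not assumed.
    for i in range(count - 3):
        if values[i:i + 4] == [composite, 2, base, mark]:
            values[i] = base
    # Re-pack the words and keep any trailing bytes unchanged.
    return struct.pack(f"<{count}I", *values) + data[count * 4:]


# The two od rows from the message, as eight 32-bit words.
sample = struct.pack("<8I", 0x45, 0x300, 0xC9, 2, 0x45, 0x301, 0xCA, 2)
patched = patch_composite(sample, 0x00C9, 0x0045, 0x0301)
print([f"{v:04x}" for v in struct.unpack("<8I", patched)])
```

Try it on a copy of the file first and compare "od -x" output before and after; the layout assumptions above should be checked against the ucgendat source for your version.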