
Re: RE : Add flag to UTF8normalize and pals to allow accent stripping

To: jean-frederic clere <jfrederic.clere@fujitsu-siemens.com>
Subject: Re: RE : Add flag to UTF8normalize and pals to allow accent stripping
From: Stig Venaas <Stig@OpenLDAP.org>
Date: Tue, 26 Feb 2002 14:17:20 +0100
Cc: John Hughes <john@Calva.COM>, "'OpenLDAP DEVEL'" <openldap-devel@OpenLDAP.org>
Content-disposition: inline
In-reply-to: <3C7A6752.9792E1B4@fujitsu-siemens.com>; from jfrederic.clere@fujitsu-siemens.com on Mon, Feb 25, 2002 at 05:33:22PM +0100
References: <20020225142005.A29310@itea.ntnu.no> <000401c1be08$334dff50$f70127d5@britannic> <20020225155820.A8473@itea.ntnu.no> <3C7A6752.9792E1B4@fujitsu-siemens.com>
User-agent: Mutt/1.2.5i

I've found a better solution to this problem (IMO). Rather than
changing the code, it can be done by only changing the Unicode
tables.

The simple solution, which works in most cases, is as follows.
Say you want e and é to match. You can then edit the UnicodeData.txt
file and replace the two lines

00C9;LATIN CAPITAL LETTER E WITH ACUTE;Lu;0;L;0045 0301;;;;N;LATIN CAPITAL LETTER E ACUTE;;;00E9;
00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;LATIN SMALL LETTER E ACUTE;;00C9;;00C9

with

00C9;LATIN CAPITAL LETTER E WITH ACUTE;Lu;0;L;0045;;;;N;LATIN CAPITAL LETTER E ACUTE;;;00E9;
00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065;;;;N;LATIN SMALL LETTER E ACUTE;;00C9;;00C9
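
If there are many such lines, the edit can be scripted. The following is
only a sketch, assuming the decomposition is the sixth semicolon-separated
field of each UnicodeData.txt line and that tagged compatibility mappings
(those starting with "<") should be left alone. The function name
strip_marks and the default set of marks are made up for illustration.

```python
def strip_marks(line, marks=frozenset({"0301"})):
    """Drop the given combining marks from a canonical decomposition field."""
    fields = line.rstrip("\n").split(";")
    if len(fields) <= 5:
        return line.rstrip("\n")  # not a regular UnicodeData.txt record
    decomp = fields[5].split()
    # Only touch untagged (canonical) decompositions with more than one
    # code point; "<compat> ..." style mappings are left as they are.
    if len(decomp) > 1 and not decomp[0].startswith("<"):
        kept = [cp for cp in decomp if cp not in marks]
        if kept:
            fields[5] = " ".join(kept)
    return ";".join(fields)


before = ("00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;"
          "LATIN SMALL LETTER E ACUTE;;00C9;;00C9")
print(strip_marks(before))
```

To strip more accents, pass a larger set of combining code points as
marks, for example {"0300", "0301", "0302"}.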

Next, run something like:

cd /usr/local/src/openldap/libraries/liblunicode
./ucgendat -o /usr/local/openldap/share/openldap/ucdata -x CompositionExclusions.txt UnicodeData.txt

If you edit UnicodeData.txt before building, you can just build as usual.
The only problem with this approach is that if a string is passed to slapd
in decomposed form (in most cases it won't be), the match will fail,
because the separate combining accent is still there. So I suggest you
experiment with this. The .dat files you create can be reused with new
versions later.
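
To see why decomposed input still fails, here is a toy model of the
edited decomposition table. It is not slapd's code; the table and the
decompose function are made up for illustration, and the real
normalization does more than this.

```python
# Toy table after the edit: the accented letters map to the bare letters.
EDITED_DECOMP = {0x00C9: [0x0045], 0x00E9: [0x0065]}


def decompose(text, table=EDITED_DECOMP):
    """Replace each character by its table entry, if it has one."""
    out = []
    for ch in text:
        out.extend(chr(cp) for cp in table.get(ord(ch), [ord(ch)]))
    return "".join(out)


precomposed = "caf\u00e9"   # e-acute as a single code point
decomposed = "cafe\u0301"   # plain e followed by combining acute

print(decompose(precomposed) == "cafe")  # True: the accent is gone
print(decompose(decomposed) == "cafe")   # False: U+0301 is untouched
```

The combining acute has no entry of its own in the table, so nothing
removes it when it arrives as a separate character.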

The best approach would be to alter comp.dat; that would also work if
the data is passed to slapd in decomposed form. This is more difficult
since it's a binary file. But again, you won't have to do this each time
you compile a new version. The format is quite simple. If you pipe it
through "od -x", you will see for instance:


0001040 0045 0000 0300 0000 00c9 0000 0002 0000
0001060 0045 0000 0301 0000 00ca 0000 0002 0000

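which means that 00c9 is composed of the 2 characters 0045 and 0301.
Each value takes two 16-bit words in the dump, and the record for 00c9
straddles the two lines: it starts with "00c9 0000 0002 0000" at the end
of the first line and continues with "0045 0000 0301 0000" at the start
of the second. What needs to be done is to replace 00c9 with 0045. This
can be done with hexl-mode in emacs or other tools. This is a bit
tedious, but the comp.dat you create can be reused when you install new
versions.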
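
The same edit can be scripted instead of done by hand. This is a minimal
sketch, assuming each record is four 32-bit values (composite, count,
first, second) as the od dump suggests, and that they are stored
little-endian as on the machine that produced the dump. The function
name strip_composition is made up, and the file header is deliberately
not parsed: the code just searches for the 16-byte record.

```python
import struct


def strip_composition(data, composite, base, mark, byteorder="<"):
    """Replace the composite of one (base, mark) record with the base itself.

    Returns the patched bytes and the number of records that matched.
    """
    old = struct.pack(byteorder + "4I", composite, 2, base, mark)
    new = struct.pack(byteorder + "4I", base, 2, base, mark)
    return data.replace(old, new), data.count(old)


# Two records as in the dump: 00c8 = 0045 0300 and 00c9 = 0045 0301.
sample = struct.pack("<8I", 0xC8, 2, 0x45, 0x300, 0xC9, 2, 0x45, 0x301)
patched, hits = strip_composition(sample, 0xC9, 0x45, 0x301)
print(hits, [hex(v) for v in struct.unpack("<8I", patched)])
```

For a real file, read comp.dat in binary mode, check that exactly one
record matched, and write the result to a new file so the original is
kept. On a big-endian machine, pass byteorder=">".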

Stig

