Re: New Phonetic Design
At 03:02 AM 9/22/2004, Alexandre PAUZIES wrote:
>Why do we need a new design?
>############################
>
>- for each new phonetic algorithm/language we need to implement a new
> function, add a #define, etc.
>- phonetic functions are not easy to understand or implement
>- using strcmp for matching does not allow a flexible match (see
> SLAPD_PHONETIC_V2_PRECISION)
>- using #define does not allow switching algorithm/language at
> runtime (so it cannot be used with language codes)
>
>
>What does this design look like?
>################################
I would like to see the creation of a plugin mechanism
as well, to allow complete replacement of the approximate
matching functions.
>- A new language can easily be added with a new entry (lang, rules,
> post-rules) in the phonetic lang table.
>- A new algorithm can easily be added by writing a simple table of
> rules.
>- Each rule is an action (find/replace...) with a set of conditions (is
> preceded by...), which are easy to implement.
>- Each post-phonetic rule set is a simple table of ordered characters.
>- The precision of this phonetic mechanism can easily be changed.
>- The default phonetic language can be changed from the config file.
It would be good if the tables were externally configurable
(via slapd.conf(5) or other means) instead of being hard coded.
>How does this one work?
>#######################
>
>1) The phonetic rules
>---------------------
>
>- You need to write your own language/algorithm phonetic rules:
>
>Here I define rules for the French language and the phonex algorithm (by
>Frederic BROUARD)
>
>
>static rule_t phonetic_rules_fr_phonex[] =
> {
> };
>
>A rule is defined by an action (e.g. FIND_REPLACE) with its arguments
>(e.g. "h" -> "") and by a set of conditions (e.g. NOT PRECEDED BY 'c' OR
>'s' OR 'p').
>
>
>This example:
>
> { {FIND_REPLACE, {"h", ""}}, {{PRECEDED, "csp", NOT|OR}} },
>
>will delete every character 'h' that is not preceded by the character
>'c', 's' or 'p'.
>
>
>You can write rules with more than one condition, like this:
>
> { {FIND_REPLACE, {"s", "z"}}, {{FOLLOWED, "aeiou1234", OR},
> {PRECEDED, "aeiou1234", OR}} },
>
>
>Another example: to delete the character 't' when it ends the word:
>
> { {FIND_REPLACE, {"t", ""}}, {{FOLLOWED, ALL, AND|NOT}} },
>
>etc...
>
>You can find more examples in "phonetic.h".
>
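
A minimal sketch of how such a rule could be evaluated, to make the three examples above concrete. None of this is taken from the patch: the names cond, cond_holds, apply_rule and the COND_* flags are hypothetical, the pattern is limited to a single character, and two semantics are assumed rather than confirmed: every condition of a rule must hold, and a NULL set stands for ALL (any character). The OR/AND flags are carried but not interpreted here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names, modelled on the rule syntax quoted above. */
enum { COND_OR = 1, COND_AND = 2, COND_NOT = 4 };
enum cond_kind { PRECEDED, FOLLOWED };

struct cond {
    enum cond_kind kind;
    const char *set;    /* neighbour characters; NULL means any character */
    int flags;          /* only COND_NOT is interpreted in this sketch */
};

/* Does condition c hold for the character at word[i]? */
static int cond_holds(const struct cond *c, const char *word, size_t i)
{
    char n;             /* neighbouring character, '\0' when there is none */
    int hit;

    if (c->kind == PRECEDED)
        n = (i > 0) ? word[i - 1] : '\0';
    else
        n = word[i + 1];            /* '\0' at the end of the word */

    if (n == '\0')
        hit = 0;                    /* no neighbour: nothing to match */
    else if (c->set == NULL)
        hit = 1;                    /* ALL: any neighbour matches */
    else
        hit = strchr(c->set, n) != NULL;

    return (c->flags & COND_NOT) ? !hit : hit;
}

/*
 * Replace every occurrence of `find` for which all conditions hold.
 * Neighbours are read from the input word, not from the output.
 * Returns 0 on success, -1 if `out` is too small.
 */
int apply_rule(const char *word, char find, const char *repl,
               const struct cond *conds, size_t nconds,
               char *out, size_t outsz)
{
    size_t i, k, o = 0, rlen;

    assert(word != NULL && repl != NULL && out != NULL);
    if (outsz == 0)
        return -1;
    rlen = strlen(repl);

    for (i = 0; word[i] != '\0'; i++) {
        int match = (word[i] == find);

        for (k = 0; match && k < nconds; k++)
            match = cond_holds(&conds[k], word, i);

        if (match) {
            if (o + rlen >= outsz)
                return -1;
            memcpy(out + o, repl, rlen);
            o += rlen;
        } else {
            if (o + 1 >= outsz)
                return -1;
            out[o++] = word[i];
        }
    }
    out[o] = '\0';
    return 0;
}
```

Under these assumptions the 'h' rule turns "thomas" into "tomas" and leaves "chat" alone, the 's' rule turns "rose" into "roze", and the 't' rule turns "chat" into "cha".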
>
>2) Post-phonetic rules
>----------------------
>
>At this point you have a phonetic function that returns a phonetic copy
>of the word (like the old function), but you cannot select how
>permissive/flexible the match will be. That is what the post_phonetic
>function is for.
>
>
>You need to define post-phonetic rules by assigning an integer (the
>position of the char in the "char tab[]") to each character that is not
>replaced or deleted by your phonetic algorithm.
>
>
>These rules will be used to convert the phonetic copy of your word to a
>string representing a float value.
>
>
>Example:
>
>static char phonetic_post_rules_fr_phonex[22] =
> {
> '1', '2', '3', '4', '5', 'e', 'f', 'g', 'h', 'i', 'k',
> 'l', 'n', 'o', 'r', 's', 't', 'u', 'w', 'x', 'y', 'z'
> };
>
>will assign number 0 to char '1' ... and number 21 to char 'z'.
>
>In this example, those numbers will be converted to base 22 and the sum
>of all the new numbers will become a float. This float will be stored
>in a string.
>
>
>So, to set the precision/flexibility of this new phonetic mechanism, you
>need to set SLAPD_PHONETIC_V2_PRECISION (in schema_init.c) to the
>number of significant figures in your float (string) value.
>
>
>Then a strncmp(word, post_phonetic_word, SLAPD_PHONETIC_V2_PRECISION)
>will be done to perform the match.
>
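
The message does not spell out the exact conversion, so the following is only one possible reading, sketched to show how a prefix comparison on the resulting string gives a tunable match. Assumptions: each character's table position is used as a base-22 digit of a fraction (first character most significant), the value is printed with ten decimals, and the precision simply counts leading characters of that string, including the "0.". The names post_index, post_phonetic and approx_match are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Same ordering as the table quoted above. */
static const char post_rules[22] = {
    '1', '2', '3', '4', '5', 'e', 'f', 'g', 'h', 'i', 'k',
    'l', 'n', 'o', 'r', 's', 't', 'u', 'w', 'x', 'y', 'z'
};
#define POST_BASE (sizeof post_rules / sizeof post_rules[0])

/* Position of c in the table, or -1 if c is not a post-phonetic character. */
int post_index(char c)
{
    size_t i;

    for (i = 0; i < POST_BASE; i++)
        if (post_rules[i] == c)
            return (int)i;
    return -1;
}

/*
 * Assumed conversion: value = sum of index(word[i]) / 22^(i+1),
 * written to `out` as a decimal string. Returns 0 on success, -1 on
 * an unknown character or a buffer that is too small.
 */
int post_phonetic(const char *word, char *out, size_t outsz)
{
    double value = 0.0, weight = 1.0;
    size_t i;
    int n;

    assert(word != NULL && out != NULL);
    for (i = 0; word[i] != '\0'; i++) {
        int d = post_index(word[i]);

        if (d < 0)
            return -1;
        weight /= (double)POST_BASE;
        value += d * weight;
    }
    n = snprintf(out, outsz, "%.10f", value);
    return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}

/* Compare only the first `precision` characters of both strings. */
int approx_match(const char *a, const char *b, size_t precision)
{
    char sa[32], sb[32];

    if (post_phonetic(a, sa, sizeof sa) != 0 ||
        post_phonetic(b, sb, sizeof sb) != 0)
        return 0;
    return strncmp(sa, sb, precision) == 0;
}
```

With this reading, words that share their leading phonetic characters produce strings with a common prefix, so a smaller precision makes the match more permissive and a larger one makes it stricter.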
>
>
>3) The Phonetic language table
>------------------------------
>
>Once you've defined your phonetic and post-phonetic rules, you need to
>add them for your language to phonetic_lang[] :
>
>
>static phonetic_t phonetic_lang[] =
> {
> {"fr", phonetic_rules_fr_phonex, phonetic_post_rules_fr_phonex},
> {NULL, NULL, NULL},
> };
>
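
A sketch of how a NULL-terminated table like this could be searched at runtime, with a fallback to the configured default language when an attribute's language code has no entry. The struct lang_entry type, the find_lang and select_lang helpers, and the "en" entry are hypothetical; the rule tables are replaced by string labels to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for phonetic_t; rule tables reduced to labels. */
struct lang_entry {
    const char *lang;
    const char *rules;        /* placeholder for the rule table */
    const char *post_rules;   /* placeholder for the post-rule table */
};

static const struct lang_entry lang_table[] = {
    { "fr", "rules_fr_phonex", "post_rules_fr_phonex" },
    { "en", "rules_en_demo", "post_rules_en_demo" },  /* invented entry */
    { NULL, NULL, NULL }                              /* end marker */
};

/* Linear search up to the NULL end marker. */
const struct lang_entry *find_lang(const char *lang)
{
    const struct lang_entry *e;

    if (lang == NULL)
        return NULL;
    for (e = lang_table; e->lang != NULL; e++)
        if (strcmp(e->lang, lang) == 0)
            return e;
    return NULL;
}

/*
 * Prefer the attribute's own language code; fall back to the default
 * from the config file. Returns NULL if neither is in the table.
 */
const struct lang_entry *select_lang(const char *attr_lang,
                                     const char *default_lang)
{
    const struct lang_entry *e = find_lang(attr_lang);

    return (e != NULL) ? e : find_lang(default_lang);
}
```

Because the lookup happens at runtime, the choice of rules no longer depends on a compile-time #define.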
>
>4) Slapd.conf
>-------------
>
>Set the default "lang" option in your slapd.conf like this:
>
>lang fr
>
>so that the phonetic function knows which rules to use for your language.
>
>
>5) Enable the new phonetic mechanism
>------------------------------------
>
>Finally, pass the "--enable-phonetic2" option to your configure script.
>
>
>To do:
>######
>
>Maybe more actions/conditions should be added to this new mechanism to
>suit all languages.
>
>The LDAP_UTF8_APPROX flag passed to UTF8bvnormalize could be a problem
>(e.g. I can't perform actions or check conditions on accented characters).
>
>The "lang" option in the config file should be the default language and
>not the only one: for attributes with language codes we should select
>the corresponding phonetic rules if there are any, or else the default
>ones (defined in the config file).
>
>
>
>Any comments will be appreciated.
>
>Best regards,
>
>Alexandre.
>
>
>--
>Alexandre PAUZIES <apauzies@linagora.com>
>LINAGORA - http://www.linagora.com/