
Re: New Phonetic Design



At 03:02 AM 9/22/2004, Alexandre PAUZIES wrote:
>Why do we need a new design?
>##########################
>
>- for each  new phonetic algorithm/language  we need to implement  a new
>  function, add #define etc...
>- phonetic functions are not easy to understand or implement
>- the use  of strcmp for matching  does not allow a  flexible match (see
>  SLAPD_PHONETIC_V2_PRECISION)
>- the use of #define does not allow switching between algorithms/languages
>  at runtime (so it cannot be used with language codes)
>
>
>What does this design look like?
>################################

I would like to see the creation of a plugin mechanism
as well, to allow complete replacement of the approximate
matching functions.
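
One way to picture such a hook, as a minimal sketch only: the matcher
is reached through a function pointer that a plugin may overwrite.
None of the names below (approx_match_fn, set_approx_match,
approx_match, first_letter_match) exist in slapd; they are hypothetical
and only illustrate the idea of replacing the matching function at
runtime. A real plugin mechanism would also need loading and
registration code, which is left out here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical hook type: returns 1 when the two values match. */
typedef int (*approx_match_fn)(const char *a, const char *b);

/* Built-in fallback: plain exact comparison. */
static int exact_match(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* Toy replacement a plugin might supply: match on the first letter. */
static int first_letter_match(const char *a, const char *b)
{
    return a[0] == b[0];
}

/* The matcher currently in use; starts out as the built-in one. */
static approx_match_fn current_match = exact_match;

/* Hypothetical registration call: swap in another matcher. */
static void set_approx_match(approx_match_fn fn)
{
    assert(fn != NULL);
    current_match = fn;
}

/* Callers always go through this wrapper, never a fixed function. */
static int approx_match(const char *a, const char *b)
{
    return current_match(a, b);
}
```

With this shape, the code doing approximate matching never needs to
know which algorithm is installed.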

>- A new  language could  easily be  added by a  new entry  (lang, rules,
>  post-rules) in the phonetic lang table.
>- A new  algorithm could easily be  added by writing  a simple table of
>  rules.
>- Each rule is an action (find/replace...) with a set of conditions (is
>  preceded by...) which are easy to implement.
>- Each post-phonetic rule is a simple table of ordered characters.
>- The precision of this phonetic mechanism could easily be changed.
>- The default phonetic language could be changed from config file.

It would be good if the table(s) was(were) externally configurable
(via slapd.conf(5) or other means) instead of being hard coded.





>How does this one work?
>#########################
>
>1) The phonetic rules
>---------------------
>
>- You need to write your own language/algorithm phonetic rules:
>
>Here I define rules for the French language and the Phonex algorithm (by
>Frederic BROUARD)
>
>
>static rule_t   phonetic_rules_fr_phonex[] =
>  {
>  }
>
>A rule is defined by an  action (e.g. FIND_REPLACE) with its arguments
>(e.g. "h" -> "") and by a set of conditions (e.g. NOT PRECEDED BY 'c' OR
>'s' OR 'p').
>
>
>this example :
>
>    { {FIND_REPLACE, {"h", ""}}, {{PRECEDED, "csp", NOT|OR}} },
>
>will delete every character 'h' that is not preceded by 'c', 's' or
>'p'.
>
>
>You can write rules with more than one condition, like this:
>
>    {   {FIND_REPLACE,  {"s",   "z"}},  {{FOLLOWED,   "aeiou1234",  OR},
>                                         {PRECEDED, "aeiou1234", OR}} },
>
>
>Another example: I want to delete the character 't' if it ends the word:
>
>    { {FIND_REPLACE, {"t", ""}}, {{FOLLOWED, ALL, AND|NOT}} },
>
>etc...
>
>You can find more examples by looking in "phonetic.h".
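
To make the three rule examples above concrete, here is a minimal
sketch of how one such rule might be applied to a word. The real types
live in phonetic.h, which is not shown in this message, so everything
below is an assumption: sketch_rule_t, cond_t, COND_NOT, ANY_CHAR
(standing in for ALL) and apply_rule are hypothetical, the AND/OR flags
are reduced to simple set membership, only single-character searches
are handled, and conditions are tested against the input word rather
than the partially rewritten one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the types in phonetic.h. */
enum { PRECEDED, FOLLOWED };
#define COND_NOT 0x1        /* negate the condition */
#define ANY_CHAR NULL       /* stands in for ALL: any character */

typedef struct {
    int type;               /* PRECEDED or FOLLOWED */
    const char *set;        /* accepted neighbours, or ANY_CHAR */
    int flags;              /* 0 or COND_NOT */
} cond_t;

typedef struct {
    char find;              /* single character to search for */
    const char *repl;       /* replacement text; "" deletes */
    cond_t conds[2];
    int nconds;
} sketch_rule_t;

/* Test one condition for the character at word[i]. */
static int cond_holds(const cond_t *c, const char *word, size_t i)
{
    /* Neighbour character, or '\0' at a word boundary. */
    char n = (c->type == PRECEDED) ? (i > 0 ? word[i - 1] : '\0')
                                   : word[i + 1];
    int hit = n != '\0' &&
              (c->set == ANY_CHAR || strchr(c->set, n) != NULL);
    return (c->flags & COND_NOT) ? !hit : hit;
}

/* Copy in to out, replacing each match whose conditions all hold.
 * Output is silently truncated if out is too small. */
static void apply_rule(const sketch_rule_t *r, const char *in,
                       char *out, size_t outsz)
{
    size_t o = 0;
    assert(r != NULL && in != NULL && out != NULL && outsz > 0);
    for (size_t i = 0; in[i] != '\0'; i++) {
        int match = (in[i] == r->find);
        for (int k = 0; match && k < r->nconds; k++)
            match = cond_holds(&r->conds[k], in, i);
        if (match) {
            for (const char *s = r->repl; *s && o + 1 < outsz; s++)
                out[o++] = *s;
        } else if (o + 1 < outsz) {
            out[o++] = in[i];
        }
    }
    out[o] = '\0';
}
```

Under these assumptions the 'h' rule turns "hachette" into "achette"
(the second 'h' follows a 'c' and is kept), the 's' rule turns "maison"
into "maizon", and the final-'t' rule turns "petit" into "peti".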
>
>
>2) Post-phonetic rules
>----------------------
>
>At this point you have a phonetic function that returns a phonetic copy
>of  the  word (like  the  old  function),  but  you cannot  select  how
>permissive/flexible the match will be. That is what the post_phonetic
>function is for.
>
>
>You  need to  define post-phonetic  rules by  assigning an  integer (the
>position of the char in the "char tab[]") to each character that is not
>replaced/deleted by your phonetic algorithm.
>
>
>These rules will be used to convert your phonetic word copy to a string
>representing a float value.
>
>
>Example :
>
>static char     phonetic_post_rules_fr_phonex[22] =
>  {
>    '1', '2', '3', '4', '5', 'e', 'f', 'g', 'h', 'i', 'k',
>    'l', 'n', 'o', 'r', 's', 't', 'u', 'w', 'x', 'y', 'z'
>  };
>
>will assign number 0 to char '1' ... and number 21 to char 'z'
>
>In this example, those numbers will be converted to base 22 and the sum
>of all the new numbers will become a float. This float number will be
>stored in a string.
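
The conversion is not spelled out in detail above, so the following is
only one plausible reading, given as a sketch: each character's table
position is treated as a digit of a base-22 fraction (first character
most significant), and the resulting value is printed as a decimal
string. The function name post_phonetic_sketch, the error return for
characters missing from the table, and the 12-digit output format are
all assumptions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Same 22 characters as the example table in this message. */
static const char post_rules[22] = {
    '1', '2', '3', '4', '5', 'e', 'f', 'g', 'h', 'i', 'k',
    'l', 'n', 'o', 'r', 's', 't', 'u', 'w', 'x', 'y', 'z'
};

/* Convert a phonetic word to a decimal string in [0, 1).
 * Returns 0 on success, -1 if a character is not in the table. */
static int post_phonetic_sketch(const char *word, char *out, size_t outsz)
{
    double value = 0.0;
    double weight = 1.0;

    assert(word != NULL && out != NULL && outsz > 0);
    for (; *word != '\0'; word++) {
        const char *p = memchr(post_rules, *word, sizeof post_rules);
        if (p == NULL)
            return -1;              /* character has no assigned number */
        weight /= 22.0;             /* next base-22 fractional place */
        value += (double)(p - post_rules) * weight;
    }
    snprintf(out, outsz, "%.12f", value);
    return 0;
}
```

With this reading, words that share their leading phonetic characters
tend to produce strings that share their leading digits, which is what
a prefix comparison can then exploit.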
>
>
>So, to set the precision/flexibility of this new phonetic mechanism, you
>need  to  set  SLAPD_PHONETIC_V2_PRECISION  (in  schema_init.c)  to  the
>number of significant figures in your float (string) value.
>
>
>Then  a  strncmp(word, post_phonetic_word, SLAPD_PHONETIC_V2_PRECISION)
>is done to perform the match.
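
The effect of the precision on the comparison can be sketched as
follows. The helper phonetic_match is hypothetical and takes the
precision as a parameter (instead of the compile-time
SLAPD_PHONETIC_V2_PRECISION) only so that two settings can be compared
side by side. Note that strncmp counts characters, so a leading "0."
in the string uses up two of them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: two post-phonetic strings match when their
 * first `precision` characters are identical. */
static int phonetic_match(const char *a, const char *b, size_t precision)
{
    assert(a != NULL && b != NULL);
    return strncmp(a, b, precision) == 0;
}
```

For example, "0.227272" and "0.227999" match with a precision of 5
(both start with "0.227") but not with a precision of 6.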
>
>
>
>3) The Phonetic language table
>------------------------------
>
>Once you've defined  your phonetic and post-phonetic rules,  you need to
>add them for your language to phonetic_lang[] :
>
>
>static phonetic_t       phonetic_lang[] =
>  {
>    {"fr", phonetic_rules_fr_phonex, phonetic_post_rules_fr_phonex},
>    {NULL, NULL, NULL},
>  };
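
The {NULL, NULL, NULL} sentinel suggests a simple linear lookup by
language code. A minimal sketch, with placeholder types: lang_entry_t
and find_lang are hypothetical, and the rule fields are reduced to
strings because the real rule_t layout is not shown in this message.
The comparison is case-sensitive here, which may or may not be what
the real code does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for phonetic_t. */
typedef struct {
    const char *lang;        /* language code, NULL ends the table */
    const char *rules;       /* placeholder for the rule table */
    const char *post_rules;  /* ordered post-phonetic characters */
} lang_entry_t;

static const lang_entry_t lang_table[] = {
    {"fr", "fr_phonex", "12345efghiklnorstuwxyz"},
    {NULL, NULL, NULL},
};

/* Return the entry for lang, or NULL when the language is unknown. */
static const lang_entry_t *find_lang(const lang_entry_t *table,
                                     const char *lang)
{
    assert(table != NULL && lang != NULL);
    for (; table->lang != NULL; table++) {
        if (strcmp(table->lang, lang) == 0)
            return table;
    }
    return NULL;
}
```

Adding a language then amounts to adding one line before the sentinel.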
>
>
>4) Slapd.conf
>-------------
>
>Set the default "lang" option in your slapd.conf like this:
>
>lang fr
>
>so the phonetic function knows which rules to use for your language.
>
>
>5) Enable the new phonetic mechanism
>------------------------------------
>
>Finally, add the "--enable-phonetic2" option to your configure script.
>
>
>To do:
>######
>
>Maybe more actions/conditions should be added to this new mechanism to
>suit all languages.
>
>The LDAP_UTF8_APPROX  flag passed to UTF8bvnormalize could  be a problem
>(i.e. I can't do actions or check conditions on accented characters).
>
>The "lang" option in the config  file should be the default lang and not
>the only one because for attributes with language codes we should select
>the corresponding  phonetic rules  if there is  one, or the  default one
>(config file defined).
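
The fallback described in this last item could look roughly like the
sketch below. It assumes the language code arrives as a "lang-xx"
option in the attribute description (for example "cn;lang-fr"), and
that a NULL-terminated list of supported codes is available. The
function pick_lang and its parameters are hypothetical, and the option
is matched case-sensitively to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Pick the phonetic language for an attribute description.
 * Returns buf (holding the tag) when the description carries a
 * supported lang- option, otherwise the configured fallback. */
static const char *pick_lang(const char *desc, const char *const *known,
                             const char *fallback, char *buf, size_t bufsz)
{
    const char *tag;
    size_t len;

    assert(desc != NULL && known != NULL && fallback != NULL);
    assert(buf != NULL && bufsz > 0);

    tag = strstr(desc, ";lang-");
    if (tag == NULL)
        return fallback;            /* no language option at all */
    tag += strlen(";lang-");
    len = strcspn(tag, ";");        /* stop at the next option */
    if (len == 0 || len >= bufsz)
        return fallback;
    memcpy(buf, tag, len);
    buf[len] = '\0';

    for (; *known != NULL; known++) {
        if (strcmp(*known, buf) == 0)
            return buf;             /* rules exist for this language */
    }
    return fallback;                /* unknown language: use default */
}
```

Here "cn;lang-fr" selects "fr" when French rules are available, while
"cn;lang-de" or a plain "cn" falls back to the configured default.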
>
>
>
>Any comments will be appreciated.
>
>Best regards,
>
>Alexandre.
>
>
>-- 
>Alexandre PAUZIES <apauzies@linagora.com>
>LINAGORA - http://www.linagora.com/