
Re: LDAPprep: mapping of " " values



I don't know how complex substring matching should get, but it does seem to me that there is a use for being able to express word-aligned substring matches without jumping through a lot of hoops or thinking deeply about edge cases.

How about the following:

-- existing wording of substring matching rule

The rule evaluates to TRUE if and only if the prepared substrings of the assertion value match disjoint portions of the prepared attribute value character string in the order of the substrings in the assertion value, and

an <initial> substring, if present, matches the beginning of the prepared attribute value character string, and

a <final> substring, if present, matches the end of the prepared attribute value character string

-- proposed addition:

, and

an <any> substring, if present and starting with an insignificant space as per [strprep], either matches the beginning of the prepared attribute value character string or matches the prepared attribute value character string at a position immediately following a "breaking space", and

an <any> substring, if present and ending with an insignificant space as per [strprep], either matches the end of the prepared attribute value character string or matches the prepared attribute value character string at a position immediately preceding a "breaking space"

where "breaking space" is defined (similarly to [strprep]) as a SPACE (U+0020) code point that is not followed by a combining mark.

-- end of proposed addition.

This would have the effect that initial and final spaces in substring matches would have the intuitive meaning of restricting the substring match to a word boundary. It would also deal with the issue of spaces used as base characters for freestanding combining characters, since these would not count as "breaking spaces" or "insignificant spaces".

This would allow, for example, the filter:
  (cn=* John * Ellsworth)
to match:
  John Ellsworth
  P. John Ellsworth
  John P. Ellsworth
which is probably what was intended.

It would also allow the filter:
  (|(description=* Angola *)(description=* Mozambique *))
to find references to lusophone African countries, whether or not the keywords appeared in initial or final positions.
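
To make the proposed addition concrete, here is a rough Python sketch of how a single <any> substring might be tested. It is only an illustration under several assumptions: the names is_combining, is_breaking_space and any_matches are hypothetical, "combining mark" is approximated by the Unicode general categories Mn, Mc and Me, the strings are taken as already prepared apart from their spaces, and the leading or trailing space of the substring is treated as a boundary condition that does not consume a character of the attribute value.

```python
import unicodedata


def is_combining(ch: str) -> bool:
    # Assumption: categories Mn, Mc, Me stand in for "combining mark".
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")


def is_breaking_space(s: str, i: int) -> bool:
    """True if s[i] is SPACE (U+0020) and is not followed by a combining mark."""
    if s[i] != " ":
        return False
    return i + 1 == len(s) or not is_combining(s[i + 1])


def any_matches(value: str, any_sub: str) -> bool:
    """Test one <any> substring against a value under the proposed rule."""
    # A leading space only counts if it is not the base of a combining mark.
    lead = len(any_sub) > 1 and is_breaking_space(any_sub, 0)
    # A trailing space has nothing after it, so it is always insignificant here.
    trail = len(any_sub) > 1 and any_sub.endswith(" ")
    start = 1 if lead else 0
    end = len(any_sub) - 1 if trail else len(any_sub)
    core = any_sub[start:end]
    if not core:
        # Nothing but spaces: fall back to a plain substring test.
        return any_sub in value

    i = value.find(core)
    while i != -1:
        j = i + len(core)
        # Leading space: match at the start or right after a breaking space.
        ok_left = not lead or i == 0 or is_breaking_space(value, i - 1)
        # Trailing space: match at the end or right before a breaking space.
        ok_right = not trail or j == len(value) or is_breaking_space(value, j)
        if ok_left and ok_right:
            return True
        i = value.find(core, i + 1)
    return False
```

With this sketch, any_matches("Trade with Angola", " Angola ") is true while any_matches("Angolan exports", " Angola ") is false, and a space that carries a combining mark (as in "John \u0301x") does not satisfy the trailing-space condition of "John ".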


I don't believe it would add much complexity to the matching algorithm. It might be desirable to add a rule to [strprep] that changes SPACE (U+0020) to NO-BREAK SPACE (U+00A0) if the following character is a combining character, applied just before the insignificant space removal step (at which point there cannot be any other NO-BREAK SPACEs in the string).
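
The extra [strprep] rule could look something like the following Python sketch. The function name protect_combining_spaces is hypothetical, and "combining character" is again approximated by the general categories Mn, Mc and Me, which is an assumption rather than the definition [strprep] would have to use.

```python
import unicodedata


def protect_combining_spaces(s: str) -> str:
    """Change SPACE to NO-BREAK SPACE where the next character is a combining mark."""
    out = []
    for i, ch in enumerate(s):
        next_is_combining = (
            i + 1 < len(s)
            # Assumption: Mn, Mc, Me approximate "combining character".
            and unicodedata.category(s[i + 1]) in ("Mn", "Mc", "Me")
        )
        if ch == " " and next_is_combining:
            out.append("\u00a0")  # NO-BREAK SPACE marks a non-breaking base
        else:
            out.append(ch)
    return "".join(out)
```

Run before insignificant space removal, this would leave only breaking spaces as U+0020, so the later steps could tell the two kinds of space apart by code point alone.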