
Re: Searching behaviour with Unicode



At 04:50 AM 2001-10-11, Paul Gillingwater wrote:
>I wonder if anyone could advise how OpenLDAP handles search
>matching against entries containing accented characters. 
>I know these are stored as UTF-8, which is then MIME encoded
>for output as Base64.

In the protocol, values of directoryString syntax are sequences
of UTF-8 encoded ISO 10646-1 characters.  This holds not only
for attribute values, but assertion values as well.  There are
a number of intermediate formats associated with the protocol.
LDIF is one intermediate format for representing entries and
change requests.  LDIF allows values to be base64 encoded as
needed.  For filters, there is a string representation
detailed in RFC 2254.  This is used by ldap_search(3) and
ldapsearch(1).  The filter string representation allows hex
escaping as needed.
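
As an illustration of the hex escaping mentioned above, here is a
minimal Python sketch that builds an RFC 2254 filter string from a
value containing non-ASCII characters.  The helper name
escape_filter_value is hypothetical, not part of any OpenLDAP API.
It escapes the characters RFC 2254 requires (* ( ) \ NUL) and, as an
assumption made for 7-bit terminals, every octet above 0x7F.

```python
def escape_filter_value(value: str) -> str:
    """Hex-escape an assertion value for an RFC 2254 filter string."""
    # Octets that RFC 2254 requires to be escaped: * ( ) \ NUL
    special = {0x2A, 0x28, 0x29, 0x5C, 0x00}
    out = []
    for byte in value.encode("utf-8"):
        # Assumption: also escape every non-ASCII octet so the filter
        # can be typed or displayed on a device without UTF-8 support.
        if byte in special or byte >= 0x80:
            out.append("\\%02x" % byte)
        else:
            out.append(chr(byte))
    return "".join(out)


filter_string = "(cn=" + escape_filter_value("Su\u00e1rez Quint\u00e1ns") + ")"
print(filter_string)  # (cn=Su\c3\a1rez Quint\c3\a1ns)
```

The resulting string is pure ASCII and can be passed as the filter
argument to ldapsearch(1) or ldap_search(3).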

So, just like you (or the tools) can use LDIF's base64 encoding
when appropriate, you can use the filter string representation's
hex escaping when appropriate.   Use of these mechanisms is
generally appropriate when the I/O devices you are using do not
natively support UTF-8.
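
For the LDIF side, the following Python sketch shows roughly when a
value ends up base64 encoded.  The function name ldif_line is
hypothetical, the safety test is a simplified reading of the LDIF
rules in RFC 2849, and line folding is ignored.

```python
import base64


def ldif_line(attr: str, value: str) -> str:
    """Return one LDIF attribute line, base64 encoding the value if needed."""
    raw = value.encode("utf-8")
    # A value may not start with space, colon or less-than in plain form.
    unsafe_start = raw[:1] in (b" ", b":", b"<")
    # NUL, LF, CR and any non-ASCII octet force base64 encoding.
    unsafe_body = any(b >= 0x80 or b in (0x00, 0x0A, 0x0D) for b in raw)
    if unsafe_start or unsafe_body or raw.endswith(b" "):
        # The double colon marks the value as base64 encoded.
        return attr + ":: " + base64.b64encode(raw).decode("ascii")
    return attr + ": " + value


print(ldif_line("cn", "Suarez Quintans"))
print(ldif_line("cn", "Su\u00e1rez Quint\u00e1ns"))
```

The first call prints the value as is; the second prints a "cn::"
line whose base64 text decodes back to the original UTF-8 octets.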

>But how can we match an accented character against a non-accented character?

First, I hope I made it clear that you can assert any
directory string by providing to ldap_search(3) or ldapsearch(1)
a string representation of the filter.

But *should* (cn=Suarez Quintans) match the (transliterated)
value Suárez Quintáns using caseIgnoreMatch or caseExactMatch
matching rules?  Well, that depends on the decomposition of
the strings (in particular, that of á and a).

Now, in OpenLDAP 2.0, decomposition is not implemented so
they won't match.
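
To make the role of decomposition concrete, here is a small Python
sketch using the standard unicodedata module.  It is not OpenLDAP
code and does not describe how any OpenLDAP matching rule is
implemented.  It only shows that á canonically decomposes to a
followed by a combining acute accent, and that an accent-insensitive
comparison would additionally have to drop the combining marks.  The
helper names canonical_equal and strip_marks are hypothetical.

```python
import unicodedata


def canonical_equal(a: str, b: str) -> bool:
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def strip_marks(s: str) -> str:
    """Decompose, then drop combining marks (an assumed, lossy policy)."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# Precomposed U+00E1 decomposes to "a" + U+0301 (combining acute accent).
print(unicodedata.normalize("NFD", "\u00e1") == "a\u0301")        # True

# Precomposed and decomposed spellings are canonically equivalent...
print(canonical_equal("Su\u00e1rez", "Sua\u0301rez"))              # True

# ...but the accented and unaccented names are still different strings.
print(canonical_equal("Su\u00e1rez", "Suarez"))                    # False

# Only after dropping the combining marks do they compare equal.
print(strip_marks("Su\u00e1rez Quint\u00e1ns") == "Suarez Quintans")  # True
```

Without any decomposition step, the octets for á (0xC3 0xA1) simply
differ from the octet for a (0x61), so the values cannot match.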

>I assume that OpenLDAP 2.0 uses the UCData API 
>(http://crl.nmsu.edu/~mleisher/ucdata-doc.html) with canonical
>decomposition for comparison?

No. OpenLDAP HEAD uses UCData and supports decomposition,
2.0 doesn't.