
Re: filter problems



Hallvard B Furuseth wrote:
[filter] says:


6.  String Search Filter Definition
(...)
                 Since RFC 2254 does not clearly define the term
 "string representation" (and in particular does mention that the
 string representation of an LDAP search filter is a string of UTF-8
 encoded ISO 10646-1 characters) implementations SHOULD accept as
 input strings that include invalid UTF-8 octet sequences.


I don't understand this.  RFC2254 does say filters should be UTF-8,
_therefore_ implementations should accept invalid UTF-8?  I would have
thought that therefore they should _not_ accept invalid UTF-8.

Maybe with "invalid UTF-8" you just mean e.g. U+0041 encoded as a
"UTF-8" 2-byte sequence (0xc1 0x81)?  Or do you also mean e.g. a lone
0x80 octet in the middle of some ASCII characters?
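
The two kinds of invalid input mentioned above can be told apart with a
small sketch. This is only an illustration, not anything taken from the
draft: the helper names are hypothetical, and "cn=ab...cd" is a made-up
value. Applying the 2-byte bit layout to 0xc1 0x81 without any overlong
check yields 0x41 (decimal 65), while a strict decoder rejects both that
sequence and a lone 0x80.

```python
def naive_two_byte_decode(octets: bytes) -> int:
    """Apply the 2-byte UTF-8 bit layout without rejecting overlong forms."""
    lead, cont = octets
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


def is_valid_utf8(octets: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the octets."""
    try:
        octets.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


overlong = bytes([0xC1, 0x81])       # overlong 2-byte form of an ASCII character
lone_continuation = b"cn=ab\x80cd"   # lone 0x80 among ASCII characters (made-up value)

print(hex(naive_two_byte_decode(overlong)))   # code point a lenient decoder would see
print(is_valid_utf8(overlong))                # strict decoder: rejected
print(is_valid_utf8(lone_continuation))       # strict decoder: rejected
```

Whether an implementation should tolerate either case is exactly the
question being asked; the sketch only shows that they are different
kinds of malformed input.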

Or maybe it's just that I can't parse the last two lines into a coherent
sentence, and guessed wrong what it should be.  Could you split it into
two sentences or something?

Actually, I think one critical word is missing from the sentence: the word "not" should appear between "does" and "mention." Good catch! Here is a suggested revision:


   Implementations SHOULD accept as input a string that includes
   invalid UTF-8 octet sequences. This is necessary because RFC 2254
   did not clearly define the term "string representation" (and in
   particular did not mention that the string representation of
   an LDAP search filter is a string of UTF-8 encoded ISO 10646-1
   characters).


BTW, I think "RFC 2254 does not..." should be "did not...", since
[Filter] obsoletes RFC 2254.

OK; this is also fixed in the suggested text above.



Another detail:


4. String Search Filter Definition
(...)
 Other characters besides the
       ^^^^^^^^^^
 ones listed above may be escaped using this mechanism, for example,
 non-printing characters.


The first "characters" should be "octets".  That is, if you escape U+00C0,
you say \C3\80, not \C380 or \C0.

Right, but the text in draft-ietf-ldapbis-filter-04.txt does say "octets." Maybe you are quoting from an older draft?
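
To make the octets-versus-characters point concrete, here is a minimal
sketch of per-octet escaping. The function name and the choice to escape
every non-ASCII character are assumptions made for illustration; the
set of characters that must always be escaped (*, (, ), \ and NUL) is
the one from RFC 2254.

```python
def escape_value(text: str, must_escape: str = "*()\\\0") -> str:
    """Escape an assertion value one octet at a time (hypothetical helper).

    Each character that needs escaping is first encoded as UTF-8, and
    every resulting octet becomes a backslash plus two hex digits.
    Escaping all non-ASCII characters is optional; it is done here only
    to show the multi-octet case.
    """
    out = []
    for ch in text:
        if ch in must_escape or ord(ch) > 0x7F:
            out.append("".join(f"\\{octet:02X}" for octet in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)


print(escape_value("\u00C0"))   # \C3\80
print(escape_value("a*b"))      # a\2Ab
```

U+00C0 is two octets in UTF-8, so it produces two escapes, one per
octet, rather than a single escape of the code point.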



...  Though perhaps you should repeat here
that the resulting string must be a valid UTF-8 string; you can't escape
just one octet in a UTF-8 multibyte character but not the others.

Here are the two relevant paragraphs from filter-04:

   This simple escaping mechanism eliminates filter-parsing ambiguities
   and allows any filter that can be represented in LDAP to be
   represented as a NUL-terminated string. Other octets that are part of
   the <normal> set may be escaped using this mechanism, for example,
   non-printing ASCII characters.

   For AssertionValues that contain UTF-8 character data, each octet of
   the character to be escaped is replaced by a backslash and two hex
   digits, which form a single octet in the code of the character.

I think the "each octet" statement covers things fairly well.  Do you agree?
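
For what it is worth, the partial-escaping concern can be sketched as
follows. The helper names are hypothetical and the check is only one
possible reading of the requirement: it treats the filter string as raw
octets, undoes the backslash escapes, and tests both forms with a strict
UTF-8 decoder.

```python
import re

# A backslash followed by exactly two hex digits.
_ESCAPE = re.compile(rb"\\([0-9A-Fa-f]{2})")


def unescape_value(escaped: bytes) -> bytes:
    """Replace each backslash-hex pair with the single octet it denotes."""
    return _ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), escaped)


def is_valid_utf8(octets: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the octets."""
    try:
        octets.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


fully_escaped = b"\\C3\\80"   # both octets of U+00C0 escaped
half_escaped = b"\\C3\x80"    # first octet escaped, second left raw

for candidate in (fully_escaped, half_escaped):
    print(candidate,
          is_valid_utf8(candidate),                   # is the string form valid?
          is_valid_utf8(unescape_value(candidate)))   # is the decoded value valid?
```

In the half-escaped case the decoded value is fine, but the string form
contains a lone 0x80 and so is not itself valid UTF-8. Escaping every
octet of the character, as the "each octet" wording requires, avoids
that.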

Thank you for your detailed comments.

-Mark Smith
 Netscape/AOL