filter problems



[Filter] says:

> 6.  String Search Filter Definition
> (...)
>                   Since RFC 2254 does not clearly define the term
>   "string representation" (and in particular does mention that the
>   string representation of an LDAP search filter is a string of UTF-8
>   encoded ISO 10646-1 characters) implementations SHOULD accept as
>   input strings that include invalid UTF-8 octet sequences.

I don't understand this.  RFC 2254 does say filters should be UTF-8, so
_therefore_ implementations should accept invalid UTF-8?  I would have
thought that therefore they should _not_ accept invalid UTF-8.

Maybe with "invalid UTF-8" you just mean e.g. U+0041 encoded as an
overlong "UTF-8" 2-byte sequence (0xc1 0x81)?  Or do you also mean e.g.
a lone 0x80 octet in the middle of some ASCII characters?
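
To make the two cases concrete, here is a minimal Python sketch that
feeds both octet sequences to a strict UTF-8 decoder.  The helper names
(classify, two_octet_payload) are made up for illustration, and Python's
built-in codec merely stands in for whatever decoder an LDAP
implementation might use.

```python
def classify(octets: bytes) -> str:
    """Return 'valid', or the offset and reason strict decoding fails."""
    try:
        octets.decode("utf-8", errors="strict")
        return "valid"
    except UnicodeDecodeError as exc:
        return f"invalid at offset {exc.start}: {exc.reason}"


def two_octet_payload(lead: int, trail: int) -> int:
    """Code point a lenient decoder would read from a 2-octet sequence.

    Takes the low 5 bits of the lead octet and the low 6 bits of the
    trailing octet, without checking for overlong encodings.
    """
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


# Case 1: a 2-octet sequence whose payload would fit in one octet.
print(classify(b"\xc1\x81"), hex(two_octet_payload(0xC1, 0x81)))

# Case 2: a lone continuation octet between ASCII characters.
print(classify(b"ab\x80cd"))

# For comparison, the well-formed encoding of U+00C0.
print(classify("\u00c0".encode("utf-8")))
```

A strict decoder rejects both of the first two inputs, so the question
is really whether [Filter] wants implementations to be more lenient
than that.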

Or maybe it's just that I can't parse the last two lines into a coherent
sentence, and guessed wrong what it should be.  Could you split it into
two sentences or something?

BTW, I think "RFC 2254 does not..." should be "did not...", since
[Filter] obsoletes RFC 2254.

Another detail:

> 4. String Search Filter Definition
> (...)
>   Other characters besides the
          ^^^^^^^^^^
>   ones listed above may be escaped using this mechanism, for example,
>   non-printing characters.

The first "characters" should be "octets".  That is, if you escape
U+00C0, you say \C3\80, not \C380 or \C0.  Though perhaps you should
repeat here that the resulting string must be a valid UTF-8 string; you
can't escape just one octet in a UTF-8 multibyte character but not the
others.
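
As an illustration of escaping octets rather than characters, here is a
rough Python sketch.  The function name escape_filter_value and the
choice of which octets to escape (the special characters *, (, ), \ and
NUL, plus control and non-ASCII octets) are assumptions made for this
example, not something taken from [Filter].

```python
# Octets assumed to need escaping: '*', '(', ')', '\' and NUL.
SPECIAL = frozenset(b"*()\\\x00")


def escape_filter_value(value: str) -> str:
    """Escape a value octet by octet, working on its UTF-8 encoding.

    Every octet of a multibyte character is escaped, so no character is
    left half escaped and half literal.
    """
    out = []
    for octet in value.encode("utf-8"):
        if octet in SPECIAL or octet < 0x20 or octet >= 0x7F:
            out.append(f"\\{octet:02X}")   # backslash + two hex digits
        else:
            out.append(chr(octet))         # printable ASCII stays as-is
    return "".join(out)


print(escape_filter_value("\u00c0"))   # U+00C0 is C3 80 in UTF-8
print(escape_filter_value("a*b"))
```

With this approach U+00C0 comes out as \C3\80, one escape per octet.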

-- 
Hallvard