Re: UTF8 case insensitive matching
I have just started to follow this newsgroup, but I am particularly interested in this thread as it relates to something that I would like to do. Currently, the dn_validate()/dn_normalize() routines normalize DNs by converting (at least some) lower case characters to upper case and then removing excess whitespace. However, given the many different ways that RFC 2253 allows for a single DN to be specified, a lot more work needs to be done to ensure that a DN is always normalized the same way. As an example, below are just 3 possible ways to express a single DN:
cn=David Cooper,ou=NIST,o=U.S. Government,c=US
Ideally, I would like to fix the dn_validate() function so that all three of these strings normalize to the same result. While I don't think that I'll be able to fix things so that any arbitrary DN can be normalized, I would like to get as close as possible. One problem that I have, though, is that since DNs must currently be normalized in place, I would need to ensure that the normalized DN is no longer than any other possible representation of that DN. Accomplishing this is particularly complicated by the possible use of quoted attribute values. For example, if I always normalize to a non-quoted representation, then
o="Smith, Jones, and Jackson, Inc."
would be normalized as
o=Smith\, Jones\, and Jackson\, Inc.
Since commas must be escaped in the non-quoted representation but not in the quoted one, the normalized DN would be longer than the original string. Similarly, if I always normalize to a quoted representation, the added quotes could make the normalized version longer than the original string. The only option I have at the moment is to try both methods and choose the shorter of the two, but this isn't very clean and seems like wasted effort.
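To make the length trade-off concrete, here is a minimal sketch that computes how long a raw attribute value would be in each representation. The names escaped_len() and quoted_len() are hypothetical, not existing OpenLDAP functions, and the set of characters that need a backslash is an assumption based on the "special" characters of RFC 2253; leading/trailing spaces and hex escapes are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed escape set for the unquoted form (RFC 2253 specials, '"' and '\'). */
static int needs_escape_unquoted(char c)
{
    return c != '\0' && strchr(",=+<>#;\"\\", c) != NULL;
}

/* Length of the value when written in the backslash-escaped form. */
size_t escaped_len(const char *raw)
{
    size_t n = 0;

    assert(raw != NULL);
    for (; *raw != '\0'; raw++)
        n += needs_escape_unquoted(*raw) ? 2 : 1;
    return n;
}

/* Length of the value when written in the quoted form:
 * two quote characters, plus a backslash before each '"' or '\'. */
size_t quoted_len(const char *raw)
{
    size_t n = 2;

    assert(raw != NULL);
    for (; *raw != '\0'; raw++)
        n += (*raw == '"' || *raw == '\\') ? 2 : 1;
    return n;
}
```

For the value Smith, Jones, and Jackson, Inc. the escaped form is one byte longer than the quoted form, while for a value such as NIST the quoted form is two bytes longer, so neither representation is always the shorter one.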
If an alternative solution (i.e., one that allows normalization to increase the length of a DN) will be available, then I will abandon my current approach and wait until a cleaner solution can be implemented.
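One possible shape for such an alternative is a routine that returns a newly allocated string instead of rewriting its argument in place. The sketch below is only an illustration under that assumption: value_to_escaped() is a hypothetical name, it handles a single attribute value rather than a whole DN, and it uses the same assumed escape set as RFC 2253's special characters.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Assumed escape set for the unquoted form (RFC 2253 specials, '"' and '\'). */
static int is_special(char c)
{
    return c != '\0' && strchr(",=+<>#;\"\\", c) != NULL;
}

/*
 * Hypothetical helper: return a newly allocated, backslash-escaped copy of
 * an attribute value given either as "quoted" or already unquoted.
 * The caller frees the result. Returns NULL if allocation fails.
 */
char *value_to_escaped(const char *in)
{
    size_t len;
    const char *end;
    char *out, *p;

    assert(in != NULL);
    len = strlen(in);

    /* Not quoted: return a plain copy. */
    if (len < 2 || in[0] != '"' || in[len - 1] != '"') {
        out = malloc(len + 1);
        if (out != NULL)
            memcpy(out, in, len + 1);
        return out;
    }

    /* Worst case: every character inside the quotes gains a backslash. */
    out = malloc(2 * (len - 2) + 1);
    if (out == NULL)
        return NULL;

    p = out;
    end = in + len - 1;                 /* the closing quote */
    for (in++; in < end; in++) {
        if (*in == '\\' && in + 1 < end) {
            /* Existing escape pair: copy both characters unchanged. */
            *p++ = *in++;
            *p++ = *in;
        } else {
            if (is_special(*in))
                *p++ = '\\';
            *p++ = *in;
        }
    }
    *p = '\0';
    return out;
}
```

Because the result lives in its own buffer, the output may be longer than the input, which removes the need to try both representations and keep the shorter one.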
On Wed, 25 Oct 2000 at 07:07:36PM +0200, Stig Venås wrote:
>On Wed, Oct 25, 2000 at 08:32:57AM -0700, Kurt D. Zeilenga wrote:
> > At 04:31 PM 10/25/00 +0200, Stig Venås wrote:
> > >code would have to be changed then. An easy but incorrect way
> > >out could be to simply not change casing for a character if
> > >the size is different. It would still be better than today's
> > >situation.
> > We can certainly cheat in the short term....
> It's very tempting. But some people will need to recreate or at
> least reindex their database each time we change the normalization,
> right? So it shouldn't change too many times. It's a lot of work to
> do it properly though, and I would like to have something people can
> use soon.
> > Long term, we need to use the dnValidate()/dnNormalize()
> > semantics instead of the dn_validate()/dn_normalize() semantics.
> > In the mid term, to avoid the ripple effect of the
> > dn_validate()/dn_normalize() change, I suggest that temporary
> > versions of dn_validate()/dn_normalize() be implemented which
> > use dnValidate()/dnNormalize() to do the work but provide old
> > semantics otherwise.
> I don't get this. dnValidate() and dnNormalize() use dn_validate()/
> dn_normalize() today. If dnNormalize() alters the length when
> normalizing, it cannot be used by dn_normalize() to do the work, not
> without changing the semantics. Or am I missing something?
> I see two possibilities:
> 1. I cheat and add simplistic UTF8 code to dn_validate()/dn_normalize().
> 2. I leave dn_validate()/dn_normalize() as they are and implement new
>    versions of dnValidate()/dnNormalize() with more correct UTF8 code,
>    allowing for the possibility that the size of the dn can increase.
>    Then we must change a lot of surrounding code so that it uses
>    dnValidate()/dnNormalize() instead of dn_validate()/dn_normalize().
> I have no illusions of implementing 100% perfect normalization code