
Re: [ldapext] UTF-8 full support in LDIF / LDIF v2




On Jun 16, 2009, at 8:40 AM, Yves Dorfsman wrote:


Thanks Kurt, sorry for asking for an answer in another email; it seems we were writing at the same time.


Kurt Zeilenga wrote:

- the directory is broken
- you export to LDIF
- compare this LDIF with a previous one from when the directory was working.
You don't need UTF-8 for this. A simple text diff tool will tell you that the base64 differs.

True, diff will tell you that they are different, and where, but then you need to decode the base64 to find what the text is in the two files, to help you understand why they are different.

You have to compare the values octet by octet to understand why they are different.


If there were no base64 encryption, then you would know right away, in most cases, making it a much faster process.

s/encryption/encoding/

I disagree. Display of UTF-8 as text will hide differences that base64 or octet-by-octet comparison won't hide.

But now you assume you'll be able to read them. This is a bad assumption.

I still don't understand why you are saying this. Can you give precise examples?

Consider a combining character such as diaeresis (U+0308). It can only be combined with certain base characters. It cannot be combined with arbitrary characters. If one has an attribute intended to hold a single Unicode code point, and stores a diaeresis as a value of that attribute, then in the UTF-8 LDIF one would end up with a diaeresis following a colon or a space, neither of which is valid text.
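
To make this concrete, here is a rough Python sketch (the attribute name "cn" and the value are just illustrative, using the standard base64 and unicodedata modules) of how such a value appears under base64-encoded LDIFv1 versus a hypothetical raw-UTF-8 LDIF:

    import base64
    import unicodedata

    value = "\u0308"   # COMBINING DIAERESIS, with no base character

    # LDIFv1 today: a value that is not a SAFE-STRING is base64 encoded,
    # so the file itself stays plain ASCII and unambiguous.
    print("cn:: " + base64.b64encode(value.encode("utf-8")).decode("ascii"))
    # -> cn:: zIg=

    # Hypothetical raw-UTF-8 LDIF: the combining mark lands directly after
    # the separator space, so the value starts with a bare combining character.
    raw_line = "cn: " + value
    first = raw_line[len("cn: "):][0]
    print(unicodedata.combining(first) != 0)   # True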

This is just the tip of the iceberg. There are even issues where LDAP values that are well-formed text by themselves, when composed into a file, will result in the file not being well-formed text, such as values which utilize bi-directional text.

IDNA went through all of this. They found that they had to place significant restrictions on Unicode domain components to ensure that a domain name was well-formed Unicode text.

You have not suggested placing similar restrictions on LDIF, but simply removing the ASCII restriction.


A simple diff tool might show two DIFFERENT values the same way, leading the human to believe there is no difference when there is a significant difference.

So, for example, one file contains U+2026 (ellipsis, "…") while the other contains three U+002E (three full stop characters, "..."), and the issue is duplication in Unicode (http://en.wikipedia.org/wiki/Duplicate_characters_in_Unicode)?

Well, that's one case. I had a few cases in mind: character equivalences and look-alikes (but non-equivalences).
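
As a rough Python sketch (my own examples, using the standard unicodedata module) of both kinds of cases:

    import unicodedata

    ellipsis = "\u2026"    # HORIZONTAL ELLIPSIS
    dots = "..."           # three FULL STOP characters
    micro = "\u00b5"       # MICRO SIGN
    mu = "\u03bc"          # GREEK SMALL LETTER MU

    print(ellipsis == dots)                                 # False: different code points
    print(unicodedata.normalize("NFKC", ellipsis) == dots)  # True: compatibility equivalent
    print(micro == mu)                                      # False: look-alikes
    print(unicodedata.normalize("NFKC", micro) == mu)       # True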


Well, what do we do today when diff shows us that there is a difference in two ASCII files, but our eyes can't see it? We hex dump the offending line(s), and go "aha, I've got <CR> here, but <CR><LF> there".

But there are more subtleties involved. There are cases where the diff will show a difference, your eye will think it sees the difference, but the actual difference will be hidden.

This is why LDIF difference tools are needed and why they have been written. Removing the ASCII restriction won't make their job any easier. It will just add another encoding option.
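
As a rough sketch (not any particular existing tool) of what such a tool has to do anyway, regardless of which encoding each file chose:

    import base64

    def attrval_to_octets(line):
        """Reduce one LDIF attrval line to (attribute, raw value octets)."""
        if ":: " in line:                        # base64-encoded value
            attr, b64 = line.split(":: ", 1)
            return attr, base64.b64decode(b64)
        attr, text = line.split(": ", 1)         # plain SAFE-STRING value
        return attr, text.encode("utf-8")

    old = "description:: c21va2luZw=="           # base64 of "smoking"
    new = "description: smoking"
    print(attrval_to_octets(old) == attrval_to_octets(new))   # True: same octets, different encodings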

On the other hand, what's the percentage of the time you diff files and run into this problem?

Well, where such problems are less likely, that makes the hidden difference problem even worse (as it won't be an expected issue). In some situations, the problem will be quite likely.

It was likely enough in IDNA for restrictions (*) to be placed on code point sequences used in domain names. (* Many of the IDNA Unicode restrictions were to address other issues, but there are IDNA restrictions in place to ensure that domain components and domain name sequences of Unicode code points are well-formed text.)


Having those base64 encoded will make the fact that they are different more obvious, but won't help us when we're trying to understand why they are different; we will still have to go base64 --> UTF-8 --> hex values.

While I don't disagree that to understand the differences, one has to consider the value octet by octet, this in no way implies that we need another way to encode those octets.
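
For concreteness, that chain for a single value looks roughly like this in Python (the value here is made up):

    import base64

    blob = "TcO8bmNoZW4="                         # base64 value copied from an LDIF line
    octets = base64.b64decode(blob)               # the octets actually stored
    text = octets.decode("utf-8")                 # interpret them as UTF-8
    print(" ".join(f"{b:02x}" for b in octets))   # 4d c3 bc 6e 63 68 65 6e
    print([f"U+{ord(c):04X}" for c in text])      # ['U+004D', 'U+00FC', 'U+006E', ...]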

Data can be different because it comes from different sources, in which case it is likely that people will use different Unicode duplicates to express the same thing; but data can also be different because it was changed, which is the case I am more concerned with (backups of LDAP servers, etc.), and in this case I doubt somebody will change the mu symbol to the micro symbol or something similar.

Ah, but someone might apply normalization to the data. This may well be worthy of detection.

It might happen for malicious reasons, but I expect that to be a small percentage.

Could be. But it could happen for non-malicious reasons. Things could be "broken" because someone re-normalized the data using the wrong normalization function. Visual differencing won't detect this sort of change, as normalization doesn't change the display of the text (it just changes which code points are used to represent the text).
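
For example, a quick Python sketch (mine, using the standard unicodedata module) of how normalization changes the code points but not the rendered text:

    import unicodedata

    composed = unicodedata.normalize("NFC", "u\u0308")    # U+00FC
    decomposed = unicodedata.normalize("NFD", "\u00fc")   # U+0075 U+0308
    print(composed, decomposed)                           # both render as "ü"
    print(composed == decomposed)                         # False: different code points
    print(composed.encode("utf-8").hex(),
          decomposed.encode("utf-8").hex())               # c3bc vs 75cc88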

The key phrase here is "Unicode text". And most such display tools not only require "well-formed" text, but often cannot display all "well-formed" text. But removing the ASCII restriction does not make an LDIF file "Unicode text". It makes it a series of Unicode code points, and hence display of it as text will be quite problematic.

I don't understand what you are saying here. Could you give an example of a problematic case, or a link to an explanation?

See the bare combining character illustration above.

Why is a series of Unicode code points not text?

For a series of Unicode code points to be text, certain restrictions must be maintained. For instance, U+0308 by itself is not text. It has to be combined with an appropriate base character to be text.

What do you mean by "well-formed text", or actually by text that is not well-formed?

Well, even where a sequence of code points represents a sequence of characters and control information, that sequence itself may not be well-formed text.

It is also important to note that a sequence composed of two sequences of code points, each representing well-formed text, may not itself be well-formed text. That is, WF(A + B) may be false even where WF(A) and WF(B) are true. BIDI is a case where this often happens.


And even if it's displayable, you have the problem that two values might display in the same way, making visual diff'ing problematic.

This is the issue above (Unicode duplication), right?

It's a problem of equivalences, look-alikes, etc.

>> Other case: People have mentioned scripts that build LDIF files from
>> other sources, and have mentioned that encoding the values in base64 is
>> an overhead they could do without.
>
> While base64 data is an additional step, it's an additional step that is
> well supported today. If we lift the ASCII restriction now, we'll have
> some implementations that do support it and some that don't, and that
> will cause interop problems. I cannot support inducing such interop
> problems without a strong justification.

Adding a new version of the standard does not remove the old one. You can tell people that you only work with version 1 LDIF files. The advantage of having a standard is that the tools will slowly adopt it and will be able to deal with it. If we don't create a standard, we run into the situation where everybody creates their own format, and creates their own tools to transform their format into the current LDIF one.

When LDIFv1 was introduced, it brought with it interop problems with U-Mich LDIF (v0). The standards community, however, considered the standardization of LDIF, and the changes from U-Mich LDIF, well justified. But to this day, we still suffer from some interop problems tied specifically to differences between U-Mich LDIF and v1.

If I remember right, Jim mentioned that he already uses his version of extended LDIF.

There are actually a number of extended versions of LDIF in use today... and a number of interop problems exist because of this. Standardization won't necessarily remove the desire to have extended versions of LDIF, as the standard LDIF is never going to meet everyone's desires for additional capabilities.

In my opinion, in revising Standards Track LDIF, we should focus on changes that improve data interchange functionality and interoperability. We should dismiss changes that are outside of the primary goal of having an interoperable interchange format; this includes dismissing changes intended to improve the "human readability" of LDIF.

-- Kurt

_______________________________________________
Ldapext mailing list
Ldapext@ietf.org
https://www.ietf.org/mailman/listinfo/ldapext