
Re: [ldapext] UTF-8 full support in LDIF / LDIF v2




On Jun 16, 2009, at 8:40 AM, Yves Dorfsman wrote:


Thanks Kurt, sorry for asking for an answer in another email; it seems we were writing at the same time.


Kurt Zeilenga wrote:

- the directory is broken
- you export to LDIF
- compare this LDIF with a previous one from when the directory was working.
You don't need UTF-8 for this. A simple text diff tool will tell you that the base64 differs.

True, diff will tell you that they are different, and where, but then you need to decode the base64 to find what the text is in the two files, to help you understand why they are different.

You have to compare the values octet by octet to understand why they are different.


If there were no base64 encryption, then you would know right away, in most cases, making it a much faster process.

s/encryption/encoding/

I disagree. Display of UTF-8 as text will hide differences that base64 or octet-by-octet comparison won't hide.

But now you assume you'll be able to read them. This is a bad assumption.

I still don't understand why you are saying this. Can you give precise examples?

Consider a combining character such as diaeresis (U+0308). It can only be combined with certain base characters. It cannot be combined with arbitrary characters. If one has an attribute intended to hold a single Unicode code point, and stores a diaeresis as a value of that attribute, then in the UTF-8 LDIF one would end up with a diaeresis following a colon or a space, neither of which is valid text.
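
To make this concrete, here is a rough Python sketch (the attribute name "cn" and the value are just illustrative, using the standard base64 and unicodedata modules) of how such a value appears under base64-encoded LDIFv1 versus a hypothetical raw-UTF-8 LDIF:

    import base64
    import unicodedata

    value = "\u0308"   # COMBINING DIAERESIS, with no base character

    # LDIFv1 today: a value that is not a SAFE-STRING is base64 encoded,
    # so the file itself stays plain ASCII and unambiguous.
    print("cn:: " + base64.b64encode(value.encode("utf-8")).decode("ascii"))
    # -> cn:: zIg=

    # Hypothetical raw-UTF-8 LDIF: the combining mark lands directly after
    # the separator space, so the value starts with a bare combining character.
    raw_line = "cn: " + value
    first = raw_line[len("cn: "):][0]
    print(unicodedata.combining(first) != 0)   # True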

This is just the tip of the iceberg. There are even issues where LDAP values that are well-formed text by themselves, when composed into a file, will result in the file not being well-formed text, such as values which utilize bi-directional text.

IDNA went through all of this. They found that they had to place significant restrictions on Unicode domain components to ensure that a domain name was well-formed Unicode text.

You have not suggested placing similar restrictions on LDIF, but simply removing the ASCII restriction.


A simple diff tool might show two DIFFERENT values the same way, leading the human to believe there is no difference when there is a significant difference.

So, for example, one file contains U+2026 (ellipsis, "…") while the other contains three U+002E (three full stop characters, "..."), and the issue is duplication in Unicode (http://en.wikipedia.org/wiki/Duplicate_characters_in_Unicode)?

Well, that's one case. I had a few cases in mind: character equivalences and look-alikes (but non-equivalences).
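
As a rough Python sketch (my own examples, using the standard unicodedata module) of both kinds of cases:

    import unicodedata

    ellipsis = "\u2026"    # HORIZONTAL ELLIPSIS
    dots = "..."           # three FULL STOP characters
    micro = "\u00b5"       # MICRO SIGN
    mu = "\u03bc"          # GREEK SMALL LETTER MU

    print(ellipsis == dots)                                 # False: different code points
    print(unicodedata.normalize("NFKC", ellipsis) == dots)  # True: compatibility equivalent
    print(micro == mu)                                      # False: look-alikes
    print(unicodedata.normalize("NFKC", micro) == mu)       # True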


Well, what do we do today when diff shows us that there is a difference in two ASCII files, but our eyes can't see it? We hex dump the offending line(s), and go "aha, I've got <CR> here, but <CR><LF> there".

But there are more subtleties involved. There are cases where the diff will show a difference, your eye will think it sees the difference, but the actual difference will be hidden.

This is why LDIF difference tools are needed and why they have been written. Removing the ASCII restriction won't make their job any easier. It will just add another encoding option.
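
As a rough sketch (not any particular existing tool) of what such a tool has to do anyway, regardless of which encoding each file chose:

    import base64

    def attrval_to_octets(line):
        """Reduce one LDIF attrval line to (attribute, raw value octets)."""
        if ":: " in line:                        # base64-encoded value
            attr, b64 = line.split(":: ", 1)
            return attr, base64.b64decode(b64)
        attr, text = line.split(": ", 1)         # plain SAFE-STRING value
        return attr, text.encode("utf-8")

    old = "description:: c21va2luZw=="           # base64 of "smoking"
    new = "description: smoking"
    print(attrval_to_octets(old) == attrval_to_octets(new))   # True: same octets, different encodings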

On the other hand, what's the percentage of the time you diff files and run into this problem?

Well, where such problems are less likely, that makes the hidden difference problem even worse (as it won't be an expected issue). In some situations, the problem will be quite likely.

It was likely enough in IDNA for restrictions (*) to be placed on code point sequences used in domain names. (* Many of the IDNA Unicode restrictions were to address other issues, but there are IDNA restrictions in place to ensure that domain components and domain name sequences of Unicode code points are well-formed text.)


Having those base64 encoded will make the fact that they are different more obvious, but won't help us when we're trying to understand why they are different; we will still have to go base64 --> UTF-8 --> hex values.

While I don't disagree that to understand the differences, one has to consider the value octet by octet, this in no way implies that we need another way to encode those octets.
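
For concreteness, that chain for a single value looks roughly like this in Python (the value here is made up):

    import base64

    blob = "TcO8bmNoZW4="                         # base64 value copied from an LDIF line
    octets = base64.b64decode(blob)               # the octets actually stored
    text = octets.decode("utf-8")                 # interpret them as UTF-8
    print(" ".join(f"{b:02x}" for b in octets))   # 4d c3 bc 6e 63 68 65 6e
    print([f"U+{ord(c):04X}" for c in text])      # ['U+004D', 'U+00FC', 'U+006E', ...]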

Data can be different because it comes from different sources, in which case it is likely that people will use different Unicode duplicates to express the same thing; but data can also be different because it was changed, which is the case I am more concerned with (backups of LDAP servers, etc.), and in this case I doubt somebody will change the mu symbol to the micro symbol or something similar.

Ah, but someone might apply normalization to the data. This may well be worthy of detection.

It might happen for malicious reasons, but I expect that to be a small percentage.

Could be. But it could happen for non-malicious reasons. Things could be "broken" because someone re-normalized the data using the wrong normalization function. Visual differencing won't detect this sort of change, as normalization doesn't change the display of the text (it just changes which code points are used to represent the text).
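
For example, a quick Python sketch (mine, using the standard unicodedata module) of how normalization changes the code points but not the rendered text:

    import unicodedata

    composed = unicodedata.normalize("NFC", "u\u0308")    # U+00FC
    decomposed = unicodedata.normalize("NFD", "\u00fc")   # U+0075 U+0308
    print(composed, decomposed)                           # both render as "ü"
    print(composed == decomposed)                         # False: different code points
    print(composed.encode("utf-8").hex(),
          decomposed.encode("utf-8").hex())               # c3bc vs 75cc88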

The key phrase here is "Unicode text". And most such display tools not only require "well-formed" text, but often cannot display all "well-formed" text. But removing the ASCII restriction does not make an LDIF file "Unicode text". It makes it a series of Unicode code points, and hence display of it as text will be quite problematic.

I don't understand what you are saying here. Could you give an example of a problematic case, or a link to an explanation?

See the bare combining character illustration above.

Why is a series of Unicode code points not text?

For a series of Unicode code points to be text, certain restrictions must be maintained. For instance, U+0308 by itself is not text. It has to be combined with an appropriate base character to be text.

What do you mean by "well-formed text", or actually by text that is not well-formed?

Well, even where a sequence of code points represents a sequence of characters and control information, that sequence itself may not be well-formed text.

It is also important to note that a sequence composed of two sequences of code points, each representing well-formed text, may not itself be well-formed text. That is, WF(A + B) may be false even where WF(A) and WF(B) are true. BIDI is a case where this often happens.


And even if it's displayable, you have the problem that two values might display in the same way, making visual diff'ing problematic.

This is the issue above (Unicode duplication), right?

It's a problem of equivalences, look-alikes, etc.

>> Other case: People have mentioned scripts that build LDIF files from
>> other sources, and have mentioned that encoding the values in base64 is
>> an overhead they could do without.
>
> While base64 data is an additional step, it's an additional step that is
> well supported today. If we lift the ASCII restriction now, we'll have
> some implementations that do support it and some that don't, and that
> will cause interop problems. I cannot support inducing such interop
> problems without a strong justification.

Adding a new version of the standard does not remove the old one. You can tell people that you only work with version 1 LDIF files. The advantage of having a standard is that the tools will slowly adopt it and will be able to deal with it. If we don't create a standard, we run into the situation where everybody creates their own format, and creates their own tools to transform their format into the current LDIF one.

When LDIFv1 was introduced, it brought with it interop problems with U-Mich LDIF (v0). The standards community, however, considered the standardization of LDIF, and the changes from U-Mich LDIF, well justified. But to this day, we still suffer from some interop problems tied specifically to differences between U-Mich LDIF and v1.

If I remember right, Jim mentioned that he already uses his version of extended LDIF.

There are actually a number of extended versions of LDIF in use today... and a number of interop problems exist because of this. Standardization won't necessarily remove the desire to have extended versions of LDIF, as the standard LDIF is never going to meet everyone's desires for additional capabilities.

In my opinion, in revising Standards Track LDIF, we should focus on changes that improve data interchange functionality and interoperability. We should dismiss changes that are outside of the primary goal of having an interoperable interchange format; this includes dismissing changes intended to improve the "human readability" of LDIF.

-- Kurt

_______________________________________________
Ldapext mailing list
Ldapext@ietf.org
https://www.ietf.org/mailman/listinfo/ldapext