
Re: [ldapext] UTF-8 full support in LDIF / LDIF v2




On Jun 10, 2009, at 10:34 PM, Yves Dorfsman wrote:

Kurt Zeilenga wrote:
There are a number of problems with it. Personally, I think what Steven already offered (and likely implemented) is better, though I am

My problem with Steven's solution is that it is half LDIF, half XML. As I have mentioned earlier, I think XML has its place, and maybe DSML should be fixed or re-invented, but for other applications, I find the simplicity of LDIF an advantage; unfortunately, having to base64 encode anything that's not 7-bit ASCII takes away some of its simplicity.

concerned about line separators. As Howard's comments kind of suggest, when you have a value which is multi-lined,

I have never run into the situation where I needed a multi-line value in an LDAP directory and was surprised by the need, but Steven brought this up earlier in the thread and said that he has a real-world need for it, and that the lack of a syntax for it in my proposal for an updated LDIF format was an issue.

The problem is that which line separators to use is syntax specific (or possibly attribute value convention specific). For instance, an LDAP syntax could say multiple lines are to be separated by a particular set of code points (such as '$'), or it could simply be a convention that an attribute uses a particular set of characters.

To convert an LDIF-specific line separator to an attribute value line separator requires not only knowledge of the LDAP schema, but knowledge of the attribute value conventions not expressed in the LDAP schema.

LDIF however was designed to allow a mechanical conversion of LDIF to LDAP PDUs without such knowledge. Requiring implementations to have additional knowledge is quite problematic.
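As a rough illustration of why this knowledge is needed, here is a minimal Python sketch. It assumes the Postal Address syntax of RFC 4517, where lines are joined with '$' and literal '$' and '\' are escaped as "\24" and "\5C". The table LINE_JOINERS and the function join_lines are hypothetical names invented for this example; they are not part of any LDIF or LDAP library.

```python
def to_postal_address(lines: list[str]) -> str:
    """Join lines per the RFC 4517 Postal Address syntax."""
    def escape(line: str) -> str:
        # Escape backslash first so the escapes added for '$' are not re-escaped.
        return line.replace("\\", "\\5C").replace("$", "\\24")
    return "$".join(escape(line) for line in lines)

# Hypothetical per-attribute table: this is exactly the extra knowledge
# that a purely mechanical LDIF-to-LDAP conversion does not have.
LINE_JOINERS = {"postaladdress": to_postal_address}

def join_lines(attr: str, lines: list[str]) -> str:
    try:
        joiner = LINE_JOINERS[attr.lower()]
    except KeyError:
        raise ValueError(f"no known line convention for {attr!r}") from None
    return joiner(lines)

print(join_lines("postalAddress", ["1234 Main St.", "Anytown, CA 12345", "USA"]))
```

For any attribute missing from the table, the converter has no way to decide which characters should separate the lines, which is the difficulty described above.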


The problem with your proposal, and Steven's, is that LDIF line separators and value line separators are one and the same thing. While this might be the case occasionally, it cannot be expected to be generally the case.

On the contrary, both Steven's solution and mine separate the lines but do not impose a line separator. Steven delimits his line with the <item></item> syntax,

I think you confuse XML elements with line separators in XML data. Two very different things.

Steven's proposal represents line separators in the XML data using the <SEP> production.

while I let the user choose any line separator out of the half dozen that have been used throughout the history of computing.

The problem here is how line separators in LDIF relate to line separators in the value.

Your approach assumes that whatever line separator the user chooses to use in the LDIF file is valid per the LDAP value syntax and any attribute type specific restrictions.


Our syntaxes are clear enough to let the import process know that those are separate lines, and the import process or the LDAP server can choose whichever line separator it wants.

That requires LDAP schema and attribute type specific restrictions knowledge.

Making the line separator part of the data will create cross-platform issues.

Yes, but this is what your proposal seems to do.  (See below).

The LDAP server or actually the LDAP client should choose which line separator to use for its context/platform.

Today, LDIF line separators (<SEP>) are not part of the LDAP value.

That is,

foo: X
 Y

is equivalent to:

foo: XY

That is, the LDAP value is merely wrapped over multiple LDIF lines.
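A minimal Python sketch of this unfolding rule, assuming the RFC 2849 convention that a line starting with a single space continues the previous line and that <SEP> is either CR LF or LF. The function name unfold_ldif is made up for this illustration.

```python
def unfold_ldif(text: str) -> list[str]:
    """Join RFC 2849 continuation lines into logical lines."""
    logical: list[str] = []
    # Treat both CR LF and LF as <SEP>.
    for line in text.replace("\r\n", "\n").split("\n"):
        if line.startswith(" ") and logical:
            # Drop the single leading space; the <SEP> itself is discarded.
            logical[-1] += line[1:]
        else:
            logical.append(line)
    return logical

print(unfold_ldif("foo: X\n Y"))    # ['foo: XY']
print(unfold_ldif("foo: X\r\n Y"))  # ['foo: XY']
```

Whichever separator the file uses, it never reaches the attribute value.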

Now maybe you meant:

foo:<<EOT
X
Y
EOT

to also be equivalent to:

foo: XY

I don't see any value in offering yet another way to line wrap an LDAP value.

I took your proposal as representing foo attribute value "X<SEP>Y" where <SEP> was the sequence of characters used in the LDIF to separate X and Y. This is problematic.




Adding UTF-8 support does not appear to be in support of improving LDIF as a proper interchange format. It seems to be driven by other goals, such as trying to make LDIF files displayable.

Yes and no. My main reason for pushing this is diffing.

Diffing requires knowledge of LDAP schema. One might store "foo" and get back "FOO" (or any other equivalent value) [See LDAP's data preservation requirements].
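To illustrate, here is a small Python sketch of a schema-aware comparison. The NORMALIZERS table is a hypothetical, hand-written stand-in for real schema knowledge, and its "cn" entry only roughly approximates caseIgnoreMatch (the actual rule uses the string preparation of RFC 4518).

```python
from typing import Callable

# Hypothetical stand-in for schema knowledge: attribute name -> normalizer.
NORMALIZERS: dict[str, Callable[[str], str]] = {
    # Rough approximation of caseIgnoreMatch: fold case, collapse whitespace.
    "cn": lambda v: " ".join(v.split()).casefold(),
}

def values_equivalent(attr: str, a: str, b: str) -> bool:
    """Compare two values using the attribute's normalizer, if one is known."""
    normalize = NORMALIZERS.get(attr.lower(), lambda v: v)
    return normalize(a) == normalize(b)

print(values_equivalent("cn", "foo", "FOO"))            # True
print(values_equivalent("userPassword", "foo", "FOO"))  # False
```

A plain textual diff sees "foo" and "FOO" as different in both cases; only the per-attribute rule can tell whether the difference matters.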

You run into a problem and you want to diff the original and the problematic LDIF export of your directory. Having half of your LDIF file base64 encoded makes it a lot more difficult to pinpoint the problem.

As Michael noted, there exist LDIF diffing tools (most of which are likely not schema aware, and hence show equivalent values as being different).

If you are right, that LDIF is purely for exchanging information between applications, never to be looked at by humans, then why is the current version so human friendly ?

I never claimed LDIF would not be looked at by humans.

I have stated that LDIF, with ASCII restrictions, already suffers from some display/editing issues, namely due to line separator issues. Lifting the ASCII restriction will make these matters far worse. (Line separators are the tip of the displayability/editability iceberg.)




I'm not convinced that removing the ASCII restrictions will be a good thing. Not only do I doubt it will have a net positive effect on the displayability of LDIF for those who have a displayability goal (I don't have this goal), I think it will have a net negative impact on interoperability and user confusion, such as when the user creates a file using one Unicode normalization form, but is trying to set values which require a different Unicode normalization form.
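A short Python sketch of the normalization concern, using the standard unicodedata module. The same visible character, "é", has different UTF-8 byte sequences depending on the normalization form an editor happens to write; the choice of NFC and NFD here is only an example.

```python
import unicodedata

composed = unicodedata.normalize("NFC", "e\u0301")   # single code point U+00E9
decomposed = unicodedata.normalize("NFD", "\u00e9")  # 'e' followed by U+0301

print(composed == decomposed)      # False
print(composed.encode("utf-8"))    # b'\xc3\xa9'
print(decomposed.encode("utf-8"))  # b'e\xcc\x81'
```

Both strings display identically, so a user editing raw UTF-8 in an LDIF file may not notice which form is being sent to the server.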

How so ?
In the current version, you have to encode your Unicode to UTF-8, and then encode it again to base64. With my proposal, you would get the exact same UTF-8 strings as you do today, but they would not be (or would not have to be) encoded in base64.
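For reference, the two-step encoding used today can be sketched in Python as follows. The helper names needs_base64 and ldif_attrval are invented for this illustration, and the check is a simplified reading of the RFC 2849 SAFE-STRING rules (it ignores the recommendation about trailing spaces).

```python
import base64

def needs_base64(raw: bytes) -> bool:
    """Simplified RFC 2849 SAFE-STRING check."""
    if not raw:
        return False
    if raw[0:1] in (b" ", b":", b"<"):
        return True
    # NUL, LF, CR and anything outside 7-bit ASCII are not safe.
    return any(b > 127 or b in (0, 10, 13) for b in raw)

def ldif_attrval(attr: str, value: str) -> str:
    raw = value.encode("utf-8")                  # step 1: Unicode -> UTF-8
    if needs_base64(raw):
        encoded = base64.b64encode(raw).decode("ascii")  # step 2: UTF-8 -> base64
        return f"{attr}:: {encoded}"
    return f"{attr}: {value}"

print(ldif_attrval("cn", "Jose"))  # cn: Jose
print(ldif_attrval("cn", "José"))  # cn:: Sm9zw6k=
```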

I see two kinds of problems.

1) This would result in LDIF files which programs designed to display UTF-8 encoded Unicode text will not be able to display. There is a user expectation that LDIF files be displayable. With the current LDIF format, we do have some display issues (e.g., line separators), but they are limited. If we remove the ASCII restrictions, we'll run into a wide range of display issues.

2) Today we have some separation between (non-ASCII) Unicode LDAP attribute values and their LDIF representation. This separation, I think, has some value in that it instills the idea that LDIF syntactic requirements and LDAP attribute value syntax requirements are independent of each other. Removing this separation, I think, will lead to user confusion.

This is not a rebuttal of your argument, I am truly interested in understanding what you mean here (in the same way I was glad somebody brought up the issue of Right To Left characters, as I had not thought about it). Maybe it is a problem that we can address ?

Keep what separation we do have between LDIF representation of an attribute value and the LDAP syntax for the attribute value.



if so, should we help Steven with the xmled RFC ?
What Steven and Andrew have done is define an extension for LDIF to allow XML values to be represented in a human-readable format instead of requiring the use of base64. Unfortunately their proposal has interchange issues (see the I-D's security considerations section). This, I think, is a fatal problem with this extension.

So really this is the issue, should the value of the line separator be part of the data, or should everybody (LDIF importers/exporters, LDAP servers, LDAP clients) treat multi-line entries as just that, several lines, and choose their own line separator ? (in case I wasn't clear earlier, I am in favour of the latter).

In RFC 2849, <SEP> is never part of the LDAP attribute value.

In the ELDIF extension, certain <SEP>s are part of the LDAP attribute value. This, I think, is problematic.


--
Yves.
http://www.sollers.ca/

_______________________________________________
Ldapext mailing list
Ldapext@ietf.org
https://www.ietf.org/mailman/listinfo/ldapext
