
Re: String conversions UTF8 <-> ISO-8859-1



> let me try to restart the discussion a bit.

I'm not quite sure how to reply to this one - shall I just repost some
of what I said before?  Well...

First of all, I think you are trying to bite off too much.  I think a
character set conversion API which tries to be everything to everyone is
bound to be too clumsy.  Which is why I'm personally only interested in
getting it to be useful in _most_ cases.  I think people can do their
own conversion in the remaining cases.

> There are applications which use different character sets and encodings
> when interacting with the user than when interacting with the
> directory.  Those applications will need access to an appropriate
> conversion routine.  Personally, I think applications should
> deal with conversion issues at the user interface, not at the LDAP
> interface.  (...)

Meaning, the application should think UTF-8 internally?  I disagree.
The choice of internal character set in the application is up to the
application developer, not to us.

Sure, some applications should think in UTF-8, but far from all.  For
example, we have one which speaks latin-1 to the user and uses a latin-1
database.  No need for that one to convert everything to UTF-8 and back
just because it will be using LDAP which speaks UTF-8.

For that matter, even if the application thinks in Unicode, it may well
be using some other encoding than UTF-8, so it would still need to
convert LDAP data to/from UTF-8.
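
For concreteness, here is roughly what such a latin-1 application would
have to apply at the LDAP boundary.  This is only a minimal sketch: the
function names are hypothetical and not part of any LDAP library, and
the caller is assumed to supply a large enough output buffer.

```c
#include <stddef.h>

/* Hypothetical helper: convert n bytes of ISO-8859-1 to UTF-8.
 * dst must have room for 2*n + 1 bytes.
 * Returns the number of bytes written, not counting the final NUL. */
size_t latin1_to_utf8(const unsigned char *src, size_t n, unsigned char *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char c = src[i];
        if (c < 0x80) {
            dst[out++] = c;                      /* ASCII is unchanged */
        } else {
            dst[out++] = (unsigned char)(0xC0 | (c >> 6));
            dst[out++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    dst[out] = '\0';
    return out;
}

/* Hypothetical helper: convert n bytes of UTF-8 to ISO-8859-1.
 * dst must have room for n + 1 bytes.
 * Returns the number of bytes written, or (size_t)-1 if the input is
 * malformed or contains a character outside ISO-8859-1. */
size_t utf8_to_latin1(const unsigned char *src, size_t n, unsigned char *dst)
{
    size_t out = 0;
    size_t i = 0;
    while (i < n) {
        unsigned char c = src[i];
        if (c < 0x80) {
            dst[out++] = c;
            i += 1;
        } else if ((c == 0xC2 || c == 0xC3) && i + 1 < n
                   && (src[i + 1] & 0xC0) == 0x80) {
            /* Two-byte sequences C2 80 .. C3 BF cover U+0080 .. U+00FF. */
            dst[out++] = (unsigned char)(((c & 0x03) << 6) | (src[i + 1] & 0x3F));
            i += 2;
        } else {
            return (size_t)-1;
        }
    }
    dst[out] = '\0';
    return out;
}
```

The second direction can fail, which is one reason the application and
not the library has to decide what happens to unconvertible data.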

Finally, this discussion is in any case only relevant to applications
that do not think in UTF-8 internally.  Applications that think in UTF-8
are irrelevant to any conversion tools we might make because they'll use
their own conversion tools to convert everything at once.  They won't
single out LDAP data to be converted specially.

> Anyways, the LDAP API, at least as currently designed, has little
> knowledge of which values are character strings as the protocol itself
> does not impart that knowledge in its encoding but by tokens whose
> semantics are defined in user application schema or solely by
> applications.

Yes, that's why any conversion tool we make must contain a hook for the
user to specify schema information.
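
To make "a hook for schema information" a little more concrete, here is
one possible shape for it.  Everything in this sketch is an assumption
for illustration: the hook type, the function names and the short list
of text attributes are made up, not an existing or proposed API.  The
hook receives the full attribute description, options included.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical hook type: returns nonzero if values of the given
 * attribute description should be treated as text. */
typedef int (*schema_is_text_fn)(const char *attr_desc, void *ctx);

/* Compare the token at the start of s (up to ';' or end of string)
 * with name, ignoring ASCII case. */
static int token_is(const char *s, const char *name)
{
    size_t i = 0;
    while (s[i] != '\0' && s[i] != ';' && name[i] != '\0') {
        if (tolower((unsigned char)s[i]) != tolower((unsigned char)name[i]))
            return 0;
        i++;
    }
    return (s[i] == '\0' || s[i] == ';') && name[i] == '\0';
}

/* Nonzero if the attribute description carries the given option. */
static int has_option(const char *attr_desc, const char *opt)
{
    const char *p = strchr(attr_desc, ';');
    while (p != NULL) {
        p++;                                  /* skip the ';' */
        if (token_is(p, opt))
            return 1;
        p = strchr(p, ';');
    }
    return 0;
}

/* Example hook an application might supply (illustrative list only). */
int example_is_text(const char *attr_desc, void *ctx)
{
    static const char *const text_types[] = { "cn", "sn", "description" };
    size_t i;
    (void)ctx;
    if (has_option(attr_desc, "binary"))
        return 0;                             /* never convert ;binary */
    for (i = 0; i < sizeof text_types / sizeof text_types[0]; i++)
        if (token_is(attr_desc, text_types[i]))
            return 1;
    return 0;
}
```

With this, "CN;lang-en" is reported as text, while "jpegPhoto" and
anything carrying ";binary" are left alone.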

> Also, the API is unaware of extensions (attribute description options,
> controls, etc.) which might affect the encoding of strings carried in
> the protocol.

I'm not convinced that this is a real issue:

- attribute description options:

  The API _is_ aware of these: Since it must be schema-aware, it must be
  passed the attribute descriptions anyway.  (Sorry, I said attribute
  _types_ before.  I should have said descriptions, partly for this
  reason.)

- Extensions - extra BER fields in an LDAPMessage:

  I don't think these may change the encoding of data in the message,
  because the receiver may ignore extra BER fields, and if it does, it
  will interpret the data incorrectly.

- Controls in requests:

  The application knows which controls it sends, so if a control says
  the encoding of the data is not UTF-8, the application knows it need
  not use the conversion API.

- Controls in responses:

  It seems dangerous to me for these to modify the character set or
  encoding of data in the response, unless the user asked for that, so I
  doubt servers will do so 'spontaneously' in real life.  OTOH, I guess
  the user could send a critical 'convert data in the response' control,
  and the server would include a 'data converted' control.  Again, a non-
  issue:  The application knows what it asked for, so it need not check.

  If the 'convert data in the response' control is _not_ critical, the
  application must do _more_ work than if it did not ask for conversion,
  since it must handle both cases, so it seems a pretty useless thing to
  do.

  Still, if the possibility worries you, the solution would be for the
  API to handle the controls first and the rest of the data afterwards,
  and to pass the 'convert or not' conclusion from the control to the
  rest of the data handling.  Or the application could just do an
  explicit if (!conversion_control_provided) { convert; }, but I suspect
  it would need to pass the conversion_control_provided flag around a
  bit.

- Character data in controls, extra BER fields and extended requests:

  Whatever we do, the API can only unpack and handle the ones both it
  and the user knows.  The API _can_ be sent them in their raw form,
  though.  E.g. convert(LDAP_CONTROL, "oid", "value").
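
The "handle the controls first, the rest of the data afterwards" idea
from the response-controls item could look something like the sketch
below.  All names are hypothetical, the OID is a made-up placeholder
and not a registered control, and example_convert is only a stand-in
for a real character set conversion.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Made-up placeholder OID for a 'data converted' response control. */
#define EXAMPLE_CONVERTED_OID "1.1.1.1.1"

/* Hypothetical, simplified view of a control. */
struct example_control {
    const char *oid;
    int critical;
};

typedef void (*example_convert_fn)(char *value);

/* Step 1: look at the controls and draw the 'convert or not' conclusion. */
int conversion_control_provided(const struct example_control *ctrls, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(ctrls[i].oid, EXAMPLE_CONVERTED_OID) == 0)
            return 1;
    return 0;
}

/* Step 2: pass that conclusion on to the handling of the data.
 * Returns the number of values that were converted locally. */
size_t handle_response(const struct example_control *ctrls, size_t nctrls,
                       char **values, size_t nvalues,
                       example_convert_fn convert)
{
    if (conversion_control_provided(ctrls, nctrls))
        return 0;                 /* server already converted the data */
    for (size_t i = 0; i < nvalues; i++)
        convert(values[i]);
    return nvalues;
}

/* Stand-in for a real conversion: upper-cases ASCII letters in place. */
void example_convert(char *value)
{
    for (; *value != '\0'; value++)
        *value = (char)toupper((unsigned char)*value);
}
```

The flag never leaves handle_response here, which is the point of
letting the API do the ordering instead of the application.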

> Even if you were to make the API schema-aware, the API would not
> be aware of extensions it does not implement.

Again, I think you are trying to bite off too much.  If any API we make
must handle extensions it does not know, we can't make any such API at
all.

> Also, the API would not be aware of encoding conventions which are not
> reflected in the schema.

Example?

> Now, you suggested some sort of callback mechanism.  (...)

I'll mostly skip this part for now, since we both have strong opinions
about it.  Let's see what else we can agree on in general before we dive
too far into the API choice.

Except...

> callbacks tend only to cover half the problem: conversion of
> information being provided by the directory service.

Huh?  There would be a callback to convert the other way too, of course.
Or if you are talking about conversion of non-LDAP data, as I said I
think that's irrelevant because an application which does that won't use
our API for conversion at all.

> It has been suggested that another approach would be to have a
> higher level API where strings passed (in both directions) between
> the library and the application in the local character set/encoding.
> This library would need to be schema-aware.  It also would need
> some mechanism for the application to impart additional knowledge
> (such as "passwords I provide are textual").

You mean 'bindRequest passwords that I provide should be converted'?
Of course.  Same with callbacks.  Whatever mechanism we provide must
be able to handle all data fields that may be textual, and give the
application the choice of which fields to actually convert.
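
One simple way to give the application that choice is a set of flags,
one per kind of field that may be textual.  This is only a sketch under
my own assumptions: the field list is illustrative and incomplete, and
none of these names exist in any LDAP API.

```c
/* Hypothetical kinds of fields that may hold text. */
enum example_field {
    EX_FIELD_DN            = 1 << 0,
    EX_FIELD_ATTR_VALUE    = 1 << 1,
    EX_FIELD_BIND_PASSWORD = 1 << 2,
    EX_FIELD_CONTROL_VALUE = 1 << 3
};

/* The application's choice of which fields to convert. */
struct example_conv_config {
    unsigned fields;              /* bitwise OR of enum example_field */
};

/* Nonzero if the application asked for this kind of field to be converted. */
int example_should_convert(const struct example_conv_config *cfg,
                           enum example_field field)
{
    return (cfg->fields & (unsigned)field) != 0;
}
```

An application whose passwords are textual would set
EX_FIELD_BIND_PASSWORD; one that stores binary passwords would leave it
clear.  The same flags work whether the conversion is done through
callbacks or through a higher-level API.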

> But, of course, this would not address values carried outside of the
> core protocol (such as in controls)

Just the opposite:  Part of the API must be for handling text in
controls.

> nor would it address changes in semantics of the
> core protocol because of the use of extensions.

Don't see why not.  But when you say 'extensions', please say what kind
of extensions you mean: Controls, extended requests, or extra BER fields
in the LDAPMessage.

>> If not, what exactly do you propose?
> 
> I was thinking more of a collection of "tools" (or helper) routines
> that acted upon structures already returned by the API.  The
> applications could use them to make it easier to not only perform
> charset/encoding conversions but also other conversions (such
> as language translation).

Sure, whatever mechanism we provide could be used for more than
character set conversion.

> For example, maybe provide a "foreach entry" routine which calls
> an application-specified function on each entry in a message
> chain (previously provided by the API).  And then a "foreach
> attribute" routine... etc..

This sounds very slow.  Seems to me it would entail a lot of unpacking
and repacking of BER elements in the LDAPMessages.

-- 
Hallvard