Re: Charset translations: Request for comments



Let me elaborate on a couple of topics where I really need feedback.

The first topic is charset management: how charsets are made available
to a program, how they are found when needed, and how they are
pre-processed for fast loading.

The current code permits either loading them (read: parsing them)
at runtime from the original, human-readable definition, or
precompiling them into the executable/library.

The first approach is very flexible, but requires the application
to somehow find the correct file, which might not be named exactly
after the charset.  It also has two problems:  first, two programs
that load the same charset do not share the memory for it (that may
be important for large charsets) and, second, efficiency is not
brilliant and may have an appreciable impact on short-lived programs
(think CGIs and mail delivery agents, not to mention applications
using nss_ldap).

The second approach is very efficient:  lost-in-the-noise CPU impact
and implicit sharing of charsets in memory.  The problem is that, to
get these advantages, you have to compile those charsets into the
library.  Once the library is installed, the set of optimized
charsets is fixed and new charsets have to be handled with the first
method (both are supported simultaneously).
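
For illustration only, a compiled-in charset could boil down to a
generated static table like the sketch below; the names and layout
are hypothetical, not the actual data structures in the tree:

/*
 * Hypothetical sketch: a build-time tool would emit one such table
 * per charset.  Being constant data, the tables live in read-only
 * pages that the OS shares among all processes using the library,
 * which is where the implicit sharing comes from.
 */
typedef struct precompiled_charset {
        const char      *pc_name;       /* canonical charset name */
        unsigned long   pc_map[256];    /* code point for each octet */
} precompiled_charset;

static const precompiled_charset charset_iso_8859_1 = {
        "ISO-8859-1",
        { 0x0000, 0x0001, 0x0002, 0x0003 }  /* rest of the entries omitted */
};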

There is a third approach that I don't currently use: preprocessing
charsets into fast-load files that are read at run-time.  In this
case, load time can be very short.  But to exploit the real potential
of this method, mmap should be used so that the memory for the
charset is shared.
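
A minimal sketch of what the fast-load path could look like on
platforms that have mmap(); the file format, the function name and
its signature are all assumptions of mine:

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stddef.h>

/*
 * Hypothetical fast-load: map a preprocessed charset file read-only.
 * Because the mapping is backed by the file, the kernel can use the
 * same physical pages for every process that maps it, so the charset
 * data is shared for free.
 */
static const void *
charset_fastload( const char *path, size_t *lenp )
{
        struct stat st;
        void *p;
        int fd;

        fd = open( path, O_RDONLY );
        if ( fd < 0 )
                return NULL;
        if ( fstat( fd, &st ) < 0 ) {
                close( fd );
                return NULL;
        }
        p = mmap( NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0 );
        close( fd );    /* the mapping stays valid after close */
        if ( p == MAP_FAILED )
                return NULL;
        *lenp = (size_t) st.st_size;
        return p;
}

On platforms without mmap() the same file could still be slurped in
with read(), losing the sharing but keeping the cheap load, which is
one reason the platform feedback below matters.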

I would appreciate comments on this, especially as related to platform
capabilities, given the wide range of platforms where OpenLDAP works
or will work in the future.

The second topic is:

> /*
>  * Important Note:  kdz has expressed doubts about this interface that
>  * is inspired by the interface of the translation routines used by
>  * liblber and libldap/charset.c.  So this may change as soon as we
>  * make up our minds about what is best.
>  */

So that you can understand what we are talking about, this is the
current interface:

LDAP_F int
ldap_utf8_to_charset LDAP_P((Charset *cs, TranslationMode mode,
        char **bufp, unsigned long *buflenp, int free_input));

while Kurt has suggested something like this:

LDAP_F int
ldap_utf8_to_charset LDAP_P((Charset *cs, TranslationMode mode,
        char **outp, char *inp));

In the first case, the model is that whoever creates the data is
responsible for telling others about its length.

In the second case, the model is that everyone is responsible for
finding out how long something is.
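
To see the difference from the caller's side, here is roughly what
each model looks like in use.  This is only a shape sketch with
made-up variable names (each call assumes the corresponding prototype
above), not code from the tree:

/* First model: the input length goes in and the output length comes
 * back through the same variable, so nobody ever scans the data.
 */
char            *buf = utf8_input;      /* UTF-8 data to translate */
unsigned long   buflen = utf8_len;      /* its length in octets */

if ( ldap_utf8_to_charset( cs, mode, &buf, &buflen, 0 ) != 0 )
        /* handle the error */ ;
/* buf now holds the translated data, buflen its length */

/* Second model: only an output pointer comes back; whatever length
 * the caller needs, it has to work out by itself.
 */
char *out;

if ( ldap_utf8_to_charset( cs, mode, &out, utf8_input ) != 0 )
        /* handle the error */ ;
/* out holds the translated data, but its length is unknown here */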

Remember that what we are handling is an arbitrary character stream
represented in a certain charset that might use an arbitrary encoding,
so looking for a NUL terminator is not an option; the length must be
derived by scanning the data.  Since Charset is opaque, the calling
application cannot do this.

So the only way the second case can be made to work is by having the
translation routines return a signed integer type large enough to
represent the length of the longest encoding we can produce, or a
negative value indicating an error.
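
Concretely, the second interface would have to grow into something
along these lines; the choice of long and the const qualifier are
only assumptions on my part:

/* Sketch: the translated length, or a negative error code, becomes
 * the return value, so the caller never has to scan the output.
 */
LDAP_F long
ldap_utf8_to_charset LDAP_P((Charset *cs, TranslationMode mode,
        char **outp, const char *inp));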

The first case is inspired by the translation routines in lber.h that
are designed to work with arbitrary octet streams, not necessarily
NUL-terminated.  Since the new routines only have to work with real
character streams, the requirement that the creator tell everyone
about the length is no longer necessary.

I think I understand the issues better now and I am inclined to follow
Kurt's advice, but before I implement such an important change, I'd
like to hear more opinions.

Please notice that, even though the current routines I implemented
will expand allocated areas if needed, they are better served by an
informed guess about how large the output will be, and they would have
to parse the input stream to get one.  Reuse of staging areas between
invocations is an option, but it is complicated by threading issues.
Any advice in this respect is heartily welcome.
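
One way out, assuming the staging area is owned by the caller (or
kept per thread) rather than shared, is sketched below; the struct
and function names are invented for the example:

#include <stdlib.h>

/*
 * Hypothetical per-caller staging area.  Each thread or connection
 * owns its own instance, so no locking is needed, and the buffer only
 * ever grows, so the realloc cost is amortized over many translations.
 */
typedef struct staging_area {
        char    *sa_buf;        /* reusable output buffer */
        size_t  sa_size;        /* current capacity in octets */
} staging_area;

static char *
staging_reserve( staging_area *sa, size_t need )
{
        char *p;

        if ( need <= sa->sa_size )
                return sa->sa_buf;
        p = realloc( sa->sa_buf, need );
        if ( p == NULL )
                return NULL;
        sa->sa_buf = p;
        sa->sa_size = need;
        return p;
}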

Thanks in advance,
 
Julio