Re: LMDB and text encoding



I've had a brief chat with Hallvard on IRC. We came up with several
possible solutions, each of which has its drawbacks. Writing
cross-platform code that supports Unicode is always a messy business.
I vote for option 4, but would like to hear everyone's opinions before
starting work on any of them.

1) Separate widechar functions

Add functions such as mdb_env_open_w that call the widechar Windows
APIs. The drawback of this approach is that it would require a lot of
duplicate code, which is hard to maintain. It would also pollute the
lmdb header file.
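
To illustrate the duplication concern, here is a minimal sketch (not
LMDB's actual code) of a helper that appends the lock file name to the
environment path, written once for char and once for wchar_t. The names
lock_path and lock_path_w are hypothetical; the sketch assumes the lock
file is called lock.mdb inside the environment directory.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

#define LOCK_SUFFIX   "/lock.mdb"
#define LOCK_SUFFIX_W L"/lock.mdb"

/* Narrow variant: returns a malloc'd "<path>/lock.mdb", or NULL. */
char *lock_path(const char *path)
{
    size_t len;
    char *out;

    assert(path != NULL);
    len = strlen(path);
    /* sizeof(LOCK_SUFFIX) already counts the terminating NUL. */
    out = malloc(len + sizeof(LOCK_SUFFIX));
    if (out == NULL)
        return NULL;
    memcpy(out, path, len);
    memcpy(out + len, LOCK_SUFFIX, sizeof(LOCK_SUFFIX));
    return out;
}

/* Wide variant: the same logic again, with every call swapped for its
 * wchar_t counterpart. This is the duplication option 1 would need. */
wchar_t *lock_path_w(const wchar_t *path)
{
    size_t len;
    wchar_t *out;

    assert(path != NULL);
    len = wcslen(path);
    out = malloc(len * sizeof(wchar_t) + sizeof(LOCK_SUFFIX_W));
    if (out == NULL)
        return NULL;
    wmemcpy(out, path, len);
    wmemcpy(out + len, LOCK_SUFFIX_W,
            sizeof(LOCK_SUFFIX_W) / sizeof(wchar_t));
    return out;
}
```

Every path-handling routine in the library would need a twin like this,
and both copies would have to be kept in sync.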

2) New flag

Introduce a new flag (such as MDB_USE_WCHAR) that would tell
mdb_env_open to cast the path parameter to wchar_t* under the hood and
call the widechar variant of the Windows API.

Advantage: only the string concatenation code would need to be duplicated.
Drawback: it is really, really ugly.
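
A rough sketch of why this is ugly: the parameter stays declared as
const char *, and a flag decides how the bytes behind it are
interpreted, so the compiler can no longer catch a mismatch. The names
SKETCH_USE_WCHAR and sketch_path_len are made up for illustration and
are not part of LMDB.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

#define SKETCH_USE_WCHAR 0x1u  /* hypothetical stand-in for MDB_USE_WCHAR */

/* Returns the number of characters in path. The flag, not the type,
 * says whether path really points at char or at wchar_t data. */
size_t sketch_path_len(const char *path, unsigned int flags)
{
    assert(path != NULL);
    if (flags & SKETCH_USE_WCHAR) {
        /* Only valid if the caller really passed a wchar_t buffer. */
        return wcslen((const wchar_t *)(const void *)path);
    }
    return strlen(path);
}
```

A caller with a wide string would have to write
sketch_path_len((const char *)wide_path, SKETCH_USE_WCHAR), and
forgetting either the cast or the flag leads to a silently
misinterpreted path.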

3) Require UTF-16 on Windows

Since Microsoft discourages the use of their ANSI APIs, we could say
that we require UTF-16 on Windows. We can make a type such as
mdb_uchar_t that we would typedef to char on Unix and wchar_t on
Windows, and then we could change the function signatures to use this
type.

Drawback: users who want to write cross-platform code would need to
#ifdef their calls to mdb_env_open.
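
A minimal sketch of what this could look like, assuming the typedef is
keyed on _WIN32. Only mdb_uchar_t comes from the proposal; the functions
sketch_path_length and example_db_path are hypothetical and exist only
to show where the #ifdef ends up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Library side: one path character type per platform. */
#ifdef _WIN32
typedef wchar_t mdb_uchar_t;
#else
typedef char mdb_uchar_t;
#endif

/* Library side: a single signature, but the body still branches. */
size_t sketch_path_length(const mdb_uchar_t *path)
{
    assert(path != NULL);
#ifdef _WIN32
    return wcslen(path);
#else
    return strlen(path);
#endif
}

/* User side: cross-platform callers need their own #ifdef just to
 * spell the path, which is the drawback of this option. */
const mdb_uchar_t *example_db_path(void)
{
#ifdef _WIN32
    return L"testdb";
#else
    return "testdb";
#endif
}
```

This is essentially the same pattern as TCHAR in the Windows headers,
with the same burden on the caller.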

4) Require UTF-8 on Windows

Let's say we require the path parameter to be encoded in UTF-8, even
on Windows. Then under the hood we can convert it to UTF-16 and call
the widechar APIs. This doesn't lead to loss of performance because
Windows itself converts to UTF-16 anyway if you use their ANSI
functions.
This is the least ugly and perhaps the easiest-to-implement solution
we found. UTF-8 is easy to produce (most libraries can generate it, or
the user could use u8"..." from C++11, etc.)

Advantage: this is the easiest to implement; code that worked before
(with ASCII paths) will work without modification, since ASCII is a
subset of UTF-8, and we don't need to duplicate any code.
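
To make the conversion step concrete, here is a portable sketch of
turning a UTF-8 path into UTF-16 code units. On Windows the same job
can be done with MultiByteToWideChar using CP_UTF8, so a real patch
would most likely call that instead; the hand-written decoder below is
only an illustration, and the names utf8_decode and utf8_to_utf16 are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Decode one UTF-8 sequence starting at s and store the code point in
 * *cp. Returns the number of bytes consumed, or 0 on invalid input. */
static size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    size_t n, i;
    uint32_t min;

    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; n = 2; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; n = 3; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; n = 4; min = 0x10000; }
    else return 0;

    for (i = 1; i < n; i++) {
        /* A non-continuation byte (including an early NUL) is an error. */
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values above U+10FFFF. */
    if (*cp < min || *cp > 0x10FFFF || (*cp >= 0xD800 && *cp <= 0xDFFF))
        return 0;
    return n;
}

/* Convert a NUL-terminated UTF-8 string to a malloc'd, NUL-terminated
 * UTF-16 string. Returns NULL on invalid input or allocation failure. */
uint16_t *utf8_to_utf16(const char *utf8)
{
    const unsigned char *s = (const unsigned char *)utf8;
    uint16_t *out;
    size_t o = 0;

    assert(utf8 != NULL);
    /* UTF-16 never needs more code units than UTF-8 needs bytes. */
    out = malloc((strlen(utf8) + 1) * sizeof(*out));
    if (out == NULL)
        return NULL;

    while (*s != 0) {
        uint32_t cp;
        size_t n = utf8_decode(s, &cp);
        if (n == 0) { free(out); return NULL; }
        s += n;
        if (cp < 0x10000) {
            out[o++] = (uint16_t)cp;
        } else {
            /* Code points above the BMP become a surrogate pair. */
            cp -= 0x10000;
            out[o++] = (uint16_t)(0xD800 | (cp >> 10));
            out[o++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    out[o] = 0;
    return out;
}
```

With something like this in place, the Windows code path would convert
the path once on entry and hand the result to the widechar APIs, while
the public signatures stay char-based on every platform.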