[Date Prev][Date Next] [Chronological] [Thread] [Top]

Re: (ITS#8430) Improved handling for large number of databases



On 27/05/16 23:39, juerg.bircher@gmail.com wrote:
> On 27/05/16 14:42, "Hallvard Breien Furuseth" <h.b.furuseth@usit.uio.no> wrote:
>> I think MDB_WORD intends to be mdb_size_t, but should be 'unsigned
>> long': It's probably no optimization to use 'long long' when that
>> is not a native integer type.  But I haven't profiled that.
>
> The MDB_WORD is used to get the appropriate word size of the CPU for
> the optimised dbflag check. So on a 32 bit CPUs it should be an 32 bit
> integer and on a 64 bit CPUs a 64 bit integer.

That's why I suggested unsigned long.  It seems the best guess for a
native wide type.  But it's hard to guess what kind of CPU is in use.

> But the mdb_size_t seems always to be 64 bits?

On a 32-bit host it's normally 32-bit, but you get 64-bit mdb_size_t 
anyway if you build and use LMDB with MDB_VL32 defined.  I misread
your patch, it should not use MDB_VL32.

>> The delay in picking up env->me_dbflags is a bit troubling, since it
>> means creating a new txn is no longer quite atomic.  I can't think of
>> how it'll make a difference when DBIs are used correctly, but that's one
>> thing people screw up.  Maybe it'd be OK if renew0() just copies it into
>> mt_dbs[].md_flags without the '&', and mdb_setup_db_info() fixes it up.
>
> Is it really atomic in original lmdb? The flags are copied from the
> environment and calculated. But while copying the dbflags from the
> environment they may change. So read level consistency and therefore
> atomicity is not guaranteed.

I should have said, it's atomic when DBIs are used correctly.  Such as
obeying this limitation from the mdb_dbi_open() doc and a similar one
for mdb_dbi_close():

* A transaction that uses
* this function must finish (either commit or abort) before
* any other transaction in the process may use this function.

To guarante atomicity we'd need locks around some DBI operations.
To avoid that, LMDB simply calls unsafe use "user errors" - which can
cause breakage.  Hence those annoying DBI usage restrictions.

LMDB does attempt to catch user errors when it's cheap to try, see e.g.
TXN_DBI_CHANGED().  Doesn't work at all for read-only txns, they have
txn->mt_dbiseqs == env->me_dbiseqs.

So you can see why DBI cleverness makes me nervous.
I think the MDB_WORD looping can go in now, but we'll need to think
carefully about the lazy init.

> But I guess it just does not matter.
> So picking up the flags lazily just extends the time of copying the
> flags from the environment but is basically the same.
> Atomicity and also isolation is a must in respect to the data but I
> think not in respect to the database handles and their flags.
> Maybe I am overseeing something as I do not know all the thoughts
> between the lines.

With ITS#8431 I can think of one thing:

There's no sync between mdb_dbi_open() in one txn and another txn
using the new DBI.  It'll be up to the user to synchronize DBIs.  For
that to work in write-txs, TXN_DBI_CHANGED() must go away, and writers
will have no way to try to catch user errors with sync.  Or if it
should only work in read-only txns, then read-only txns and write-txns
must have different definitions of what is a "user error" with DBIs.

Anyway, we'll need to think about the definition of user errors.


Other notes:

/** ... */ above a global declaration or /**< ... */ to the right
of it tells Doxygen that the comment is the doc of the declared item.
"#foo" inside a Doxygen comment is a link to the declaration of foo.


Maybe mt_dbflags should have an extra byte or word at the end which
does not have DB_UNUSED, and with size rounded up to sizeof(MDB_WORD).
Then stepping to next DBI can do something like this, unrelated to
the check against mt_numdbs.  Not sure if that's an improvement, I'm
getting tired:

/** Return an #MDB_WORD where each byte is \b byte */
#define WORD_BYTES(byte)	((MDB_WORD)-1 / 0xff * (byte))

/** Step up to next used DBI */
#define DBI_STEP(txn, dbi) do { \
     if (! (++(dbi) & (sizeof(mdb_word_t)-1))) \
         while (*(mdb_word_t *)((txn)->mt_dbflags + (dbi)) == 
WORD_BYTES(DB_UNUSED))
             (dbi) += sizeof(MDB_WORD); \
} while (0)