[Date Prev][Date Next] [Chronological] [Thread] [Top]

Re: (ITS#8430) Improved handling for large number of databases



On Sat, May 28, 2016 at 3:28 AM, Hallvard Breien Furuseth
<h.b.furuseth@usit.uio.no> wrote:
> On 27/05/16 23:39, juerg.bircher@gmail.com wrote:
>>
>> On 27/05/16 14:42, "Hallvard Breien Furuseth" <h.b.furuseth@usit.uio.no>
>> wrote:
>>>
>>> I think MDB_WORD intends to be mdb_size_t, but should be 'unsigned
>>> long': It's probably no optimization to use 'long long' when that
>>> is not a native integer type.  But I haven't profiled that.
>>
>>
>> The MDB_WORD is used to get the appropriate word size of the CPU for
>> the optimised dbflag check. So on a 32 bit CPUs it should be an 32 bit
>> integer and on a 64 bit CPUs a 64 bit integer.
>
>
> That's why I suggested unsigned long.  It seems the best guess for a
> native wide type.  But it's hard to guess what kind of CPU is in use.
>
>> But the mdb_size_t seems always to be 64 bits?
>
>
> On a 32-bit host it's normally 32-bit, but you get 64-bit mdb_size_t anyway
> if you build and use LMDB with MDB_VL32 defined.  I misread
> your patch, it should not use MDB_VL32.
>
>>> The delay in picking up env->me_dbflags is a bit troubling, since it
>>> means creating a new txn is no longer quite atomic.  I can't think of
>>> how it'll make a difference when DBIs are used correctly, but that's one
>>> thing people screw up.  Maybe it'd be OK if renew0() just copies it into
>>> mt_dbs[].md_flags without the '&', and mdb_setup_db_info() fixes it up.
>>
>>
>> Is it really atomic in original lmdb? The flags are copied from the
>> environment and calculated. But while copying the dbflags from the
>> environment they may change. So read level consistency and therefore
>> atomicity is not guaranteed.
>
>
> I should have said, it's atomic when DBIs are used correctly.  Such as
> obeying this limitation from the mdb_dbi_open() doc and a similar one
> for mdb_dbi_close():
>
> * A transaction that uses
> * this function must finish (either commit or abort) before
> * any other transaction in the process may use this function.
>
> To guarante atomicity we'd need locks around some DBI operations.
> To avoid that, LMDB simply calls unsafe use "user errors" - which can
> cause breakage.  Hence those annoying DBI usage restrictions.
>
> LMDB does attempt to catch user errors when it's cheap to try, see e.g.
> TXN_DBI_CHANGED().  Doesn't work at all for read-only txns, they have
> txn->mt_dbiseqs == env->me_dbiseqs.
>
> So you can see why DBI cleverness makes me nervous.
> I think the MDB_WORD looping can go in now, but we'll need to think
> carefully about the lazy init.

I understand that it is a tricky thing. But maybe the lazy init is
still feasible? In my understand without having all the background it
seems ok to get the current snapshot of the database flags from the
env once it is firstly accessed whether it is at the beginning or
later in the transaction.
Read level consistency should be preserved anyway as mt_txnId is set
in mdb_txn_renew0().

>
>> But I guess it just does not matter.
>> So picking up the flags lazily just extends the time of copying the
>> flags from the environment but is basically the same.
>> Atomicity and also isolation is a must in respect to the data but I
>> think not in respect to the database handles and their flags.
>> Maybe I am overseeing something as I do not know all the thoughts
>> between the lines.
>
>
> With ITS#8431 I can think of one thing:
>
> There's no sync between mdb_dbi_open() in one txn and another txn
> using the new DBI.  It'll be up to the user to synchronize DBIs.  For
> that to work in write-txs, TXN_DBI_CHANGED() must go away, and writers
> will have no way to try to catch user errors with sync.  Or if it
> should only work in read-only txns, then read-only txns and write-txns
> must have different definitions of what is a "user error" with DBIs.
>
> Anyway, we'll need to think about the definition of user errors.
>

So if you drop ITS#8431 it would at least be very helpful to get an
separate error like MDB_INVALID_DPI instead of EINVAL So the
application sharing and caching handles (dbi) could handle the
situation appropriate like use mdb_env_open() so that the dbi becomes
valid in that txn as well.

>
> Other notes:
>
> /** ... */ above a global declaration or /**< ... */ to the right
> of it tells Doxygen that the comment is the doc of the declared item.
> "#foo" inside a Doxygen comment is a link to the declaration of foo.
>
>
> Maybe mt_dbflags should have an extra byte or word at the end which
> does not have DB_UNUSED, and with size rounded up to sizeof(MDB_WORD).
> Then stepping to next DBI can do something like this, unrelated to
> the check against mt_numdbs.  Not sure if that's an improvement, I'm
> getting tired:
>
> /** Return an #MDB_WORD where each byte is \b byte */
> #define WORD_BYTES(byte)        ((MDB_WORD)-1 / 0xff * (byte))
>
> /** Step up to next used DBI */
> #define DBI_STEP(txn, dbi) do { \
>     if (! (++(dbi) & (sizeof(mdb_word_t)-1))) \
>         while (*(mdb_word_t *)((txn)->mt_dbflags + (dbi)) ==
> WORD_BYTES(DB_UNUSED))
>             (dbi) += sizeof(MDB_WORD); \
> } while (0)
>

Yes round it up to sizeof(MDB_WORD) would be obvious in connection
with the optimized loop. If you are willing to add some more #ifdef we
could even get it faster at least for x86_64 using SSE2 instructions.
Then MDB_WORD would be 128 bit and we check 16 flags in one step using
_mm_cmpeq_epi8 and _mm_movemask_epi8 to get the mask of matching
flags. If the mask != 0xffffffff we found an used database. Then we
negate the mask and with __builtin_ffs(mask) we could then find the
index of the first used database.