8430 – Improved handling for large number of databases

Issue 8430 - Improved handling for large number of databases

Summary: Improved handling for large number of databases

Status:	UNCONFIRMED

Alias:	None

Product:	LMDB
Classification:	Unclassified
Component:	liblmdb (show other issues)
Version:	unspecified
Hardware:	All All

Importance:	--- normal
Target Milestone:	---
Assignee:	OpenLDAP project

URL:
Keywords:

Depends on:
Blocks:

Reported:	2016-05-27 10:57 UTC by juerg.bircher@gmail.com
Modified:	2020-03-22 22:45 UTC (History)
CC List:	0 users

See Also:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this issue.

Description juerg.bircher@gmail.com 2016-05-27 10:57:26 UTC

Full_Name: Juerg Bircher
Version: lmdb (master)
OS: MacOS / Linux
URL: ftp://ftp.openldap.org/incoming/Juerg_Bircher_160527-Improved-handling-for-large-number-of-databases.patch
Submission from: (NULL) (178.82.37.195)


There is a increased performance penalty the more databases are created within
the same environment. I was looking for a way the improve that by keeping the
simplicity of tracking databases within a list with direct access by index
(MDB_dbi).
mdb_dbi_open() is however not improved with the assumption that the database
handle (dbi) is cached in the application. So mdb_dbi_open() should happen only
once for each database during the life time of an application.


One issue is that mdb_txn_begin() (for read-only transactions) calloc the
sizeof(MDB_txn) + me_maxdbs * sizeof(MDB_db + 1). The plus 1 for the dbflags.
However it is sufficient only to malloc that size and clear the sizeof(MDB_txn)

    memset(txn, 0, sizeof(MDB_txn)
    
After that the data beyond the MDB_txn is not initialized which is ok for the
moment.

The next improvement happens in mdb_txn_renew0() where the dbflags are only set
to DB_UNUSED (a new flag) for each database currently opened in the
environment.

    memset(txn->mt_dbflags, DB_UNUSED, txn->mt_numdbs);

The former code used to to loop through each database to calculate the dbflags.
This is still done but lazily for each accessed database with the assumption
that a read only transaction rarely uses all databases of the environment.

The lazy initialization of the dbflag happens in the macro TXN_DBI_EXIST which
is always used when a database handle (dbi) is passed to an function.
The flags are updated in mdb_setup_db_info() once a database is access which is
marked as unused (DB_UNUSED).

static int mdb_setup_db_info(MDB_txn *txn, MDB_dbi dbi) {
    /* Setup db info */
    uint16_t x = txn->mt_env->me_dbflags[dbi];
    txn->mt_dbs[dbi].md_flags = x & PERSISTENT_FLAGS;
    txn->mt_dbflags[dbi] = (x & MDB_VALID) ? DB_VALID|DB_USRVALID|DB_STALE : 0;
    return (txn->mt_dbflags[dbi] & validity);
}

    /** Check \b txn and \b dbi arguments to a function and initialize db info
if needed */
#define TXN_DBI_EXIST(txn, dbi, validity) \
    ((txn) && (dbi#C3C(txn)->mt_numdbs && (((txn)->mt_dbflags[dbi] & (validity))
|| (((txn)->mt_dbflags[dbi] & DB_UNUSED) && mdb_setup_db_info((txn), (dbi),
(validity)))))


The next improvement is done in any function which needs to loop through the
databases for example in mdb_cursors_close(). Again the more databases in the
environment the longer the execution time.
It should be best if looping only through dbflags and searching for those
databases which are used (!DB_UNUSED). This could be done byte wise or more
efficient in 8/4 byte steps comparing with an extended mask DB_UNUSED_LONG
instead of DB_UNUSED. So we can skip 8 or 4 (32 bit) unused databases in one
step (still with the assumption that a transaction rarely uses all databases of
the environment).

So the loop looks as follows always starting at the lower index to avoid
alignment issues with ARM prior v6.

#define DB_UNUSED       0x20                /**< DB not used in this txn */

#ifdef MDB_VL32
#define DB_UNUSED_LONG  0x20202020          /* DB_UNUSED long mask for fast
tracking */
#else
#define DB_UNUSED_LONG  0x2020202020202020  /* DB_UNUSED long mask for fast
tracking */
#endif


#ifdef MDB_VL32
#define MDB_WORD    unsigned int
#else
#define MDB_WORD    unsigned long long
#endif



    MDB_dbi %3= src->mt_numdbs;
    MDB_dbi i = 0;

    while (1) {
        unsigned int upper = i + sizeof(MDB_WORD);
        if (upper < n) {
            // skip unused
            if ((*(MDB_WORD *)(tdbflags + i)) == DB_UNUSED_LONG) {
                i = upper;
                continue;
            }
        }
        else {
            upper = n;
        }
        
        for (; i < upper; i++) {
            // any other filter criteria appropriate to the function
            ....
        }
        if (i >= n) {
            break;
        }
    }

Comment 1 Howard Chu 2016-05-27 11:16:49 UTC

juerg.bircher@helmedica.com wrote:
> Full_Name: Juerg Bircher
> Version: lmdb (master)
> OS: MacOS / Linux
> URL: ftp://ftp.openldap.org/incoming/Juerg_Bircher_160527-Improved-handling-for-large-number-of-databases.patch
> Submission from: (NULL) (178.82.37.195)

Thanks for the patch, but for a change of this size you also need to include 
an IP Rights notice, as documented on the Contributing guidelines.
>
>
> There is a increased performance penalty the more databases are created within
> the same environment. I was looking for a way the improve that by keeping the
> simplicity of tracking databases within a list with direct access by index
> (MDB_dbi).
> mdb_dbi_open() is however not improved with the assumption that the database
> handle (dbi) is cached in the application. So mdb_dbi_open() should happen only
> once for each database during the life time of an application.
>
>
> One issue is that mdb_txn_begin() (for read-only transactions) calloc the
> sizeof(MDB_txn) + me_maxdbs * sizeof(MDB_db + 1). The plus 1 for the dbflags.
> However it is sufficient only to malloc that size and clear the sizeof(MDB_txn)
>
>      memset(txn, 0, sizeof(MDB_txn)
>
> After that the data beyond the MDB_txn is not initialized which is ok for the
> moment.
>
> The next improvement happens in mdb_txn_renew0() where the dbflags are only set
> to DB_UNUSED (a new flag) for each database currently opened in the
> environment.
>
>      memset(txn->mt_dbflags, DB_UNUSED, txn->mt_numdbs);
>
> The former code used to to loop through each database to calculate the dbflags.
> This is still done but lazily for each accessed database with the assumption
> that a read only transaction rarely uses all databases of the environment.
>
> The lazy initialization of the dbflag happens in the macro TXN_DBI_EXIST which
> is always used when a database handle (dbi) is passed to an function.
> The flags are updated in mdb_setup_db_info() once a database is access which is
> marked as unused (DB_UNUSED).
>
> static int mdb_setup_db_info(MDB_txn *txn, MDB_dbi dbi) {
>      /* Setup db info */
>      uint16_t x = txn->mt_env->me_dbflags[dbi];
>      txn->mt_dbs[dbi].md_flags = x & PERSISTENT_FLAGS;
>      txn->mt_dbflags[dbi] = (x & MDB_VALID) ? DB_VALID|DB_USRVALID|DB_STALE : 0;
>      return (txn->mt_dbflags[dbi] & validity);
> }
>
>      /** Check \b txn and \b dbi arguments to a function and initialize db info
> if needed */
> #define TXN_DBI_EXIST(txn, dbi, validity) \
>      ((txn) && (dbi#C3C(txn)->mt_numdbs && (((txn)->mt_dbflags[dbi] & (validity))
> || (((txn)->mt_dbflags[dbi] & DB_UNUSED) && mdb_setup_db_info((txn), (dbi),
> (validity)))))
>
>
> The next improvement is done in any function which needs to loop through the
> databases for example in mdb_cursors_close(). Again the more databases in the
> environment the longer the execution time.
> It should be best if looping only through dbflags and searching for those
> databases which are used (!DB_UNUSED). This could be done byte wise or more
> efficient in 8/4 byte steps comparing with an extended mask DB_UNUSED_LONG
> instead of DB_UNUSED. So we can skip 8 or 4 (32 bit) unused databases in one
> step (still with the assumption that a transaction rarely uses all databases of
> the environment).
>
> So the loop looks as follows always starting at the lower index to avoid
> alignment issues with ARM prior v6.
>
> #define DB_UNUSED       0x20                /**< DB not used in this txn */
>
> #ifdef MDB_VL32
> #define DB_UNUSED_LONG  0x20202020          /* DB_UNUSED long mask for fast
> tracking */
> #else
> #define DB_UNUSED_LONG  0x2020202020202020  /* DB_UNUSED long mask for fast
> tracking */
> #endif
>
>
> #ifdef MDB_VL32
> #define MDB_WORD    unsigned int
> #else
> #define MDB_WORD    unsigned long long
> #endif
>
>
>
>      MDB_dbi %3= src->mt_numdbs;
>      MDB_dbi i = 0;
>
>      while (1) {
>          unsigned int upper = i + sizeof(MDB_WORD);
>          if (upper < n) {
>              // skip unused
>              if ((*(MDB_WORD *)(tdbflags + i)) == DB_UNUSED_LONG) {
>                  i = upper;
>                  continue;
>              }
>          }
>          else {
>              upper = n;
>          }
>
>          for (; i < upper; i++) {
>              // any other filter criteria appropriate to the function
>              ....
>          }
>          if (i >= n) {
>              break;
>          }
>      }
>
>
>


-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Comment 2 Hallvard Furuseth 2016-05-27 12:42:00 UTC

On 27. mai 2016 12:57, juerg.bircher@helmedica.com wrote:
> mdb_dbi_open() is however not improved with the assumption that the database
> handle (dbi) is cached in the application.  So mdb_dbi_open() should happen only
> once for each database during the life time of an application.

It must check if the DB is open anyway, since LDMB breaks if the user
has two DBIs to the same DB.  Though maybe we could add an optimization
flag to skip the search, for users who know they don't reopen open DBs.

Other notes:

LMDB and OpenLDAP mostly use tab-width = indentation = 4.  (We should
comment that somewhere.)  You seem to use tab-width = 8.

I think MDB_WORD intends to be mdb_size_t, but should be 'unsigned
long': It's probably no optimization to use 'long long' when that
is not a native integer type.  But I haven't profiled that.

The delay in picking up env->me_dbflags is a bit troubling, since it
means creating a new txn is no longer quite atomic.  I can't think of
how it'll make a difference when DBIs are used correctly, but that's one
thing people screw up.  Maybe it'd be OK if renew0() just copies it into
mt_dbs[].md_flags without the '&', and mdb_setup_db_info() fixes it up.

-- 
Hallvard

Comment 3 juerg.bircher@gmail.com 2016-05-27 12:44:01 UTC

Disclaimer



The attached patch file is derived from OpenLDAP Software. All of the modifications to OpenLDAP Software represented in the following patch(es) were developed by Juerg Bircher juerg.bircher@helmedica.com. I have not assigned rights and/or interest in this work to any party.

I, Juerg Bircher, hereby place the following modifications to OpenLDAP Software (and only these modifications) into the public domain. Hence, these modifications may be freely used and/or redistributed for any purpose with or without attribution and/or other notice.



On 27/05/16 13:16, "Howard Chu" <hyc@symas.com> wrote:

>juerg.bircher@helmedica.com wrote:
>> Full_Name: Juerg Bircher
>> Version: lmdb (master)
>> OS: MacOS / Linux
>> URL: ftp://ftp.openldap.org/incoming/Juerg_Bircher_160527-Improved-handling-for-large-number-of-databases.patch
>> Submission from: (NULL) (178.82.37.195)
>
>Thanks for the patch, but for a change of this size you also need to include 
>an IP Rights notice, as documented on the Contributing guidelines.
>>
>>
>> There is a increased performance penalty the more databases are created within
>> the same environment. I was looking for a way the improve that by keeping the
>> simplicity of tracking databases within a list with direct access by index
>> (MDB_dbi).
>> mdb_dbi_open() is however not improved with the assumption that the database
>> handle (dbi) is cached in the application. So mdb_dbi_open() should happen only
>> once for each database during the life time of an application.
>>
>>
>> One issue is that mdb_txn_begin() (for read-only transactions) calloc the
>> sizeof(MDB_txn) + me_maxdbs * sizeof(MDB_db + 1). The plus 1 for the dbflags.
>> However it is sufficient only to malloc that size and clear the sizeof(MDB_txn)
>>
>>      memset(txn, 0, sizeof(MDB_txn)
>>
>> After that the data beyond the MDB_txn is not initialized which is ok for the
>> moment.
>>
>> The next improvement happens in mdb_txn_renew0() where the dbflags are only set
>> to DB_UNUSED (a new flag) for each database currently opened in the
>> environment.
>>
>>      memset(txn->mt_dbflags, DB_UNUSED, txn->mt_numdbs);
>>
>> The former code used to to loop through each database to calculate the dbflags.
>> This is still done but lazily for each accessed database with the assumption
>> that a read only transaction rarely uses all databases of the environment.
>>
>> The lazy initialization of the dbflag happens in the macro TXN_DBI_EXIST which
>> is always used when a database handle (dbi) is passed to an function.
>> The flags are updated in mdb_setup_db_info() once a database is access which is
>> marked as unused (DB_UNUSED).
>>
>> static int mdb_setup_db_info(MDB_txn *txn, MDB_dbi dbi) {
>>      /* Setup db info */
>>      uint16_t x = txn->mt_env->me_dbflags[dbi];
>>      txn->mt_dbs[dbi].md_flags = x & PERSISTENT_FLAGS;
>>      txn->mt_dbflags[dbi] = (x & MDB_VALID) ? DB_VALID|DB_USRVALID|DB_STALE : 0;
>>      return (txn->mt_dbflags[dbi] & validity);
>> }
>>
>>      /** Check \b txn and \b dbi arguments to a function and initialize db info
>> if needed */
>> #define TXN_DBI_EXIST(txn, dbi, validity) \
>>      ((txn) && (dbi#C3C(txn)->mt_numdbs && (((txn)->mt_dbflags[dbi] & (validity))
>> || (((txn)->mt_dbflags[dbi] & DB_UNUSED) && mdb_setup_db_info((txn), (dbi),
>> (validity)))))
>>
>>
>> The next improvement is done in any function which needs to loop through the
>> databases for example in mdb_cursors_close(). Again the more databases in the
>> environment the longer the execution time.
>> It should be best if looping only through dbflags and searching for those
>> databases which are used (!DB_UNUSED). This could be done byte wise or more
>> efficient in 8/4 byte steps comparing with an extended mask DB_UNUSED_LONG
>> instead of DB_UNUSED. So we can skip 8 or 4 (32 bit) unused databases in one
>> step (still with the assumption that a transaction rarely uses all databases of
>> the environment).
>>
>> So the loop looks as follows always starting at the lower index to avoid
>> alignment issues with ARM prior v6.
>>
>> #define DB_UNUSED       0x20                /**< DB not used in this txn */
>>
>> #ifdef MDB_VL32
>> #define DB_UNUSED_LONG  0x20202020          /* DB_UNUSED long mask for fast
>> tracking */
>> #else
>> #define DB_UNUSED_LONG  0x2020202020202020  /* DB_UNUSED long mask for fast
>> tracking */
>> #endif
>>
>>
>> #ifdef MDB_VL32
>> #define MDB_WORD    unsigned int
>> #else
>> #define MDB_WORD    unsigned long long
>> #endif
>>
>>
>>
>>      MDB_dbi %3= src->mt_numdbs;
>>      MDB_dbi i = 0;
>>
>>      while (1) {
>>          unsigned int upper = i + sizeof(MDB_WORD);
>>          if (upper < n) {
>>              // skip unused
>>              if ((*(MDB_WORD *)(tdbflags + i)) == DB_UNUSED_LONG) {
>>                  i = upper;
>>                  continue;
>>              }
>>          }
>>          else {
>>              upper = n;
>>          }
>>
>>          for (; i < upper; i++) {
>>              // any other filter criteria appropriate to the function
>>              ....
>>          }
>>          if (i >= n) {
>>              break;
>>          }
>>      }
>>
>>
>>
>
>
>-- 
>   -- Howard Chu
>   CTO, Symas Corp.           http://www.symas.com
>   Director, Highland Sun     http://highlandsun.com/hyc/
>   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Comment 4 juerg.bircher@gmail.com 2016-05-27 12:57:44 UTC

Here the missing disclaimer for the patch:

The attached patch file is derived from OpenLDAP Software. All of the modifications to OpenLDAP Software represented in the following patch(es) were developed by Juerg Bircher juerg.bircher@helmedica.com. I have not assigned rights and/or interest in this work to any party.

I, Juerg Bircher, hereby place the following modifications to OpenLDAP Software (and only these modifications) into the public domain. Hence, these modifications may be freely used and/or redistributed for any purpose with or without attribution and/or other notice.

Comment 5 juerg.bircher@gmail.com 2016-05-27 13:58:43 UTC

I am switching to gmail as in outlook my name uses Umlaute which
results in base64 encoded mails! Strangely the mailing list
openldap-technical can handle it and openldap-its cannot!



Here the missing disclaimer for the patch:

The attached patch file is derived from OpenLDAP Software. All of the
modifications to OpenLDAP Software represented in the following
patch(es) were developed by Juerg Bircher juerg.bircher@helmedica.com.
I have not assigned rights and/or interest in this work to any party.

I, Juerg Bircher, hereby place the following modifications to OpenLDAP
Software (and only these modifications) into the public domain. Hence,
these modifications may be freely used and/or redistributed for any
purpose with or without attribution and/or other notice.

Comment 6 juerg.bircher@gmail.com 2016-05-27 21:39:41 UTC

First of all thank you for having a look at the improvement.

On 27/05/16 14:42, "Hallvard Breien Furuseth" <h.b.furuseth@usit.uio.no> wrote:

>On 27. mai 2016 12:57, juerg.bircher@helmedica.com wrote:
>> mdb_dbi_open() is however not improved with the assumption that the database
>> handle (dbi) is cached in the application.  So mdb_dbi_open() should happen only
>> once for each database during the life time of an application.
>
>It must check if the DB is open anyway, since LDMB breaks if the user
>has two DBIs to the same DB.  Though maybe we could add an optimization
>flag to skip the search, for users who know they don't reopen open DBs.
>
>Other notes:
>
>LMDB and OpenLDAP mostly use tab-width = indentation = 4.  (We should
>comment that somewhere.)  You seem to use tab-width = 8.

I will consider that for the next patch!

>
>I think MDB_WORD intends to be mdb_size_t, but should be 'unsigned
>long': It's probably no optimization to use 'long long' when that
>is not a native integer type.  But I haven't profiled that.

The MDB_WORD is used to get the appropriate word size of the CPU for
the optimised dbflag check. So on a 32 bit CPUs it should be an 32 bit
integer and on a 64 bit CPUs a 64 bit integer. But the mdb_size_t
seems always to be 64 bits?

>
>The delay in picking up env->me_dbflags is a bit troubling, since it
>means creating a new txn is no longer quite atomic.  I can't think of
>how it'll make a difference when DBIs are used correctly, but that's one
>thing people screw up.  Maybe it'd be OK if renew0() just copies it into
>mt_dbs[].md_flags without the '&', and mdb_setup_db_info() fixes it up.

Is it really atomic in original lmdb? The flags are copied from the
environment and calculated. But while copying the dbflags from the
environment they may change. So read level consistency and therefore
atomicity is not guaranteed. But I guess it just does not matter.
So picking up the flags lazily just extends the time of copying the
flags from the environment but is basically the same.
Atomicity and also isolation is a must in respect to the data but I
think not in respect to the database handles and their flags.
Maybe I am overseeing something as I do not know all the thoughts
between the lines.

>
>--
>Hallvard

Regards
Juerg

Comment 7 Hallvard Furuseth 2016-05-28 01:28:59 UTC

On 27/05/16 23:39, juerg.bircher@gmail.com wrote:
> On 27/05/16 14:42, "Hallvard Breien Furuseth" <h.b.furuseth@usit.uio.no> wrote:
>> I think MDB_WORD intends to be mdb_size_t, but should be 'unsigned
>> long': It's probably no optimization to use 'long long' when that
>> is not a native integer type.  But I haven't profiled that.
>
> The MDB_WORD is used to get the appropriate word size of the CPU for
> the optimised dbflag check. So on a 32 bit CPUs it should be an 32 bit
> integer and on a 64 bit CPUs a 64 bit integer.

That's why I suggested unsigned long.  It seems the best guess for a
native wide type.  But it's hard to guess what kind of CPU is in use.

> But the mdb_size_t seems always to be 64 bits?

On a 32-bit host it's normally 32-bit, but you get 64-bit mdb_size_t 
anyway if you build and use LMDB with MDB_VL32 defined.  I misread
your patch, it should not use MDB_VL32.

>> The delay in picking up env->me_dbflags is a bit troubling, since it
>> means creating a new txn is no longer quite atomic.  I can't think of
>> how it'll make a difference when DBIs are used correctly, but that's one
>> thing people screw up.  Maybe it'd be OK if renew0() just copies it into
>> mt_dbs[].md_flags without the '&', and mdb_setup_db_info() fixes it up.
>
> Is it really atomic in original lmdb? The flags are copied from the
> environment and calculated. But while copying the dbflags from the
> environment they may change. So read level consistency and therefore
> atomicity is not guaranteed.

I should have said, it's atomic when DBIs are used correctly.  Such as
obeying this limitation from the mdb_dbi_open() doc and a similar one
for mdb_dbi_close():

* A transaction that uses
* this function must finish (either commit or abort) before
* any other transaction in the process may use this function.

To guarante atomicity we'd need locks around some DBI operations.
To avoid that, LMDB simply calls unsafe use "user errors" - which can
cause breakage.  Hence those annoying DBI usage restrictions.

LMDB does attempt to catch user errors when it's cheap to try, see e.g.
TXN_DBI_CHANGED().  Doesn't work at all for read-only txns, they have
txn->mt_dbiseqs == env->me_dbiseqs.

So you can see why DBI cleverness makes me nervous.
I think the MDB_WORD looping can go in now, but we'll need to think
carefully about the lazy init.

> But I guess it just does not matter.
> So picking up the flags lazily just extends the time of copying the
> flags from the environment but is basically the same.
> Atomicity and also isolation is a must in respect to the data but I
> think not in respect to the database handles and their flags.
> Maybe I am overseeing something as I do not know all the thoughts
> between the lines.

With ITS#8431 I can think of one thing:

There's no sync between mdb_dbi_open() in one txn and another txn
using the new DBI.  It'll be up to the user to synchronize DBIs.  For
that to work in write-txs, TXN_DBI_CHANGED() must go away, and writers
will have no way to try to catch user errors with sync.  Or if it
should only work in read-only txns, then read-only txns and write-txns
must have different definitions of what is a "user error" with DBIs.

Anyway, we'll need to think about the definition of user errors.

Other notes:

/** ... */ above a global declaration or /**< ... */ to the right
of it tells Doxygen that the comment is the doc of the declared item.
"#foo" inside a Doxygen comment is a link to the declaration of foo.

Maybe mt_dbflags should have an extra byte or word at the end which
does not have DB_UNUSED, and with size rounded up to sizeof(MDB_WORD).
Then stepping to next DBI can do something like this, unrelated to
the check against mt_numdbs.  Not sure if that's an improvement, I'm
getting tired:

/** Return an #MDB_WORD where each byte is \b byte */
#define WORD_BYTES(byte)	((MDB_WORD)-1 / 0xff * (byte))

/** Step up to next used DBI */
#define DBI_STEP(txn, dbi) do { \
     if (! (++(dbi) & (sizeof(mdb_word_t)-1))) \
         while (*(mdb_word_t *)((txn)->mt_dbflags + (dbi)) == 
WORD_BYTES(DB_UNUSED))
             (dbi) += sizeof(MDB_WORD); \
} while (0)

Comment 8 Hallvard Furuseth 2016-05-28 01:41:25 UTC

On 28/05/16 03:28, Hallvard Breien Furuseth wrote:
>On 27/05/16 23:39, juerg.bircher@gmail.com wrote:
>> Maybe I am overseeing something as I do not know all the thoughts
>> between the lines.
>
> With ITS#8431 I can think of one thing:
> (...blah...)

Actually the main reason I worry about lazy init is simply that
we've screwed up DBIs many times, so I prefer to go slowly.
That's Howard's call though, not mine.

Comment 9 juerg.bircher@gmail.com 2016-05-28 20:08:42 UTC

On Sat, May 28, 2016 at 3:28 AM, Hallvard Breien Furuseth
<h.b.furuseth@usit.uio.no> wrote:
> On 27/05/16 23:39, juerg.bircher@gmail.com wrote:
>>
>> On 27/05/16 14:42, "Hallvard Breien Furuseth" <h.b.furuseth@usit.uio.no>
>> wrote:
>>>
>>> I think MDB_WORD intends to be mdb_size_t, but should be 'unsigned
>>> long': It's probably no optimization to use 'long long' when that
>>> is not a native integer type.  But I haven't profiled that.
>>
>>
>> The MDB_WORD is used to get the appropriate word size of the CPU for
>> the optimised dbflag check. So on a 32 bit CPUs it should be an 32 bit
>> integer and on a 64 bit CPUs a 64 bit integer.
>
>
> That's why I suggested unsigned long.  It seems the best guess for a
> native wide type.  But it's hard to guess what kind of CPU is in use.
>
>> But the mdb_size_t seems always to be 64 bits?
>
>
> On a 32-bit host it's normally 32-bit, but you get 64-bit mdb_size_t anyway
> if you build and use LMDB with MDB_VL32 defined.  I misread
> your patch, it should not use MDB_VL32.
>
>>> The delay in picking up env->me_dbflags is a bit troubling, since it
>>> means creating a new txn is no longer quite atomic.  I can't think of
>>> how it'll make a difference when DBIs are used correctly, but that's one
>>> thing people screw up.  Maybe it'd be OK if renew0() just copies it into
>>> mt_dbs[].md_flags without the '&', and mdb_setup_db_info() fixes it up.
>>
>>
>> Is it really atomic in original lmdb? The flags are copied from the
>> environment and calculated. But while copying the dbflags from the
>> environment they may change. So read level consistency and therefore
>> atomicity is not guaranteed.
>
>
> I should have said, it's atomic when DBIs are used correctly.  Such as
> obeying this limitation from the mdb_dbi_open() doc and a similar one
> for mdb_dbi_close():
>
> * A transaction that uses
> * this function must finish (either commit or abort) before
> * any other transaction in the process may use this function.
>
> To guarante atomicity we'd need locks around some DBI operations.
> To avoid that, LMDB simply calls unsafe use "user errors" - which can
> cause breakage.  Hence those annoying DBI usage restrictions.
>
> LMDB does attempt to catch user errors when it's cheap to try, see e.g.
> TXN_DBI_CHANGED().  Doesn't work at all for read-only txns, they have
> txn->mt_dbiseqs == env->me_dbiseqs.
>
> So you can see why DBI cleverness makes me nervous.
> I think the MDB_WORD looping can go in now, but we'll need to think
> carefully about the lazy init.

I understand that it is a tricky thing. But maybe the lazy init is
still feasible? In my understand without having all the background it
seems ok to get the current snapshot of the database flags from the
env once it is firstly accessed whether it is at the beginning or
later in the transaction.
Read level consistency should be preserved anyway as mt_txnId is set
in mdb_txn_renew0().

>
>> But I guess it just does not matter.
>> So picking up the flags lazily just extends the time of copying the
>> flags from the environment but is basically the same.
>> Atomicity and also isolation is a must in respect to the data but I
>> think not in respect to the database handles and their flags.
>> Maybe I am overseeing something as I do not know all the thoughts
>> between the lines.
>
>
> With ITS#8431 I can think of one thing:
>
> There's no sync between mdb_dbi_open() in one txn and another txn
> using the new DBI.  It'll be up to the user to synchronize DBIs.  For
> that to work in write-txs, TXN_DBI_CHANGED() must go away, and writers
> will have no way to try to catch user errors with sync.  Or if it
> should only work in read-only txns, then read-only txns and write-txns
> must have different definitions of what is a "user error" with DBIs.
>
> Anyway, we'll need to think about the definition of user errors.
>

So if you drop ITS#8431 it would at least be very helpful to get an
separate error like MDB_INVALID_DPI instead of EINVAL So the
application sharing and caching handles (dbi) could handle the
situation appropriate like use mdb_env_open() so that the dbi becomes
valid in that txn as well.

>
> Other notes:
>
> /** ... */ above a global declaration or /**< ... */ to the right
> of it tells Doxygen that the comment is the doc of the declared item.
> "#foo" inside a Doxygen comment is a link to the declaration of foo.
>
>
> Maybe mt_dbflags should have an extra byte or word at the end which
> does not have DB_UNUSED, and with size rounded up to sizeof(MDB_WORD).
> Then stepping to next DBI can do something like this, unrelated to
> the check against mt_numdbs.  Not sure if that's an improvement, I'm
> getting tired:
>
> /** Return an #MDB_WORD where each byte is \b byte */
> #define WORD_BYTES(byte)        ((MDB_WORD)-1 / 0xff * (byte))
>
> /** Step up to next used DBI */
> #define DBI_STEP(txn, dbi) do { \
>     if (! (++(dbi) & (sizeof(mdb_word_t)-1))) \
>         while (*(mdb_word_t *)((txn)->mt_dbflags + (dbi)) ==
> WORD_BYTES(DB_UNUSED))
>             (dbi) += sizeof(MDB_WORD); \
> } while (0)
>

Yes round it up to sizeof(MDB_WORD) would be obvious in connection
with the optimized loop. If you are willing to add some more #ifdef we
could even get it faster at least for x86_64 using SSE2 instructions.
Then MDB_WORD would be 128 bit and we check 16 flags in one step using
_mm_cmpeq_epi8 and _mm_movemask_epi8 to get the mask of matching
flags. If the mask != 0xffffffff we found an used database. Then we
negate the mask and with __builtin_ffs(mask) we could then find the
index of the first used database.

Comment 10 juerg.bircher@gmail.com 2016-05-28 20:35:44 UTC

---------- Forwarded message ----------
From: Juerg Bircher <juerg.bircher@gmail.com>
Date: Sat, May 28, 2016 at 10:30 PM
Subject: Re: (ITS#8430) Improved handling for large number of databases
To: Hallvard Breien Furuseth <h.b.furuseth@usit.uio.no>

On Sat, May 28, 2016 at 3:41 AM, Hallvard Breien Furuseth
<h.b.furuseth@usit.uio.no> wrote:
> On 28/05/16 03:28, Hallvard Breien Furuseth wrote:
>>
>> On 27/05/16 23:39, juerg.bircher@gmail.com wrote:
>>>
>>> Maybe I am overseeing something as I do not know all the thoughts
>>> between the lines.
>>
>>
>> With ITS#8431 I can think of one thing:
>> (...blah...)
>
>
> Actually the main reason I worry about lazy init is simply that
> we've screwed up DBIs many times, so I prefer to go slowly.
> That's Howard's call though, not mine.

Maybe a good way in between could be to memcpy
txn->mt_env->me_dbflags. However md_flags would need to be moved out
of the MDB_dbi struct like mt_dbflags or easier just add another array
serving as snapshot. This would not break anything with flag
initialization as we get a snapshot of them in the mdb_txn_renew0.
memcpy uses SSE instructions for x86_64. We should consider to align
on 16 byte boundary for best performance.
The dbflags could then still be initialized with MDB_UNSUSED as suggested.

Comment 11 Hallvard Furuseth 2016-05-30 11:51:13 UTC

On 28/05/16 22:08, Juerg Bircher wrote:
> I understand that it is a tricky thing. But maybe the lazy init is
> still feasible? (...)

Right, lazy init is feasible.  Just keep cross-txn DBI stuff out of
it, I'll take your latest variant to ITS#8431.

Copied from a private mail:
> Maybe a good way in between could be to memcpy txn->mt_env->me_dbflags.
> (...) memcpy uses SSE instructions for x86_64.

Yes. If we make MDB_txn.mt_dbflags an uint16_t*, we can memcpy to it.
That costs <maxdbs> bytes/txn + one more reserved env flag bit in
PERSISTENT_FLAGS: Reserve DB_UNUSED, and keep it always set in
me_dbflags so the memcpy will set it in mt_dbflags.

Or maybe your code can be used without initial memcpy and I'm just
paranoid.  The one problem I can think of is user errors - people
coming up with workarounds like your cross-txn DBI stuff.  LMDB tries
to catch some of that, but maybe late pickup defeats some such checks.

I'll wait for Howard to say something at this point.

BTW, all this reminds me - I should add a test which tries to catch
mdb_dbi_open() in overlapping txns.  Got the code, but forgot about
it while pondering a good error code name.

> However md_flags would need to be moved out of the MDB_dbi struct

Nope.  Sub-DBs with duplicate data have md_flags which do not
correspond to anything saved in env->me_flags.  Your lazy-init code
needs to use the memcpyed flags instead of me_dbflags, that's all.

> We should consider to align on 16 byte boundary for best performance.

It's already aligned.

Comment 12 Hallvard Furuseth 2016-05-30 15:08:26 UTC

On 30. mai 2016 13:51, Hallvard Breien Furuseth wrote:
> Copied from a private mail:
>> Maybe a good way in between could be to memcpy txn->mt_env->me_dbflags.
>> (...) memcpy uses SSE instructions for x86_64.
>
> Yes. If we make MDB_txn.mt_dbflags an uint16_t*, we can memcpy to it.
> That costs <maxdbs> bytes/txn + one more reserved env flag bit in
> PERSISTENT_FLAGS: Reserve DB_UNUSED, and keep it always set in
> me_dbflags so the memcpy will set it in mt_dbflags.

Oops, that was too clever.  Then a bunch of code which assumes a DBI
is valid if various txn flags are set, must also check DB_UNUSED.
Or invert that to DB_USED which might be easier - (flags & DB_DIRTY)
becomes F_ISSET(flags, DB_DIRTY|DB_USED) which checks if both are set.

Maybe the simple solution using 2*<maxdbs> bytes/txn would be better.

-- 
Hallvard