Issue 7842 - mdb readers/writer exclusion protocol, as implemented, is racy.
Summary: mdb readers/writer exclusion protocol, as implemented, is racy.
Status: VERIFIED DUPLICATE of issue 7969
Alias: None
Product: LMDB
Classification: Unclassified
Component: liblmdb (show other issues)
Version: unspecified
Hardware: All
OS: All
Importance: --- normal
Target Milestone: ---
Assignee: OpenLDAP project
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-04-22 18:02 UTC by rsbx@acm.org
Modified: 2021-06-02 15:03 UTC (History)

See Also:


Description rsbx@acm.org 2014-04-22 18:02:58 UTC
Full_Name: Raymond S Brand
Version: mdb.master
OS: linux
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (107.145.137.13)


[I would have uploaded this to the ftp.openldap.org but there was no space.]

As implemented, the mdb readers/writer exclusion protocol has a race condition
that could result in a writer reclaiming and over-writing pages still in use
by a reader.

Pseudo code of the snapshot locking protocols of the reader and writer
transactions follows. For the purposes of this analysis, the labeled sections
can be assumed to execute atomically.

READER
======

*R1*    t0 = meta[0].txid;  t1 = meta[1].txid
        if (t1 > t0)
                t0 = t1

*R2*    readers[me].txid = t0

*R3*    snapshot_root = meta[t0&1].snapshot

*R4*    /* lookup data */

*R5*    readers[me].txid = -1           // Release snapshot

WRITER
======

*W1*    lock(writer, EXCLUSIVE)
        curr_txid = meta[0].txid
        if (meta[1].txid > curr_txid)
                curr_txid = meta[1].txid

*W2*    oldest = curr_txid
        for (i=0; i<reader_slots; i++)
                t = readers[i].txid
                if (t != -1 && t < oldest)
                        oldest = t

*W3*    reclaim_old_pages(oldest)

        /*
        ** Commit new transaction
        */

*W4*    unlock(writer)

---------------------------------------------------------------------------

Adversarial scheduling analysis:

The following timeline demonstrates that a writer can reclaim pages in use by a
reader.

T0  T1  T2  T3  T4     T5  T6  T7  T8  T9  T10
R1                             R2  R3      R4
    W1  W2  W3  W4  *  W1  W2          W3

T0      Reader has found the latest txid, Xn.

T1->T4  Reader is not scheduled to run and a write transaction commits, Xn+1.

T5      Reader is still not scheduled and another write transaction starts.

T6      Writer finds that the oldest referenced transaction is the last
        committed transaction, Xn+1.

T7      Reader records Xn in the reader table.

T8      Reader gets the snapshot root page for transaction Xn.

T9      Writer, believing that Xn+1 is the oldest referenced transaction,
        reclaims the pages of transaction Xn.


T10     Reader attempts to navigate pages of a transaction, Xn, that has been
        reclaimed and "Bad Things Happen".

In fact, bad things will happen whenever an even number of write transactions
commit between W4 and W1 (the interval marked '*' in the timeline above).
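The parity arithmetic behind this observation is easy to check: a transaction
with id t commits into meta[t & 1], so every second commit reuses the meta
slot a stalled reader selected at R1. A minimal illustration (hypothetical
helper, not code from mdb.c):

```c
#include <stdint.h>

/* The meta pages are double-buffered: the transaction with id t commits
 * into meta[t & 1].  An even number of further commits therefore rewrites
 * the very slot the stalled reader is about to dereference at R3. */
static int meta_slot(int64_t txid)
{
    return (int)(txid & 1);
}
```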

The fix is to search for the highest txid in R1 and again between R2 and R3.
Optionally, this can be followed by recording the txid of the snapshot root
found in R3 in the reader's reader-table slot, possibly increasing the number
of reclaimable transactions.
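A minimal sketch of that fix, using simplified stand-in structures rather
than the actual mdb types (reader_begin, meta_page, and reader_slot are
hypothetical names mirroring the pseudocode above):

```c
#include <stdint.h>

typedef int64_t txnid_t;

/* Simplified stand-ins for the real mdb structures. */
struct meta_page   { txnid_t txid; int snapshot_root; };
struct reader_slot { txnid_t txid; };

static struct meta_page   meta[2];
static struct reader_slot readers[4];

/* Fixed reader handshake: publish the candidate txid (R2), then re-read
 * the meta pages; if a writer advanced them in the R2..R3 window, retry
 * with the newer txid.  Once the value is stable, the published slot
 * already protects the snapshot the reader is about to dereference. */
static int reader_begin(int me, txnid_t *out_txid)
{
    for (;;) {
        txnid_t t0 = meta[0].txid, t1 = meta[1].txid;   /* R1 */
        if (t1 > t0)
            t0 = t1;
        readers[me].txid = t0;                          /* R2 */
        /* a memory barrier belongs here in real code */
        txnid_t c0 = meta[0].txid, c1 = meta[1].txid;   /* re-check */
        if ((c1 > c0 ? c1 : c0) == t0) {
            *out_txid = t0;
            return meta[(int)(t0 & 1)].snapshot_root;   /* R3 */
        }
        /* a writer committed meanwhile; pick up the newer snapshot */
    }
}
```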

The lack of compiler and memory barriers in the implementation of the locking
protocol is also of concern.
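For illustration, the reader-slot publication could be expressed with C11
atomics (a sketch only; the patch below instead adds `volatile`, which
constrains the compiler but is not a CPU memory barrier):

```c
#include <stdatomic.h>
#include <stdint.h>

typedef int64_t txnid_t;

/* Hypothetical reader slot and meta txids rebuilt on C11 atomics. */
struct reader_slot { _Atomic txnid_t txid; };

static _Atomic txnid_t meta_txnid[2];
static struct reader_slot slots[16];

static txnid_t latest_meta(void)
{
    txnid_t t0 = atomic_load_explicit(&meta_txnid[0], memory_order_acquire);
    txnid_t t1 = atomic_load_explicit(&meta_txnid[1], memory_order_acquire);
    return t1 > t0 ? t1 : t0;
}

/* Publish this reader's snapshot txid so the store is globally visible
 * before the reader touches any pages; the seq_cst store plus fence
 * provides the compiler and CPU ordering the report asks for. */
static void publish_reader(int me, txnid_t t)
{
    atomic_store_explicit(&slots[me].txid, t, memory_order_seq_cst);
    atomic_thread_fence(memory_order_seq_cst);
}
```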

Beyond the above, the code in mdb_txn_renew0() after the
        "/* Copy the DB info and flags */"
comment appears to have a number of data races.

---
 libraries/liblmdb/mdb.c | 24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/libraries/liblmdb/mdb.c b/libraries/liblmdb/mdb.c
index e0f551e..908417c 100644
--- a/libraries/liblmdb/mdb.c
+++ b/libraries/liblmdb/mdb.c
@@ -533,11 +533,11 @@ typedef struct MDB_rxbody {
         *      started from so we can avoid overwriting any data used in that
         *      particular version.
         */
-       txnid_t         mrb_txnid;
+       volatile txnid_t mrb_txnid;
        /** The process ID of the process owning this reader txn. */
-       MDB_PID_T       mrb_pid;
+       volatile MDB_PID_T mrb_pid;
        /** The thread ID of the thread owning this txn. */
-       pthread_t       mrb_tid;
+       volatile pthread_t mrb_tid;
 } MDB_rxbody;
 
        /** The actual reader record, with cacheline padding. */
@@ -585,12 +585,12 @@ typedef struct MDB_txbody {
                 *      This is recorded here only for convenience; the value can always
                 *      be determined by reading the main database meta pages.
                 */
-       txnid_t         mtb_txnid;
+       volatile txnid_t mtb_txnid;
                /** The number of slots that have been used in the reader table.
                 *      This always records the maximum count, it is not decremented
                 *      when readers release their slots.
                 */
-       unsigned        mtb_numreaders;
+       volatile unsigned mtb_numreaders;
 } MDB_txbody;
 
        /** The actual reader table definition. */
@@ -854,7 +854,7 @@ typedef struct MDB_meta {
        /** Any persistent environment flags. @ref mdb_env */
 #define        mm_flags        mm_dbs[0].md_flags
        pgno_t          mm_last_pg;                     /**< last used page in file */
-       txnid_t         mm_txnid;                       /**< txnid that committed this page */
+       volatile txnid_t mm_txnid;              /**< txnid that committed this page */
 } MDB_meta;
 
        /** Buffer for a stack-allocated meta page.
@@ -2303,10 +2303,12 @@ mdb_txn_renew0(MDB_txn *txn)
 
        if (txn->mt_flags & MDB_TXN_RDONLY) {
                if (!ti) {
+                       /* No readers table; app responsible for locking */
                        meta = env->me_metas[ mdb_env_pick_meta(env) ];
                        txn->mt_txnid = meta->mm_txnid;
                        txn->mt_u.reader = NULL;
                } else {
+                       /* Has readers table */
                        MDB_reader *r = (env->me_flags & MDB_NOTLS) ? txn->mt_u.reader :
                                pthread_getspecific(env->me_txkey);
                        if (r) {
@@ -2347,8 +2349,12 @@ mdb_txn_renew0(MDB_txn *txn)
                                        return rc;
                                }
                        }
-                       txn->mt_txnid = r->mr_txnid = ti->mti_txnid;
                        txn->mt_u.reader = r;
+                       r->mr_txnid = ti->mti_txnid;
+
+                       /* Really need a memory barrier here */
+
+                       txn->mt_txnid = r->mr_txnid = ti->mti_txnid;
                        meta = env->me_metas[txn->mt_txnid & 1];
                }
        } else {
@@ -2376,10 +2382,10 @@ mdb_txn_renew0(MDB_txn *txn)
        }
 
        /* Copy the DB info and flags */
-       memcpy(txn->mt_dbs, meta->mm_dbs, 2 * sizeof(MDB_db));
+       memcpy(txn->mt_dbs, meta->mm_dbs, 2 * sizeof(MDB_db));  /* Racy */
 
        /* Moved to here to avoid a data race in read TXNs */
-       txn->mt_next_pgno = meta->mm_last_pg+1;
+       txn->mt_next_pgno = meta->mm_last_pg+1;                 /* Racy */
 
        for (i=2; i<txn->mt_numdbs; i++) {
                x = env->me_dbflags[i];
-- 
1.9.0.138.g2de3478
Comment 1 Howard Chu 2014-05-17 00:09:01 UTC
rsbx@acm.org wrote:
> Full_Name: Raymond S Brand
> Version: mdb.master
> OS: linux
> URL: ftp://ftp.openldap.org/incoming/
> Submission from: (NULL) (107.145.137.13)
>
>
> [I would have uploaded this to the ftp.openldap.org but there was no space.]
>
> As implemented, the mdb readers/writer exclusion protocol has a race condition
> that could result in a writer reclaiming and over-writing pages still in use
> by a reader.

Yes, you're basically correct. But this is unlikely in practice because the 
reader thread has no blocking calls in its codepath, while the writer must 
acquire locks etc. Generally the only way for two write txns to complete while 
a reader thread is stalled is if you explicitly send a SIGSTOP to the reader 
thread while it's in its critical section.

>
> Pseudo code of the snapshot locking protocols of reader and writer
> transactions. The labeled sections, for the purposes of this analysis, can be
> assumed to execute atomically.
>
> READER
> ======
>
> *R1*    t0 = meta[0].txid;  t1 = meta[1].txid
>          if (t1 > t0)
>                  t0 = t1
>
> *R2*    readers[me].txid = t0
>
> *R3*    snapshot_root = meta[t0&1].snapshot
>
> *R4*    /* lookup data */
>
> *R5*    readers[me].txid = -1           // Release snapshot
>
> WRITER
> ======
>
> *W1*    lock(writer, EXCLUSIVE)
>          curr_txid = meta[0].txid
>          if (meta[1].txid > curr_txid)
>                  curr_txid = meta[1].txid
>
> *W2*    oldest = curr_txid
>          for (i=0; i<reader_slots; i++)
>                  t = readers[i].txid
>                  if (t != -1 && t < oldest)
>                          oldest = t
>
> *W3*    reclaim_old_pages(oldest)
>
>          /*
>          ** Commit new transaction
>          */
>
> *W4*    unlock(writer)
>
> ---------------------------------------------------------------------------
>
> Adversarial scheduling analysis:
>
> The following timeline demonstrates that a writer can reclaim pages in use by a
> reader.
>
> T0  T1  T2  T3  T4     T5  T6  T7  T8  T9  T10
> R1                             R2  R3      R4
>      W1  W2  W3  W4  *  W1  W2          W3
>
> T0      Reader has found the latest txid, Xn.
>
> T1->T4  Reader is not scheduled to run and a write transaction commits, Xn+1.
>
> T5      Reader is still not scheduled and another write transaction starts.
>
> T6      Writer finds that the oldest referenced transaction is the last
>          committed transaction, Xn+1.
>
> T7      Reader records Xn in the reader table.
>
> T8      Reader gets the snapshot root page for transaction Xn.
>
> T9      Writer, believing that Xn+1 is the oldest reference transaction,
>          reclaims the pages of transaction Xn.
>
>
> T10     Reader attempts to navigate pages of a transaction, Xn, that has been
>          reclaimed and "Bad Things Happen".
>
> In fact, bad things will happen if an even number of write transactions commit
> between W4 and W1, represented by the '*', in the timeline above.
>
> The fix is to search for the highest txid in R1 and between R2 and R3.
> Optionally, followed by recording the txid, of the snapshot root found in R3,
> in the reader's reader table slot to, possibly, increase the number of
> reclaimable transactions.
>
> The lack of compiler and memory barriers in the implementation of the locking
> protocol is also of concern.
>
> Beyond the above, the code in mdb_txn_renew0() after the
>          "/* Copy the DB info and flags */"
> comment appears to have a number of data races.
>
> ---
>   libraries/liblmdb/mdb.c | 24 +++++++++++++++---------
>   1 file changed, 15 insertions(+), 9 deletions(-)
>
> diff --git a/libraries/liblmdb/mdb.c b/libraries/liblmdb/mdb.c
> index e0f551e..908417c 100644
> --- a/libraries/liblmdb/mdb.c
> +++ b/libraries/liblmdb/mdb.c
> @@ -533,11 +533,11 @@ typedef struct MDB_rxbody {
>           *      started from so we can avoid overwriting any data used in that
>           *      particular version.
>           */
> -       txnid_t         mrb_txnid;
> +       volatile txnid_t mrb_txnid;
>          /** The process ID of the process owning this reader txn. */
> -       MDB_PID_T       mrb_pid;
> +       volatile MDB_PID_T mrb_pid;
>          /** The thread ID of the thread owning this txn. */
> -       pthread_t       mrb_tid;
> +       volatile pthread_t mrb_tid;
>   } MDB_rxbody;
>
>          /** The actual reader record, with cacheline padding. */
> @@ -585,12 +585,12 @@ typedef struct MDB_txbody {
>                   *      This is recorded here only for convenience; the value
> can always
>                   *      be determined by reading the main database meta pages.
>                   */
> -       txnid_t         mtb_txnid;
> +       volatile txnid_t mtb_txnid;
>                  /** The number of slots that have been used in the reader
> table.
>                   *      This always records the maximum count, it is not
> decremented
>                   *      when readers release their slots.
>                   */
> -       unsigned        mtb_numreaders;
> +       volatile unsigned mtb_numreaders;
>   } MDB_txbody;
>
>          /** The actual reader table definition. */
> @@ -854,7 +854,7 @@ typedef struct MDB_meta {
>          /** Any persistent environment flags. @ref mdb_env */
>   #define        mm_flags        mm_dbs[0].md_flags
>          pgno_t          mm_last_pg;                     /**< last used page in
> file */
> -       txnid_t         mm_txnid;                       /**< txnid that
> committed this page */
> +       volatile txnid_t mm_txnid;              /**< txnid that committed this
> page */
>   } MDB_meta;
>
>          /** Buffer for a stack-allocated meta page.
> @@ -2303,10 +2303,12 @@ mdb_txn_renew0(MDB_txn *txn)
>
>          if (txn->mt_flags & MDB_TXN_RDONLY) {
>                  if (!ti) {
> +                       /* No readers table; app responsible for locking */
>                          meta = env->me_metas[ mdb_env_pick_meta(env) ];
>                          txn->mt_txnid = meta->mm_txnid;
>                          txn->mt_u.reader = NULL;
>                  } else {
> +                       /* Has readers table */
>                          MDB_reader *r = (env->me_flags & MDB_NOTLS) ?
> txn->mt_u.reader :
>                                  pthread_getspecific(env->me_txkey);
>                          if (r) {
> @@ -2347,8 +2349,12 @@ mdb_txn_renew0(MDB_txn *txn)
>                                          return rc;
>                                  }
>                          }
> -                       txn->mt_txnid = r->mr_txnid = ti->mti_txnid;
>                          txn->mt_u.reader = r;
> +                       r->mr_txnid = ti->mti_txnid;
> +
> +                       /* Really need a memory barrier here */
> +
> +                       txn->mt_txnid = r->mr_txnid = ti->mti_txnid;
>                          meta = env->me_metas[txn->mt_txnid & 1];
>                  }
>          } else {
> @@ -2376,10 +2382,10 @@ mdb_txn_renew0(MDB_txn *txn)
>          }
>
>          /* Copy the DB info and flags */
> -       memcpy(txn->mt_dbs, meta->mm_dbs, 2 * sizeof(MDB_db));
> +       memcpy(txn->mt_dbs, meta->mm_dbs, 2 * sizeof(MDB_db));  /* Racy */
>
>          /* Moved to here to avoid a data race in read TXNs */
> -       txn->mt_next_pgno = meta->mm_last_pg+1;
> +       txn->mt_next_pgno = meta->mm_last_pg+1;                 /* Racy */
>
>          for (i=2; i<txn->mt_numdbs; i++) {
>                  x = env->me_dbflags[i];
>


-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Comment 2 rsbx@acm.org 2014-05-17 13:48:26 UTC
On 05/16/2014 08:09 PM, Howard Chu wrote:
> rsbx@acm.org wrote:
>> Full_Name: Raymond S Brand
>> Version: mdb.master
>> OS: linux
>> URL: ftp://ftp.openldap.org/incoming/
>> Submission from: (NULL) (107.145.137.13)
>>
>>
>> [I would have uploaded this to the ftp.openldap.org but there was no
>> space.]
>>
>> As implemented, the mdb readers/writer exclusion protocol has a race
>> condition
>> that could result in a writer reclaiming and over-writing pages still
>> in use
>> by a reader.
>
> Yes, you're basically correct. But this is unlikely in practice because
> the reader thread has no blocking calls in its codepath, while the
> writer must acquire locks etc. Generally the only way for two write txns
> to complete while a reader thread is stalled is if you explicitly send a
> SIGSTOP to the reader thread while it's in its critical section.

Or just unlucky process scheduling.


Comment 3 Howard Chu 2021-06-02 12:53:59 UTC

*** This issue has been marked as a duplicate of issue 7969 ***