Re: (ITS#7842) mdb readers/writer exclusion protocol, as implemented, is racy.
- To: openldap-its@OpenLDAP.org
- Subject: Re: (ITS#7842) mdb readers/writer exclusion protocol, as implemented, is racy.
- From: rsbx@acm.org
- Date: Sat, 17 May 2014 13:49:02 GMT
- Auto-submitted: auto-generated (OpenLDAP-ITS)
On 05/16/2014 08:09 PM, Howard Chu wrote:
> rsbx@acm.org wrote:
>> Full_Name: Raymond S Brand
>> Version: mdb.master
>> OS: linux
>> URL: ftp://ftp.openldap.org/incoming/
>> Submission from: (NULL) (107.145.137.13)
>>
>>
>> [I would have uploaded this to ftp.openldap.org but there was no
>> space.]
>>
>> As implemented, the mdb readers/writer exclusion protocol has a race
>> condition that could result in a writer reclaiming and over-writing
>> pages still in use by a reader.
>
> Yes, you're basically correct. But this is unlikely in practice because
> the reader thread has no blocking calls in its codepath, while the
> writer must acquire locks etc. Generally the only way for two write txns
> to complete while a reader thread is stalled is if you explicitly send a
> SIGSTOP to the reader thread while it's in its critical section.
Or just unlucky process scheduling.
>
>>
>> Pseudocode of the snapshot locking protocol for reader and writer
>> transactions. For the purposes of this analysis, each labeled section
>> can be assumed to execute atomically.
>>
>> READER
>> ======
>>
>> *R1* t0 = meta[0].txid; t1 = meta[1].txid
>>      if (t1 > t0)
>>          t0 = t1
>>
>> *R2* readers[me].txid = t0
>>
>> *R3* snapshot_root = meta[t0&1].snapshot
>>
>> *R4* /* lookup data */
>>
>> *R5* readers[me].txid = -1 // Release snapshot
>>
>> WRITER
>> ======
>>
>> *W1* lock(writer, EXCLUSIVE)
>>      curr_txid = meta[0].txid
>>      if (meta[1].txid > curr_txid)
>>          curr_txid = meta[1].txid
>>
>> *W2* oldest = curr_txid
>>      for (i=0; i<reader_slots; i++)
>>          t = readers[i].txid
>>          if (t != -1 && t < oldest)
>>              oldest = t
>>
>> *W3* reclaim_old_pages(oldest)
>>
>> /*
>> ** Commit new transaction
>> */
>>
>> *W4* unlock(writer)
>>
>> ---------------------------------------------------------------------------
>>
>>
>> Adversarial scheduling analysis:
>>
>> The following timeline demonstrates that a writer can reclaim pages
>> in use by a reader.
>>
>> T0   T1   T2   T3   T4      T5   T6   T7   T8   T9   T10
>> R1                                    R2   R3        R4
>>      W1   W2   W3   W4   *  W1   W2             W3
>>
>> T0     Reader has found the latest txid, Xn.
>>
>> T1->T4 Reader is not scheduled to run and a write transaction
>>        commits, Xn+1.
>>
>> T5     Reader is still not scheduled and another write transaction
>>        starts.
>>
>> T6     Writer finds that the oldest referenced transaction is the
>>        last committed transaction, Xn+1.
>>
>> T7     Reader records Xn in the reader table.
>>
>> T8     Reader gets the snapshot root page for transaction Xn.
>>
>> T9     Writer, believing that Xn+1 is the oldest referenced
>>        transaction, reclaims the pages of transaction Xn.
>>
>> T10    Reader attempts to navigate pages of a transaction, Xn, that
>>        has been reclaimed, and "Bad Things Happen".
>>
>> In fact, bad things will happen if an even number of write
>> transactions commit between W4 and W1, represented by the '*' in the
>> timeline above.
>>
>> The fix is to repeat the search for the highest txid (as in R1)
>> between R2 and R3, retrying until the txid published in R2 is still
>> the newest. Optionally, the reader can then record the txid of the
>> snapshot root found in R3 in its reader table slot, which may
>> increase the number of reclaimable transactions.
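
Sketched in C, that reader-side retry might look like this (illustrative names and a C11 fence, not the real mdb.c identifiers or its portability layer):

```c
#include <stdatomic.h>
#include <stdint.h>

typedef uint64_t txnid_t;

/* Stand-ins for the shared state in the pseudocode. */
struct meta_slot   { volatile txnid_t txid; void *snapshot; };
struct reader_slot { volatile txnid_t txid; };

static int rootA, rootB;
static struct meta_slot   meta[2] = { { 10, &rootA }, { 9, &rootB } };
static struct reader_slot readers[4];
static int me;

void *reader_pin_snapshot(void)
{
    txnid_t t;
    for (;;) {
        /* R1: find the newest committed txid */
        t = meta[0].txid;
        if (meta[1].txid > t)
            t = meta[1].txid;

        /* R2: publish it so a writer's W2 scan will see us */
        readers[me].txid = t;
        atomic_thread_fence(memory_order_seq_cst);

        /* Re-check between R2 and R3: if a writer committed while we
         * were publishing, it may already have scanned the table, so
         * loop and pick up the newer txid instead. */
        if (meta[0].txid <= t && meta[1].txid <= t)
            break;
    }
    /* R3: txid t is now pinned; safe to fetch its snapshot root */
    return meta[t & 1].snapshot;
}
```

The re-check closes the window: once the published txid is re-confirmed as the newest, any writer that starts afterwards must observe the reader's slot in W2 before reclaiming anything.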
>>
>> The lack of compiler and memory barriers in the implementation of
>> the locking protocol is also of concern.
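
On the barrier point: `volatile` (as used in the patch below) only constrains the compiler, not CPU reordering. A sketch of what C11 atomics would express for the reader-slot txid (hypothetical names; the real code targets pre-C11 compilers and would use platform barrier macros instead):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical reader-slot field, as a C11 atomic instead of volatile. */
static _Atomic uint64_t mr_txnid;

void reader_publish(uint64_t t)
{
    /* release: all prior writes become visible before this store */
    atomic_store_explicit(&mr_txnid, t, memory_order_release);
}

uint64_t writer_observe(void)
{
    /* acquire: pairs with the release store above */
    return atomic_load_explicit(&mr_txnid, memory_order_acquire);
}
```

Note that the publish-then-re-check pattern suggested above actually needs a full (seq_cst) fence, since it must order the reader's store before its subsequent loads of the meta pages; release/acquire alone does not give store-load ordering.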
>>
>> Beyond the above, the code in mdb_txn_renew0() after the
>> "/* Copy the DB info and flags */" comment appears to have a number
>> of data races.
>>
>> ---
>> libraries/liblmdb/mdb.c | 24 +++++++++++++++---------
>> 1 file changed, 15 insertions(+), 9 deletions(-)
>>
>> diff --git a/libraries/liblmdb/mdb.c b/libraries/liblmdb/mdb.c
>> index e0f551e..908417c 100644
>> --- a/libraries/liblmdb/mdb.c
>> +++ b/libraries/liblmdb/mdb.c
>> @@ -533,11 +533,11 @@ typedef struct MDB_rxbody {
>> * started from so we can avoid overwriting any data used in that
>> * particular version.
>> */
>> - txnid_t mrb_txnid;
>> + volatile txnid_t mrb_txnid;
>> /** The process ID of the process owning this reader txn. */
>> - MDB_PID_T mrb_pid;
>> + volatile MDB_PID_T mrb_pid;
>> /** The thread ID of the thread owning this txn. */
>> - pthread_t mrb_tid;
>> + volatile pthread_t mrb_tid;
>> } MDB_rxbody;
>>
>> /** The actual reader record, with cacheline padding. */
>> @@ -585,12 +585,12 @@ typedef struct MDB_txbody {
>> * This is recorded here only for convenience; the value can always
>> * be determined by reading the main database meta pages.
>> */
>> - txnid_t mtb_txnid;
>> + volatile txnid_t mtb_txnid;
>> /** The number of slots that have been used in the reader table.
>> * This always records the maximum count, it is not decremented
>> * when readers release their slots.
>> */
>> - unsigned mtb_numreaders;
>> + volatile unsigned mtb_numreaders;
>> } MDB_txbody;
>>
>> /** The actual reader table definition. */
>> @@ -854,7 +854,7 @@ typedef struct MDB_meta {
>> /** Any persistent environment flags. @ref mdb_env */
>> #define mm_flags mm_dbs[0].md_flags
>> pgno_t mm_last_pg; /**< last used page in file */
>> - txnid_t mm_txnid; /**< txnid that committed this page */
>> + volatile txnid_t mm_txnid; /**< txnid that committed this page */
>> } MDB_meta;
>>
>> /** Buffer for a stack-allocated meta page.
>> @@ -2303,10 +2303,12 @@ mdb_txn_renew0(MDB_txn *txn)
>>
>> if (txn->mt_flags & MDB_TXN_RDONLY) {
>> if (!ti) {
>> + /* No readers table; app responsible for locking */
>> meta = env->me_metas[ mdb_env_pick_meta(env) ];
>> txn->mt_txnid = meta->mm_txnid;
>> txn->mt_u.reader = NULL;
>> } else {
>> + /* Has readers table */
>> MDB_reader *r = (env->me_flags & MDB_NOTLS) ?
>> txn->mt_u.reader :
>> pthread_getspecific(env->me_txkey);
>> if (r) {
>> @@ -2347,8 +2349,12 @@ mdb_txn_renew0(MDB_txn *txn)
>> return rc;
>> }
>> }
>> - txn->mt_txnid = r->mr_txnid = ti->mti_txnid;
>> txn->mt_u.reader = r;
>> + r->mr_txnid = ti->mti_txnid;
>> +
>> + /* Really need a memory barrier here */
>> +
>> + txn->mt_txnid = r->mr_txnid = ti->mti_txnid;
>> meta = env->me_metas[txn->mt_txnid & 1];
>> }
>> } else {
>> @@ -2376,10 +2382,10 @@ mdb_txn_renew0(MDB_txn *txn)
>> }
>>
>> /* Copy the DB info and flags */
>> - memcpy(txn->mt_dbs, meta->mm_dbs, 2 * sizeof(MDB_db));
>> + memcpy(txn->mt_dbs, meta->mm_dbs, 2 * sizeof(MDB_db)); /* Racy */
>>
>> /* Moved to here to avoid a data race in read TXNs */
>> - txn->mt_next_pgno = meta->mm_last_pg+1;
>> + txn->mt_next_pgno = meta->mm_last_pg+1; /* Racy */
>>
>> for (i=2; i<txn->mt_numdbs; i++) {
>> x = env->me_dbflags[i];
>>
>
>