8209 – Broken MDB_CP_COMPACT threading

Issue 8209 - Broken MDB_CP_COMPACT threading

Summary: Broken MDB_CP_COMPACT threading

Status:	RESOLVED TEST

Alias:	None

Product:	LMDB
Classification:	Unclassified
Component:	liblmdb (show other issues)
Version:	0.9.15
Hardware:	All All

Importance:	--- normal
Target Milestone:	1.0.0
Assignee:	OpenLDAP project

URL:
Keywords:

Depends on:
Blocks:

Reported:	2015-07-29 19:25 UTC by Hallvard Furuseth
Modified:	2020-03-16 22:49 UTC (History)
CC List:	0 users

See Also:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this issue.

Description Hallvard Furuseth 2015-07-29 19:25:59 UTC

Full_Name: Hallvard B Furuseth
Version: LMDB_0.9.15
OS: 
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (81.191.45.5)
Submitted by: hallvard


mdb_env_copyfd1() and mdb_env_copythr() synchronize via my->mc_new,
but incorrectly.

copythr() starts before there is any data to consume, and copyfd1()
does not tell it (by setting mc_new and signalling) when there is
data.  Instead it sets mc_new before it has finished producing the
data.  I suppose copythr just waits for a spurious wakeup, I haven't
looked closely.  It got it right originally when copyfd1 did not start
the new thread until data (metapages) was ready.  Or copyfd1 could
start with mc_new = 0, and set it and signal when data is ready.

copythr holds the mutex while writing.  That's not how conds are
supposed to be used.  Threads are supposed to grab the mutex briefly
around code which operates on mc_new.

mc_status can be lost.  Copyfd1 does not always use it, or it discards
it.  And copythr should set mc_new=0 before exiting also when setting
mc_status. so copyfd1 won't just sit and wait.

Comment 1 Hallvard Furuseth 2015-07-29 19:31:05 UTC

On 29/07/15 21:25, h.b.furuseth@usit.uio.no wrote:
>  (...)And copythr should set mc_new=0 before exiting also when setting
> mc_status. so copyfd1 won't just sit and wait.

I forgot it clears mc_new before setting mc_status.  It's if it didn't
keep the mutex it would have to set mc_new afterwards (after locking
the mutex).

Comment 2 Hallvard Furuseth 2015-08-03 13:03:05 UTC

moved from Incoming to Software Bugs

Comment 3 Hallvard Furuseth 2015-08-03 13:06:53 UTC

Okay, that was one confused ITS in response to equally
confused code.  Trying again.   Draft fixes in my updated
branch "mdb/copyfd1" (non-fast-forward).


It's broken if mdb_env_copythr() has a write error:

* copyfd1 can (A) overrun buffer mc_wlen, and (B) hang, if
   if cthr_toggle() is in cond_wait() when the error occurs:

   It sets mc_new and returns success.  cwalk() proceeds to write
   at mc_wbuf+mc_wlen, but copythr() did not reset mc_wlen. Then
   copyfd1() waits for the now-dead copythr() to clear mc_new.

   Fixes:
   (A) join the if(mc_status) and while(mc_new==1) in cthr_toggle.
   (B) Drop the wait before THREAD_FINISH(), it seems pointless.
       Both Posix and Windows doc say that THREAD_FINISH waits.
       Or if Windows needs it for some reason, then copythr must
       live until it receives mc_new=-1, so it can set mc_new=0.

* copyfd1() loses any error from the final cthr_toggle().

   Fix: Return rc ? rc : my.mc_status.

Tested on Linux with fake write errors, with/without fixes.


While not a bug, I also see no point in both threads' initial
waits.  copyfd1() can own the buffer initially (set mc_new=0).
I'm guessing the waits are there because the code tries to guide
the threads from step to step, but it broke because you didn't
think of every possibility.  It's easier to define and maintain
one cond_wait state per thread and keep things consistent there.


More on error checking:

copyfd1() omits most error checking and the cleanup after errors.
E.g. it does not tell copythr to finish if txn_begin fails.

Each thread is slow to react to errors in the other thread. I/O
will cease quicker if both threads set and check mc_status.  If
so, mc_status can no longer be part of the cond_wait condition,
since it won't be mutex-protected.  So the threads might chat a
bit more before finishing.

I'd rename mc_status -> mc_error. "while (!my->mc_error && ...)"
is quite self-explanatory, unlike "while (!my->mc_status && ...)".


About lock/unlock outside vs. inside the loop:
When outside, copythr() must block for the threads to exchange
buffers.  When inside, the slowest thread does not block, but
there are more lock/unlock/wait calls.  Testing: Locks inside is
clearly faster on tmpfs.  On a regular FS I don't know - it may
have been a bit slower, but times varied more than the difference.


Finally, cwalk() could re-use the page buffers it mallocs.

-- 
Hallvard

Comment 4 Hallvard Furuseth 2015-08-28 21:11:26 UTC

More:

MDB_CP_COMPACT would drop a new, empty main DB.
(I.e. the main DB would lose its flags.)

Compacting can verify that the written metapage claims
enough DB pages.  But not the opposite - it should not
fail if it claims too many pages since that would mean
we can't compact a DB which has suffered page leaks.

We could detect that and re-write the metas at the end,
if the user requests it and the file is seekable.

Comment 5 Hallvard Furuseth 2015-09-08 12:23:24 UTC

I wrote:
> Compacting can verify that the written metapage claims
> enough DB pages.  But not the opposite - it should not
> fail if it claims too many pages since that would mean
> we can't compact a DB which has suffered page leaks.

Duh, yes the opposite too.  If there has been a page leak,
the root page number written in the new meta will be wrong.

> We could detect that and re-write the metas at the end,
> if the user requests it and the file is seekable.

Though there's no reliable "is file seekable?" test. lseek()
can do nothing but still return success on some devices.

Another (user) option:  Write a file which requires recovery.
We'll need something like that with incremental backups anyway.
Write dummy metapages at the beginning, with meta.mm_version
indicating the file must be recovered.  Write a final metapage
at the end, or try to seek and rewrite the beginning if the
user says so.

Recovery must be a separate program, or in mdb_env_open()
when we know the metapage is at the end - i.e. before mapping
and maybe expanding the file.

-- 
Hallvard

Comment 6 Hallvard Furuseth 2016-06-28 10:15:14 UTC

changed notes
changed state Open to Test

Comment 7 OpenLDAP project 2016-06-28 11:05:45 UTC

Some fixes in mdb.master

Comment 8 Hallvard Furuseth 2016-06-28 11:05:45 UTC

changed notes