Issue 8392 - Opening multiple databases from the same process
Summary: Opening multiple databases from the same process
Status: VERIFIED INVALID
Alias: None
Product: LMDB
Classification: Unclassified
Component: liblmdb (show other issues)
Version: 0.9.18
Hardware: All All
: --- normal
Target Milestone: ---
Assignee: OpenLDAP project
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-03-26 17:05 UTC by matthew@wellquite.org
Modified: 2021-06-02 15:05 UTC (History)
0 users

See Also:


Attachments
log (19.96 KB, text/plain)
2016-03-26 22:42 UTC, matthew@wellquite.org
Details
log (16.51 KB, text/plain)
2016-03-26 21:07 UTC, matthew@wellquite.org
Details

Note You need to log in before you can comment on or make changes to this issue.
Description matthew@wellquite.org 2016-03-26 17:05:38 UTC
Full_Name: Matthew Sackman
Version: LMDB 0.9.18
OS: Linux
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (109.149.185.170)


Linux koba 4.4.0-1-amd64 #1 SMP Debian 4.4.6-1 (2016-03-17) x86_64 GNU/Linux
LMDB 0.9.18 (compiled locally)

I'm currently writing a program that iterates through multiple LMDB databases in
order to check differences between them. On each database I have one
long-running read-only txn with a cursor, and I'm walking through all the
databases at the same time.

I'm seeing some very odd behaviour. For example, cursor.Get with FIRST is not
always returning the first item in the table (which I happen to know exactly
what it should be). The behaviour is non-deterministic - different runs produce
different results.

I've searched through the docs and I can't find any statements regarding opening
multiple dbs from the same process. Is this supported?

I'm writing this in Go, using my own bindings. I'm pretty certain that I'm doing
all the threading correctly and I certainly get this same issue regardless of
whether I'm using NOTLS or not (with my bindings, it shouldn't be necessary
because I'm locking go-routines to os-threads), plus these are the same bindings
that I've used throughout GoshawkDB and I've not come across this problem until
now.

I can guarantee that each DB that I open will be opened in a different OS thread
and that no single OS thread will ever access different DBs.
Comment 1 Howard Chu 2016-03-26 18:44:29 UTC
matthew@wellquite.org wrote:
> Full_Name: Matthew Sackman
> Version: LMDB 0.9.18
> OS: Linux
> URL: ftp://ftp.openldap.org/incoming/
> Submission from: (NULL) (109.149.185.170)
>
>
> Linux koba 4.4.0-1-amd64 #1 SMP Debian 4.4.6-1 (2016-03-17) x86_64 GNU/Linux
> LMDB 0.9.18 (compiled locally)
>
> I'm currently writing a program that iterates through multiple LMDB databases in
> order to check differences between them. On each database I have one
> long-running read-only txn with a cursor, and I'm walking through all the
> databases at the same time.
>
> I'm seeing some very odd behaviour. For example, cursor.Get with FIRST is not
> always returning the first item in the table (which I happen to know exactly
> what it should be). The behaviour is non-deterministic - different runs produce
> different results.
>
> I've searched through the docs and I can't find any statements regarding opening
> multiple dbs from the same process. Is this supported?
>
> I'm writing this in Go, using my own bindings. I'm pretty certain that I'm doing
> all the threading correctly and I certainly get this same issue regardless of
> whether I'm using NOTLS or not (with my bindings, it shouldn't be necessary
> because I'm locking go-routines to os-threads), plus these are the same bindings
> that I've used throughout GoshawkDB and I've not come across this problem until
> now.
>
> I can guarantee that each DB that I open will be opened in a different OS thread
> and that no single OS thread will ever access different DBs.

Is this entire task read-only, nobody is modifying any of the DBs while these 
readers are iterating thru the DBs?

I've only used Go very briefly. I would be quite suspicious of that goroutine 
behavior but don't have any insight there.

You might consider compiling LMDB with MDB_DEBUG defined, enabling debug 
output, and then watching the cursor activity. Of course, the act of writing 
debug messages will impact thread scheduling, so it may not reveal what's 
going on.

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Comment 2 matthew@wellquite.org 2016-03-26 21:07:28 UTC
Hi Howard,

Thanks for the quick reply.

On Sat, Mar 26, 2016 at 06:44:29PM +0000, Howard Chu wrote:
> Is this entire task read-only, nobody is modifying any of the DBs while
> these readers are iterating thru the DBs?

This entire process is read only, and indeed the transactions that
contain cursors are marked RDONLY. No other process is running that
accesses any of these databases.

> You might consider compiling LMDB with MDB_DEBUG defined, enabling debug
> output, and then watching the cursor activity. Of course, the act of writing
> debug messages will impact thread scheduling, so it may not reveal what's
> going on.

Ok, I've done all that and got debug output going. I can still provoke
the same problem easily. A log is attached.
Ultimately, both cursors should start on a key
0000000000000000000000000000000000000000, but the second cursor/db
discovers it actually starts on
0000000000000000000000000000000141d8872b.

One problem is that the debug output gives no indication of which thread
is which so it seems pretty tricky to separate out the different calls.
Do you have any hints or tricks for doing this?

Matthew
Comment 3 matthew@wellquite.org 2016-03-26 22:42:09 UTC
> One problem is that the debug output gives no indication of which thread
> is which so it seems pretty tricky to separate out the different calls.
> Do you have any hints or tricks for doing this?

Ok, so I altered the DPRINTF so that it gets the current thread id too:
  fprintf(stderr, "%s:%d:%lu " fmt "\n", mdb_func_, __LINE__, syscall(__NR_gettid), __VA_ARGS__)
I also added some DPUTS to the start of every function in the API.

New log attached. Essentially, there are four threads of interest:
        7036 - write txns for /tmp/data1
        7039 - read txns (incl cursor) for /tmp/data1
        7040 - write txns for /tmp/data2
        7041 - read txns (incl cursor) for /tmp/data2

As there are no writes going on, the write threads do no work once the
envs and dbis are created and opened.

In this run, it's 7039 that finds the usual problem. If we just grep for
7039 from the log, we get (edited highlights):

mdb_txn_begin:2860:7039 begin txn 1862r 0x7f4034000a10 on mdbenv 0xd483c0, root page 417
...
mdb_get:5570:7039 mdb_get
mdb_get:5576:7039 ===> get db 4 key [00000000000000010000000000000001311e7dc5]
...
mdb_cursor_set:6002:7039 ==> cursor placed on key [00000000000000010000000000000001311e7dc5]
mdb_txn_reset:2990:7039 mdb_txn_reset
mdb_txn_end:2919:7039 mdb_txn_end
mdb_txn_end:2931:7039 reset txn 1862r 0x7f4034000a10 on mdbenv 0xd483c0, root page 417

... now we start up /tmp/data2 in the other threads.
    When we next see 7039 return:

mdb_txn_renew:2749:7039 mdb_txn_renew
mdb_txn_renew:2759:7039 renew txn 1862r 0x7f4034000a10 on mdbenv 0xd483c0, root page 417
mdb_cursor_open:7408:7039 mdb_cursor_open
mdb_page_search:5445:7039 db 1 root page 417 has flags 0x2
mdb_page_search_root:5342:7039 found leaf page 417 for key [5472616e73616374696f6e73]
mdb_node_search:5091:7039 searching 5 keys in leaf page 417
mdb_node_search:5135:7039 found leaf index 2 [5472616e73616374696f6e52656673], rc = 8448
mdb_node_search:5135:7039 found leaf index 3 [5472616e73616374696f6e73], rc = 0
mdb_page_search:5445:7039 db 4 root page 276 has flags 0x1
mdb_cursor_get:6103:7039 mdb_cursor_get
mdb_cursor_first:6011:7039 mdb_cursor_first
mdb_page_search:5445:7039 db 4 root page 276 has flags 0x1
mdb_page_search_root:5290:7039 branch page 276 has 2 keys
mdb_page_search_root:5296:7039 found index 0 to page 710
mdb_cursor_push:5202:7039 pushing page 710 on db 4 cursor 0x7f4034000bd0
mdb_page_search_root:5290:7039 branch page 710 has 134 keys
mdb_page_search_root:5296:7039 found index 0 to page 43
mdb_cursor_push:5202:7039 pushing page 43 on db 4 cursor 0x7f4034000bd0
mdb_page_search_root:5342:7039 found leaf page 43 for key [null]

... at this point the error is detected and we start tidying up.

From the tids that are being logged, it doesn't appear to me that
anything's up with the threading. It looks like there's some issue with
opening a cursor in a read only txn that has already been reset and
renewed.

The strange thing is that this only happens when this program is running
across more than one database. With a single database it never goes
wrong. So it seems like there's some interaction between opening a
second db and an existing db (with reset txns?).

Many thanks for your time.

Matthew
Comment 4 Howard Chu 2016-03-26 23:31:58 UTC
Nothing immediately obvious jumps out. You might try adding some 
mdb_page_list() calls in here, maybe at each point where it pushes a new page 
onto the cursor.

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Comment 5 matthew@wellquite.org 2016-03-27 07:17:51 UTC
Fixed. The short version is it's definitely a problem of my own making,
though it's not related to threading. It's an oversight in the design of
my own bindings which only causes an issue when you open multiple
databases at the same time. I shall be able to fix without a problem in
my bindings.

Thank you for your help, and sorry for wasting your time.

Kind regards,

Matthew