7589 – MDB nodesize issues

Issue 7589 - MDB nodesize issues

Summary: MDB nodesize issues

Status:	RESOLVED PARTIAL

Alias:	None

Product:	OpenLDAP
Classification:	Unclassified
Component:	slapd (show other issues)
Version:	2.4.35
Hardware:	All All

Importance:	--- normal
Target Milestone:	---
Assignee:	OpenLDAP project

URL:
Keywords:

Depends on:
Blocks:

Reported:	2013-05-10 16:23 UTC by Hallvard Furuseth
Modified:	2014-12-11 01:16 UTC (History)
CC List:	0 users

See Also:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this issue.

Description Hallvard Furuseth 2013-05-10 16:23:45 UTC

Full_Name: Hallvard B Furuseth
Version: 2.4.35
OS: Linux x86_64
URL: 
Submission from: (NULL) (2001:700:100:556::233)
Submitted by: hallvard


Nodes of size MDB_env.me_nodemax-1 do not go in overflow pages,
but are too big for two of them to fit in the same page. Because:
- LEAFSIZE() does not round up to a factor of 2.
- MDB_env.me_nodemax is described as /** Max size of a node on a page */
  (i.e. non-overflow page), but actually it's the minimum node size
  which mdb_node_add() *will* put in an overflow page.

I don't see a reason to treat nodes of size 2k-1 and 2k differently
when seen "from the outside", so I think LEAFSIZE should have been

#define LEAFSIZE(k, d) ((1 + NODESIZE + (k)->mv_size + (d)->mv_size) & -2)
Also mdb_leaf_size():
    sz = LEAFSIZE(key, data);
    if (sz >= env->me_nodemax)
        sz = (1 + NODESIZE + sizeof(pgno_t) + key->mv_size) & -2;
    return sz + sizeof(indx_t);
and a corresponding fix in mdb_node_add().

Those changes crash test001.  I don't know what else would need
fixing.  Subpage sizes for one thing.

OTOH if you add a bunch of slightly smaller nodes, mdb will put
most of them in separate pages anyway without MDB_APPEND.  To keep
down rebalancing when adding nodes in the middle?  Or is this
another bug?  So I don't know if more than the me_nodemax doc a
fix.  And maybe a rename to something which fits what it does.

...and also the doc of mdb_branch_size().  "Sizes are always
rounded up to an even number of bytes, to guarantee 2-byte
alignment" happens somewhere, but not in mdb_branch_size() which
happily returns uneven sizes.  But it works fine since it's only
used as SIZELEFT()<mdb_branch_size() where SIZELEFT() does return
an even size. Rounding branch_size up would make no difference.

Comment 1 Hallvard Furuseth 2013-05-14 06:32:52 UTC

I wrote:
> OTOH if you add a bunch of slightly smaller nodes, mdb will put
> most of them in separate pages anyway without MDB_APPEND.

...because mdb_page_split() has been wasteful since 48ef27b6f5c804eca6a9
"ITS#7385 fix mdb_page_split (again)".  When a txn put()s ascending
keys with nodes of the same size, the new item goes in the fullest page.

E.g. put data size 1010 with 'int' keys 1,2,3... to an MDB_INTEGERKEY
DB.  As the txn progresses, (page: #key #key...)  get distributed thus:

Page 2: #1.
Page 2: #1 #2.
Page 2: #1 #2 #3.
Page 2: #1.      Page 3: #2 #3 #4.
Page 2: #1.      Page 3: #2.      Page 5: #3 #4 #5.
Page 2: #1.      Page 3: #2.      Page 5: #3.      Page 6: #4 #5 #6.

2/3 wasted space. Descending put() work better:

Page 2: #6.
Page 2: #5 #6.
Page 2: #4 #5 #6.
Page 2: #3 #4.     Page 3: #5 #6.
Page 2: #2 #3 #4.  Page 3: #5 #6.
Page 2: #2 #1.     Page 3: #5 #6.   Page 5: #3 #4.

Ascending put() with datasize 1348, so only 2 nodes fit in a page:

Page 2: #1.
Page 2: #1 #2.
Page 2: #1.   Page 3: #2 #3.
Page 2: #1.   Page 3: #2.   Page 5: #3 #4.

Half of the space is wasted.  Descending order does not help.

page_split() cannot know which split is best in this case.  But it'll
help to guess that the next put() key sometimes will be near this one,
and ensure that the node with the new key will be the smallest.  That
will also avoid touching the old page when the nodes are that large,
since the "split" will keep all old nodes in the old page.

Trying a simpler patch for this one:

> Nodes of size MDB_env.me_nodemax-1 do not go in overflow pages,
> but are too big for two of them to fit in the same page. (...)

Let mdb_env_open2() round me_nodemax down to an odd number:
  env->me_nodemax = (((env->me_psize - PAGEHDRSZ) / MDB_MINKEYS) - 1) | 1;

Then a leaf page with just 1 node can always get another one, since
mdb_node_add() will always put too big items in overflow pages.
In a DB newer than this patch, anyway...

Maybe mdb_cursor_put() should use the value from before
'offset += offset & 1;' in next line's compare with me_nodemax:

@@ -5128,5 +5128,6 @@
    }
+   i = offset;
    offset += offset & 1;
    if (NODESIZE + sizeof(indx_t) + NODEKSZ(leaf) + NODEDSZ(leaf) +
-       offset >= mc->mc_txn->mt_env->me_nodemax) {
+       i >= mc->mc_txn->mt_env->me_nodemax) {
        /* yes, convert it */

This passes 'make test', with or without changing cursor_put().
But I still don't quite know what I'm doing - otherwise my LEAFSIZE
patch would have worked too.

-- 
Hallvard

Comment 2 Hallvard Furuseth 2013-05-17 17:35:32 UTC

Another issue: This breaks because liblmdb does not
change the node size when overwriting an overflow page:

	mdb_put(txn, dbi, &key, &(MDB_val){9000, val}, 0);
	mdb_put(txn, dbi, &key, &(MDB_val){1,    val}, 0);
	mdb_get(txn, dbi, &key, &data);
	assert(data.mv_size != 9000);

Fixed by 0cdd9dffddf66c730a35f48db2bb02d8bb3e5731.
mdb_cursor_put() said:
	/* yes, overwrite it. Note in this case we don't
	 * bother to try shrinking the node if the new data
	 * is smaller than the overflow threshold.
	 */
It's number of pages which does not need to shrink.

-- 
Hallvard

Comment 3 Hallvard Furuseth 2013-05-17 20:42:25 UTC

changed notes
changed state Open to Test
moved from Incoming to Software Bugs

Comment 4 Quanah Gibson-Mount 2013-07-01 22:13:25 UTC

changed state Test to Partial

Comment 5 Howard Chu 2013-10-12 09:44:51 UTC

changed notes
changed state Partial to Test

Comment 6 Howard Chu 2013-10-12 20:27:22 UTC

h.b.furuseth@usit.uio.no wrote:
> I wrote:
>> OTOH if you add a bunch of slightly smaller nodes, mdb will put
>> most of them in separate pages anyway without MDB_APPEND.
>
> ...because mdb_page_split() has been wasteful since 48ef27b6f5c804eca6a9
> "ITS#7385 fix mdb_page_split (again)".  When a txn put()s ascending
> keys with nodes of the same size, the new item goes in the fullest page.
>
> E.g. put data size 1010 with 'int' keys 1,2,3... to an MDB_INTEGERKEY
> DB.  As the txn progresses, (page: #key #key...)  get distributed thus:
>
> Page 2: #1.
> Page 2: #1 #2.
> Page 2: #1 #2 #3.
> Page 2: #1.      Page 3: #2 #3 #4.
> Page 2: #1.      Page 3: #2.      Page 5: #3 #4 #5.
> Page 2: #1.      Page 3: #2.      Page 5: #3.      Page 6: #4 #5 #6.
>
> 2/3 wasted space. Descending put() work better:
>
> Page 2: #6.
> Page 2: #5 #6.
> Page 2: #4 #5 #6.
> Page 2: #3 #4.     Page 3: #5 #6.
> Page 2: #2 #3 #4.  Page 3: #5 #6.
> Page 2: #2 #1.     Page 3: #5 #6.   Page 5: #3 #4.
>
> Ascending put() with datasize 1348, so only 2 nodes fit in a page:
>
> Page 2: #1.
> Page 2: #1 #2.
> Page 2: #1.   Page 3: #2 #3.
> Page 2: #1.   Page 3: #2.   Page 5: #3 #4.
>
> Half of the space is wasted.  Descending order does not help.
>
> page_split() cannot know which split is best in this case.  But it'll
> help to guess that the next put() key sometimes will be near this one,
> and ensure that the node with the new key will be the smallest.  That
> will also avoid touching the old page when the nodes are that large,
> since the "split" will keep all old nodes in the old page.

Fixed now in mdb.master. On a large DB, the previous code used 3276587 pages 
in slapadd -q, and the new code uses 3272633 pages. It's only a 0.13% savings 
in this case, it seems the frequency of these insert patterns is quite rare. 
The runtime is also 1.1% faster going from
    real 41m35.8s  user 50m57.6s  sys 5m11.4s
to
    real 41m1.7s   user 50m25.8s  sys 4m55.3s

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Comment 7 Quanah Gibson-Mount 2013-10-14 09:15:44 UTC

changed notes
changed state Test to Release

Comment 8 Quanah Gibson-Mount 2013-10-28 09:27:44 UTC

changed notes
changed state Release to Closed

Comment 9 Hallvard Furuseth 2013-11-25 08:08:20 UTC

changed notes
changed state Closed to Partial

Comment 10 Hallvard Furuseth 2013-11-25 15:04:52 UTC

Howard Chu writes:
> Fixed now in mdb.master. (...)

The me_nodemax issue remains though.  Should be fixed in my branch
"mdb/its7589".


However, another issue:

mdb_node_shrink() can give the node containing the sub-page
uneven size, if MDB_DUPFIXED keys have uneven size:
  mdb_dbi_open(txn, "db", MDB_DUPSORT|MDB_DUPFIXED, &dbi);
  mdb_put(txn, dbi, &(MDB_val){4, "key"}, &(MDB_val){1, "A"}, 0);
  mdb_put(txn, dbi, &(MDB_val){4, "key"}, &(MDB_val){1, "B"}, 0);
  mdb_del(txn, dbi, &(MDB_val){4, "key"}, &(MDB_val){1, "B"});

Fix: Do nothing if the node would be uneven-sized.  Similar to how
mdb_cursor_put() adds room for 4N nodes at a time, with MDB_DUPFIXED.
Dunno if the cursor tracking code in mdb_cursor_del() must still run.

I don't know what to do about databases that already contain such
uneven-sized nodes.  Maybe support reading them (if that needs any
changes), but try to assert when writing.  "Non-LEAF2 nodes are
even-sized" feels like a deeply embedded assumption, but I haven't
really looked.

-- 
Hallvard

Comment 11 Howard Chu 2013-11-25 15:08:46 UTC

Hallvard Breien Furuseth wrote:
  "Non-LEAF2 nodes are
> even-sized" feels like a deeply embedded assumption, but I haven't
> really looked.

Explicitly documented:

     /** Header for a single key/data pair within a page.
      * We guarantee 2-byte alignment for nodes.
      */
typedef struct MDB_node {

Obviously you cannot guarantee 2-byte alignment unless nodes are always even 
sized.
-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Comment 12 Hallvard Furuseth 2013-11-25 15:13:28 UTC

Howard Chu writes:
> Hallvard Breien Furuseth wrote:
>   "Non-LEAF2 nodes are
>> even-sized" feels like a deeply embedded assumption, but I haven't
>> really looked.
> 
> Explicitly documented:
> 
>      /** Header for a single key/data pair within a page.
>       * We guarantee 2-byte alignment for nodes.
>       */
> typedef struct MDB_node {
> 
> Obviously you cannot guarantee 2-byte alignment unless nodes are always even 
> sized.

Indeed.  But I was wondering how hard to try to rescue databases
where that doc is false.

-- 
Hallvard

Comment 13 Hallvard Furuseth 2013-11-25 15:16:02 UTC

Howard Chu writes:
> Explicitly documented:
> 
>      /** Header for a single key/data pair within a page.
>       * We guarantee 2-byte alignment for nodes.
>       */
> typedef struct MDB_node {

...er, that's false for LEAF2 nodes since we don't reject uneven-sized
DUPFIXED keys.  But that part seems to work fine, it's the non-LEAF2
which breaks.

-- 
Hallvard

Comment 14 Hallvard Furuseth 2013-12-21 16:55:49 UTC

changed notes
changed state Partial to Test

Comment 15 OpenLDAP project 2014-12-11 01:16:05 UTC

Fixed in mdb.master
Some fixes in master
Some fixes in RE24
Some fixes in RE25

Comment 16 Quanah Gibson-Mount 2014-12-11 01:16:05 UTC

changed notes
changed state Test to Partial