[Date Prev][Date Next] [Chronological] [Thread] [Top]

Re: (ITS#7589) MDB nodesize issues



I wrote:
> OTOH if you add a bunch of slightly smaller nodes, mdb will put
> most of them in separate pages anyway without MDB_APPEND.

...because mdb_page_split() has been wasteful since 48ef27b6f5c804eca6a9
"ITS#7385 fix mdb_page_split (again)".  When a txn put()s ascending
keys with nodes of the same size, the new item goes in the fullest page.

E.g. put data size 1010 with 'int' keys 1,2,3... to an MDB_INTEGERKEY
DB.  As the txn progresses, (page: #key #key...)  get distributed thus:

Page 2: #1.
Page 2: #1 #2.
Page 2: #1 #2 #3.
Page 2: #1.      Page 3: #2 #3 #4.
Page 2: #1.      Page 3: #2.      Page 5: #3 #4 #5.
Page 2: #1.      Page 3: #2.      Page 5: #3.      Page 6: #4 #5 #6.

2/3 wasted space. Descending put() work better:

Page 2: #6.
Page 2: #5 #6.
Page 2: #4 #5 #6.
Page 2: #3 #4.     Page 3: #5 #6.
Page 2: #2 #3 #4.  Page 3: #5 #6.
Page 2: #2 #1.     Page 3: #5 #6.   Page 5: #3 #4.

Ascending put() with datasize 1348, so only 2 nodes fit in a page:

Page 2: #1.
Page 2: #1 #2.
Page 2: #1.   Page 3: #2 #3.
Page 2: #1.   Page 3: #2.   Page 5: #3 #4.

Half of the space is wasted.  Descending order does not help.

page_split() cannot know which split is best in this case.  But it'll
help to guess that the next put() key sometimes will be near this one,
and ensure that the node with the new key will be the smallest.  That
will also avoid touching the old page when the nodes are that large,
since the "split" will keep all old nodes in the old page.


Trying a simpler patch for this one:

> Nodes of size MDB_env.me_nodemax-1 do not go in overflow pages,
> but are too big for two of them to fit in the same page. (...)

Let mdb_env_open2() round me_nodemax down to an odd number:
  env->me_nodemax = (((env->me_psize - PAGEHDRSZ) / MDB_MINKEYS) - 1) | 1;

Then a leaf page with just 1 node can always get another one, since
mdb_node_add() will always put too big items in overflow pages.
In a DB newer than this patch, anyway...

Maybe mdb_cursor_put() should use the value from before
'offset += offset & 1;' in next line's compare with me_nodemax:

@@ -5128,5 +5128,6 @@
    }
+   i = offset;
    offset += offset & 1;
    if (NODESIZE + sizeof(indx_t) + NODEKSZ(leaf) + NODEDSZ(leaf) +
-       offset >= mc->mc_txn->mt_env->me_nodemax) {
+       i >= mc->mc_txn->mt_env->me_nodemax) {
        /* yes, convert it */

This passes 'make test', with or without changing cursor_put().
But I still don't quite know what I'm doing - otherwise my LEAFSIZE
patch would have worked too.

-- 
Hallvard