Full_Name: Hallvard B Furuseth
Version: 2.4.35
OS: Linux x86_64
URL:
Submission from: (NULL) (2001:700:100:556::233)
Submitted by: hallvard


Nodes of size MDB_env.me_nodemax-1 do not go in overflow pages, but are
too big for two of them to fit in the same page.  Because:

- LEAFSIZE() does not round up to a multiple of 2.
- MDB_env.me_nodemax is described as /** Max size of a node on a page */
  (i.e. a non-overflow page), but actually it is the minimum node size
  which mdb_node_add() *will* put in an overflow page.

I don't see a reason to treat nodes of size 2k-1 and 2k differently
when seen "from the outside", so I think LEAFSIZE should have been

    #define LEAFSIZE(k, d)	 ((1 + NODESIZE + (k)->mv_size + (d)->mv_size) & -2)

Also mdb_leaf_size():

    sz = LEAFSIZE(key, data);
    if (sz >= env->me_nodemax)
        sz = (1 + NODESIZE + sizeof(pgno_t) + key->mv_size) & -2;
    return sz + sizeof(indx_t);

and a corresponding fix in mdb_node_add().

Those changes crash test001.  I don't know what else would need fixing.
Subpage sizes, for one thing.  OTOH, if you add a bunch of slightly
smaller nodes, mdb will put most of them in separate pages anyway
without MDB_APPEND.  To keep down rebalancing when adding nodes in the
middle?  Or is this another bug?  So I don't know if more than the
me_nodemax doc needs a fix.  And maybe a rename to something which fits
what it does.

...and also the doc of mdb_branch_size().  "Sizes are always rounded up
to an even number of bytes, to guarantee 2-byte alignment" happens
somewhere, but not in mdb_branch_size(), which happily returns uneven
sizes.  But it works fine, since it is only used as
SIZELEFT() < mdb_branch_size(), where SIZELEFT() does return an even
size.  Rounding branch_size up would make no difference.
I wrote:
> OTOH if you add a bunch of slightly smaller nodes, mdb will put
> most of them in separate pages anyway without MDB_APPEND.

...because mdb_page_split() has been wasteful since 48ef27b6f5c804eca6a9
"ITS#7385 fix mdb_page_split (again)".  When a txn put()s ascending keys
with nodes of the same size, the new item goes in the fullest page.

E.g. put data size 1010 with 'int' keys 1,2,3... to an MDB_INTEGERKEY
DB.  As the txn progresses, the items (page: #key #key...) get
distributed thus:

    Page 2: #1.
    Page 2: #1 #2.
    Page 2: #1 #2 #3.
    Page 2: #1.  Page 3: #2 #3 #4.
    Page 2: #1.  Page 3: #2.  Page 5: #3 #4 #5.
    Page 2: #1.  Page 3: #2.  Page 5: #3.  Page 6: #4 #5 #6.

2/3 of the space is wasted.  Descending put()s work better:

    Page 2: #6.
    Page 2: #5 #6.
    Page 2: #4 #5 #6.
    Page 2: #3 #4.  Page 3: #5 #6.
    Page 2: #2 #3 #4.  Page 3: #5 #6.
    Page 2: #2 #1.  Page 3: #5 #6.  Page 5: #3 #4.

Ascending put()s with data size 1348, so only 2 nodes fit in a page:

    Page 2: #1.
    Page 2: #1 #2.
    Page 2: #1.  Page 3: #2 #3.
    Page 2: #1.  Page 3: #2.  Page 5: #3 #4.

Half of the space is wasted.  Descending order does not help.

page_split() cannot know which split is best in this case.  But it will
help to guess that the next put() key will sometimes be near this one,
and to ensure that the node with the new key will be the smallest.  That
will also avoid touching the old page when the nodes are that large,
since the "split" will keep all old nodes in the old page.

Trying a simpler patch for this one:

> Nodes of size MDB_env.me_nodemax-1 do not go in overflow pages,
> but are too big for two of them to fit in the same page.  (...)

Let mdb_env_open2() round me_nodemax down to an odd number:

    env->me_nodemax = (((env->me_psize - PAGEHDRSZ) / MDB_MINKEYS) - 1) | 1;

Then a leaf page with just 1 node can always get another one, since
mdb_node_add() will always put too-big items in overflow pages.  In a
DB newer than this patch, anyway...
Maybe mdb_cursor_put() should use the value from before
'offset += offset & 1;' in the next line's compare with me_nodemax:

    @@ -5128,5 +5128,6 @@
     			}
    +			i = offset;
     			offset += offset & 1;
     			if (NODESIZE + sizeof(indx_t) + NODEKSZ(leaf) + NODEDSZ(leaf) +
    -				offset >= mc->mc_txn->mt_env->me_nodemax) {
    +				i >= mc->mc_txn->mt_env->me_nodemax) {
     				/* yes, convert it */

This passes 'make test', with or without changing cursor_put().  But I
still don't quite know what I'm doing - otherwise my LEAFSIZE patch
would have worked too.

-- 
Hallvard
Another issue: This breaks, because liblmdb does not change the node
size when overwriting an overflow page:

    mdb_put(txn, dbi, &key, &(MDB_val){9000, val}, 0);
    mdb_put(txn, dbi, &key, &(MDB_val){1, val}, 0);
    mdb_get(txn, dbi, &key, &data);
    assert(data.mv_size != 9000);

Fixed by 0cdd9dffddf66c730a35f48db2bb02d8bb3e5731.

mdb_cursor_put() said:

    /* yes, overwrite it. Note in this case we don't
     * bother to try shrinking the node if the new data
     * is smaller than the overflow threshold. */

It's the number of overflow pages which does not need to shrink.

-- 
Hallvard
changed notes changed state Open to Test moved from Incoming to Software Bugs
changed state Test to Partial
changed notes changed state Partial to Test
h.b.furuseth@usit.uio.no wrote:
> I wrote:
>> OTOH if you add a bunch of slightly smaller nodes, mdb will put
>> most of them in separate pages anyway without MDB_APPEND.
>
> ...because mdb_page_split() has been wasteful since 48ef27b6f5c804eca6a9
> "ITS#7385 fix mdb_page_split (again)".  When a txn put()s ascending
> keys with nodes of the same size, the new item goes in the fullest page.
>
> E.g. put data size 1010 with 'int' keys 1,2,3... to an MDB_INTEGERKEY
> DB.  As the txn progresses, the items (page: #key #key...) get
> distributed thus:
>
>     Page 2: #1.
>     Page 2: #1 #2.
>     Page 2: #1 #2 #3.
>     Page 2: #1.  Page 3: #2 #3 #4.
>     Page 2: #1.  Page 3: #2.  Page 5: #3 #4 #5.
>     Page 2: #1.  Page 3: #2.  Page 5: #3.  Page 6: #4 #5 #6.
>
> 2/3 of the space is wasted.  Descending put()s work better:
>
>     Page 2: #6.
>     Page 2: #5 #6.
>     Page 2: #4 #5 #6.
>     Page 2: #3 #4.  Page 3: #5 #6.
>     Page 2: #2 #3 #4.  Page 3: #5 #6.
>     Page 2: #2 #1.  Page 3: #5 #6.  Page 5: #3 #4.
>
> Ascending put()s with data size 1348, so only 2 nodes fit in a page:
>
>     Page 2: #1.
>     Page 2: #1 #2.
>     Page 2: #1.  Page 3: #2 #3.
>     Page 2: #1.  Page 3: #2.  Page 5: #3 #4.
>
> Half of the space is wasted.  Descending order does not help.
>
> page_split() cannot know which split is best in this case.  But it'll
> help to guess that the next put() key sometimes will be near this one,
> and ensure that the node with the new key will be the smallest.  That
> will also avoid touching the old page when the nodes are that large,
> since the "split" will keep all old nodes in the old page.

Fixed now in mdb.master.

On a large DB, the previous code used 3276587 pages in slapadd -q, and
the new code uses 3272633 pages.  It's only a 0.13% savings in this
case; it seems the frequency of these insert patterns is quite rare.
The runtime is also 1.1% faster, going from

    real    41m35.8s
    user    50m57.6s
    sys      5m11.4s

to

    real    41m1.7s
    user    50m25.8s
    sys      4m55.3s

-- 
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
changed notes changed state Test to Release
changed notes changed state Release to Closed
changed notes changed state Closed to Partial
Howard Chu writes:
> Fixed now in mdb.master.  (...)

The me_nodemax issue remains, though.  Should be fixed in my branch
"mdb/its7589".

However, another issue: mdb_node_shrink() can give the node containing
the sub-page an uneven size, if MDB_DUPFIXED keys have uneven size:

    mdb_dbi_open(txn, "db", MDB_DUPSORT|MDB_DUPFIXED, &dbi);
    mdb_put(txn, dbi, &(MDB_val){4, "key"}, &(MDB_val){1, "A"}, 0);
    mdb_put(txn, dbi, &(MDB_val){4, "key"}, &(MDB_val){1, "B"}, 0);
    mdb_del(txn, dbi, &(MDB_val){4, "key"}, &(MDB_val){1, "B"});

Fix: Do nothing if the node would be uneven-sized.  Similar to how
mdb_cursor_put() adds room for 4N nodes at a time with MDB_DUPFIXED.
I don't know if the cursor tracking code in mdb_cursor_del() must
still run.

I don't know what to do about databases that already contain such
uneven-sized nodes.  Maybe support reading them (if that needs any
changes), but try to assert when writing.  "Non-LEAF2 nodes are
even-sized" feels like a deeply embedded assumption, but I haven't
really looked.

-- 
Hallvard
Hallvard Breien Furuseth wrote:
> "Non-LEAF2 nodes are even-sized" feels like a deeply embedded
> assumption, but I haven't really looked.

Explicitly documented:

    /** Header for a single key/data pair within a page.
     * We guarantee 2-byte alignment for nodes.
     */
    typedef struct MDB_node {

Obviously you cannot guarantee 2-byte alignment unless nodes are always
even-sized.

-- 
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
Howard Chu writes:
> Hallvard Breien Furuseth wrote:
>> "Non-LEAF2 nodes are even-sized" feels like a deeply embedded
>> assumption, but I haven't really looked.
>
> Explicitly documented:
>
>     /** Header for a single key/data pair within a page.
>      * We guarantee 2-byte alignment for nodes.
>      */
>     typedef struct MDB_node {
>
> Obviously you cannot guarantee 2-byte alignment unless nodes are
> always even-sized.

Indeed.  But I was wondering how hard to try to rescue databases where
that doc is false.

-- 
Hallvard
Howard Chu writes:
> Explicitly documented:
>
>     /** Header for a single key/data pair within a page.
>      * We guarantee 2-byte alignment for nodes.
>      */
>     typedef struct MDB_node {

...er, that's false for LEAF2 nodes, since we don't reject uneven-sized
DUPFIXED keys.  But that part seems to work fine; it's the non-LEAF2
case which breaks.

-- 
Hallvard
Fixed in mdb.master

Some fixes in master
Some fixes in RE24
Some fixes in RE25
changed notes changed state Test to Partial