Full_Name: Nic Watson Version: LMDB v 0.9.21 OS: Ubuntu 17.04 URL: ftp://ftp.openldap.org/incoming/ Submission from: (NULL) (108.56.136.246) I'm getting a seg fault in using LMDB on a database opened with MDB_DUPSORT. Here's a minimal set of operations that will cause the problem: #include <stdio.h> #include "lmdb.h" #define ARRAY_SIZE(array) (sizeof(array) / sizeof(*array)) void cause_crash() { MDB_env *env; MDB_txn *txn; MDB_dbi dbi; int rc; int line_fail = 0; MDB_val key = {.mv_size = 1, .mv_data = "a"}; MDB_val vals[] = { { .mv_size = 8, .mv_data = "\x00\x00\x00\x00\x00\x00\x00\x00" }, { .mv_size = 1, .mv_data = "\x05" }, { .mv_size = 1, .mv_data = "\t" }, { .mv_size = 1, .mv_data = "\r" }, { .mv_size = 1, .mv_data = "\x11" }, { .mv_size = 1, .mv_data = "\x15" }, { .mv_size = 1, .mv_data = "\x19" }, { .mv_size = 1, .mv_data = "\x1d" }, { .mv_size = 1, .mv_data = "!" }, { .mv_size = 1, .mv_data = "%" }, { .mv_size = 1, .mv_data = ")" }, { .mv_size = 1, .mv_data = "-" }, { .mv_size = 1, .mv_data = "1" }, { .mv_size = 1, .mv_data = "5" }, }; rc = mdb_env_create(&env); if (rc) { line_fail = __LINE__; goto fail; } rc = mdb_env_set_maxdbs(env, 2); if (rc) { line_fail = __LINE__; goto fail; } rc = mdb_env_open(env, "foo.lmdb", 0, 0777); if (rc) { line_fail = __LINE__; goto fail; } rc = mdb_txn_begin(env, NULL, 0, &txn); if (rc) { line_fail = __LINE__; goto fail; } rc = mdb_dbi_open(txn, "another_db", MDB_CREATE | MDB_DUPSORT, &dbi); if (rc) { line_fail = __LINE__; goto fail; } rc = mdb_txn_commit(txn); if (rc) { line_fail = __LINE__; goto fail; } rc = mdb_txn_begin(env, NULL, 0, &txn); if (rc) { line_fail = __LINE__; goto fail; } for (int i = 0; i < ARRAY_SIZE(vals); i++) { rc = mdb_put(txn, dbi, &key, &vals[i], 0); if (rc) { line_fail = __LINE__; goto fail; } } rc = mdb_txn_commit(txn); if (rc) { line_fail = __LINE__; goto fail; } mdb_dbi_close(env, dbi); mdb_env_close(env); return; fail: printf("Failed with error %d on line %d\n", rc, line_fail); } int main(int argc, char **argv) { printf("Running %s\n", MDB_VERSION_STRING); cause_crash(); printf("Didn't crash!\n"); return 0; } Here's a simple sh make script: LMDB_DIR=~/src/lmdb/libraries/liblmdb gcc-6 -O3 -ggdb -std=c11 -Wall -I $LMDB_DIR lmdb_crash2.c $LMDB_DIR/{mdb.c,midl.c} -lpthread -o mdb_c_exe mkdir -p foo.lmdb Here's a basic stack trace: Program received signal SIGSEGV, Segmentation fault. 0x000055555555ee7c in mdb_cursor_put (mc=mc@entry=0x7fffffffcfa0, key=key@entry=0x7fffffffd380, data=data@entry=0x7fffffffd460, flags=flags@entry=0) at /home/nic/src/lmdb/libraries/liblmdb/mdb.c:6792 6792 mp->mp_ptrs[i] = fp->mp_ptrs[i] + offset; (gdb) bt #0 0x000055555555ee7c in mdb_cursor_put (mc=mc@entry=0x7fffffffcfa0, key=key@entry=0x7fffffffd380, data=data@entry=0x7fffffffd460, flags=flags@entry=0) at /home/nic/src/lmdb/libraries/liblmdb/mdb.c:6792 #1 0x00005555555635d0 in mdb_cursor_put (flags=0, data=0x7fffffffd460, key=0x7fffffffd380, mc=0x7fffffffcfa0) at /home/nic/src/lmdb/libraries/liblmdb/mdb.c:8984 #2 mdb_put (txn=0x55555576e8d0, It will *not* crash under debug. In fact, -O3 -fvect-cost-model=cheap will *not* crash. This makes some sense since it is crashing on an SSE instruction. I tried gcc versions (Ubuntu 7.2.0-8ubuntu3.2) and (Ubuntu 6.4.0-8ubuntu1) with the same result. I also tried with the mdb.master branch (0.9.70) with the same result. I'm not convinced this a fault in your code. It may be a gcc bug.
github@nicwatson.org wrote: > Full_Name: Nic Watson > Version: LMDB v 0.9.21 > OS: Ubuntu 17.04 > URL: ftp://ftp.openldap.org/incoming/ > Submission from: (NULL) (108.56.136.246) > > > I'm getting a seg fault in using LMDB on a database opened with MDB_DUPSORT. > > Here's a minimal set of operations that will cause the problem: > It will *not* crash under debug. In fact, -O3 -fvect-cost-model=cheap will > *not* crash. This makes some sense since it is crashing on an SSE instruction. > > I tried gcc versions (Ubuntu 7.2.0-8ubuntu3.2) and (Ubuntu 6.4.0-8ubuntu1) with > the same result. I also tried with the mdb.master branch (0.9.70) with the same > result. > > I'm not convinced this a fault in your code. It may be a gcc bug. Interesting. I've got gcc 5.4.0 on Ubuntu 16.04 here, and no crash. Also rather puzzled that there's anything vectorizable in this code, that seems pretty unlikely. -- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/
hyc@symas.com wrote: > github@nicwatson.org wrote: >> Full_Name: Nic Watson >> Version: LMDB v 0.9.21 >> OS: Ubuntu 17.04 >> URL: ftp://ftp.openldap.org/incoming/ >> Submission from: (NULL) (108.56.136.246) >> >> >> I'm getting a seg fault in using LMDB on a database opened with MDB_DUPSORT. >> >> Here's a minimal set of operations that will cause the problem: > >> It will *not* crash under debug. In fact, -O3 -fvect-cost-model=cheap will >> *not* crash. This makes some sense since it is crashing on an SSE instruction. >> >> I tried gcc versions (Ubuntu 7.2.0-8ubuntu3.2) and (Ubuntu 6.4.0-8ubuntu1) with >> the same result. I also tried with the mdb.master branch (0.9.70) with the same >> result. >> >> I'm not convinced this a fault in your code. It may be a gcc bug. > > Interesting. I've got gcc 5.4.0 on Ubuntu 16.04 here, and no crash. Also > rather puzzled that there's anything vectorizable in this code, that seems > pretty unlikely. > I take it back - was able to reproduce the crash with gcc 5.4.0. The crash is on this instruction 0x000000000040f26b <+4235>: add $0x1,%edi => 0x000000000040f26e <+4238>: movdqa (%rax,%rdx,1),%xmm1 0x000000000040f273 <+4243>: mov (%rsp),%rax 0x000000000040f277 <+4247>: paddw %xmm0,%xmm1 movdqa's description is "Move aligned packed integer values" and the values here are not aligned. (gdb) bt #0 0x000000000040f26e in mdb_cursor_put (mc=mc@entry=0x7fffffffdd00, key=key@entry=0x7fffffffe0e0, data=data@entry=0x7fffffffe1c0, flags=flags@entry=0) at mdb.c:7673 #1 0x0000000000411d64 in mdb_cursor_put (flags=0, data=0x7fffffffe1c0, key=0x7fffffffe0e0, mc=0x7fffffffdd00) at mdb.c:9867 #2 mdb_put (txn=0x61e680, dbi=2, key=key@entry=0x7fffffffe0e0, data=data@entry=0x7fffffffe1c0, flags=flags@entry=0) at mdb.c:9868 #3 0x0000000000401c13 in cause_crash () at its8819.c:53 #4 0x0000000000401991 in main (argc=<optimized out>, argv=<optimized out>) at its8819.c:72 (gdb) l 7668 memcpy(METADATA(mp), METADATA(fp), NUMKEYS(fp) * fp->mp_pad); 7669 } else { 7670 memcpy((char *)mp + mp->mp_upper + PAGEBASE, (char *)fp + fp->mp_upper + PAGEBASE, 7671 olddata.mv_size - fp->mp_upper - PAGEBASE); 7672 for (i=0; i<NUMKEYS(fp); i++) 7673 mp->mp_ptrs[i] = fp->mp_ptrs[i] + offset; 7674 } 7675 } 7676 7677 rdata = &xdata; In particular, the fp pointer is on an odd address. Basically the movdqa instruction is not valid for use here. -- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/
Summary of findings so far: Attached is a make script that applies the minimal set of optimizations (at least on gcc 7) that still triggers the seg fault. It fails the first time that address is executed, when NUMKEYS(fp) hits 10 or 11 (I *think* depending on the alignment of fp). It looks like the autovectorization logic is split in three parts: one unaligned non-SSE copy, an aligned copy that uses SSE, and then an unaligned non-SSE copy. It makes sense that the SSE logic path is only triggered when the size is greater than some threshold. We should coordinate filing a gcc bug report. Nic Watson On Fri, Mar 16, 2018 at 3:11 PM, <openldap-its@openldap.org> wrote: > > *** THIS IS AN AUTOMATICALLY GENERATED REPLY *** > > Thanks for your report to the OpenLDAP Issue Tracking System. Your > report has been assigned the tracking number ITS#8819. > > One of our support engineers will look at your report in due course. > Note that this may take some time because our support engineers > are volunteers. They only work on OpenLDAP when they have spare > time. > > If you need to provide additional information in regards to your > issue report, you may do so by replying to this message. Note that > any mail sent to openldap-its@openldap.org with (ITS#8819) > in the subject will automatically be attached to the issue report. > > mailto:openldap-its@openldap.org?subject=(ITS#8819) > > You may follow the progress of this report by loading the following > URL in a web browser: > http://www.OpenLDAP.org/its/index.cgi?findid=8819 > > Please remember to retain your issue tracking number (ITS#8819) > on any further messages you send to us regarding this report. If > you don't then you'll just waste our time and yours because we > won't be able to properly track the report. > > Please note that the Issue Tracking System is not intended to > be used to seek help in the proper use of OpenLDAP Software. > Such requests will be closed. > > OpenLDAP Software is user supported. > http://www.OpenLDAP.org/support/ > > -------------- > Copyright 1998-2007 The OpenLDAP Foundation, All Rights Reserved. >
Looks like another type aliasing problem to me. The data is accessed through an MDB_page* variable. This tells the compiler that the data is word-aligned, like struct MDB_page. Fix: Use a void/char pointer, don't lie to the compiler.
That's news to me. Then I googled it. You're right. From https://wiki.sei.cmu.edu/confluence/display/c/EXP36-C.+Do+not+cast+pointers+into+more+strictly+aligned+pointer+types The C Standard, 6.3.2.3, paragraph 7 [ISO/IEC 9899:2011], states A pointer to an object or incomplete type may be converted to a pointer to a different object or incomplete type. If the resulting pointer is not correctly aligned for the referenced type, the behavior is undefined. Nic On Tue, Mar 20, 2018 at 9:41 AM, Hallvard Breien Furuseth <h.b.furuseth@usit.uio.no> wrote: > Looks like another type aliasing problem to me. The data is accessed > through an MDB_page* variable. This tells the compiler that the data > is word-aligned, like struct MDB_page. Fix: Use a void/char pointer, > don't lie to the compiler. >
nic@nicwatson.org wrote: > That's news to me. Then I googled it. You're right. > >>From https://wiki.sei.cmu.edu/confluence/display/c/EXP36-C.+Do+not+cast+pointers+into+more+strictly+aligned+pointer+types > > The C Standard, 6.3.2.3, paragraph 7 [ISO/IEC 9899:2011], states > > A pointer to an object or incomplete type may be converted to a > pointer to a different object or incomplete type. If the resulting > pointer is not correctly aligned for the referenced type, the behavior > is undefined. > > Nic > > On Tue, Mar 20, 2018 at 9:41 AM, Hallvard Breien Furuseth > <h.b.furuseth@usit.uio.no> wrote: >> Looks like another type aliasing problem to me. The data is accessed >> through an MDB_page* variable. This tells the compiler that the data >> is word-aligned, like struct MDB_page. Fix: Use a void/char pointer, >> don't lie to the compiler. Good catch. We once discussed padding odd-length keys to make sure the data was still word-aligned. Maybe should do that in LMDB 1.0. This particular crash is now fixed in mdb.master. I've left other derefs of *fp alone for the moment but may need to revisit that later; older ARM and SPARC would probably choke on them. -- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/
changed notes changed state Open to Test moved from Incoming to Software Bugs
changed notes changed state Test to Release
fixed in mdb.master fixed in mdb.re09 (0.9.22)
changed notes changed state Release to Closed
On 20/03/2018 19:58, hyc@symas.com wrote: > We once discussed padding odd-length keys to make sure the data was still > word-aligned. Maybe should do that in LMDB 1.0. This particular crash is now > fixed in mdb.master. I've left other derefs of *fp alone for the moment but > may need to revisit that later; older ARM and SPARC would probably choke on them. Yes. Also, as this bug demonstrates, compilers will keep finding new ways to break over-aligned pointers even on x86. The way to make sure a compiler cannot deduce that a sub-page is 8- or 4- byte aligned, is to never create such over-aligned pointer values. I.e. pass something like struct MDB_pageinfo instead of MDB_page to anything which may receive a 2-byte-aligned sub-page: typedef struct MDB_pageinfo { uint16_t mi_pad, mi_flags; indx_t mi_lower, mi_upper; # define MI_OVPAGES(mi) (((unsigned)(mi)->mi_upper<<16) + (mi)->mi_lower) } MDB_pageinfo; typedef struct MDB_page { pgno_t mp_pgno; MDB_pageinfo mp_info; indx_t mp_ptrs[1]; } MDB_page;