On Wed, Dec 08, 1999 at 02:12:35PM +0000, ahu@casema.net wrote:

> > > Sounds like index corruption. I'm pretty sure we have a bug. Until we
> > > have time to look into it, I suggest you rebuild your indices.
>
> On a side note, this occurs with attributes that have multiple values (1 to
> 3 currently). Maybe that's related.

As the saying goes, we are in deep shit. We reindexed our LDAP database
using the following process:

 - stop all clients from writing to us
 - copy the raw data files
 - run ldbm2ldif on the copied id2entry
 - run ldif2index on the copy and feed it the ldif created by
   ldbm2ldif (with another slapd.conf, so it generates the index
   elsewhere)
 - stop slapd
 - copy the generated file (maildrop.gdbm) over the 'broken one' in
   /var/ldap
 - start slapd
 - sync the replicas by copying the files

We hoped that this would solve our failing substring searches, i.e.
'(maildrop=*,1234567-1@popstore2.casema.net,*)', which don't return
anything, though they should, because the entry is there.

However, after reindexing, *new* problems have popped up. Lots of queries
started failing. Problems can be solved by reading an object, deleting it,
and feeding it back to ldap. But this takes time. We are really desperate
at the moment and are trying to modify the ldap database as little as
possible.

We are running OpenLDAP-1.2.6-RELENG which a few days later became 1.2.7.
This is on Solaris 2.6 with the gdbm backend. Before upgrading to 1.2.8, we
want to be sure that this would help us. I read here that 1.2.8 was worse in
this respect.

Would a move to sleepycat be wise?

Regards,

bert hubert.

-- 
+---------------+  | http://www.rent-a-nerd.nl
| nerd for hire |  |
+---------------+  |     - U N I X -
|               |  | Inspice et cautus eris - D11T'95
At 03:32 PM 12/16/99 GMT, ahu@casema.net wrote:
>On Wed, Dec 08, 1999 at 02:12:35PM +0000, ahu@casema.net wrote:
>> > > Sounds like index corruption. I'm pretty sure we have a bug. Until we
>> > > have time to look into it, I suggest you rebuild your indices.
>>
>> On a side note, this occurs with attributes that have multiple values (1 to
>> 3 currently). Maybe that's related.
>
>As the saying goes, we are in deep shit. We reindexed our LDAP database
>using the following process:
>
> - stop all clients from writing to us
> - copy the raw data files

you should stop slapd before copying the database files. You only
need to copy id2entry. I recommend leaving it down (clients can
continue reading from replicas).

> - run ldbm2ldif on the copied id2entry

I assume you mean ldbmcat.

> - run ldif2index on the copy and feed it the ldif created by
>   ldbm2ldif (with another slapd.conf, so it generates the index

I suggest you run ldif2ldbm to rebuild the entire database.

>   elsewhere)
> - stop slapd

make sure all slurpd are down and all replogs are removed before
starting the master. Then for each replica: stop, replace db
with new, restart. Then, restart slurpd.

>We are running OpenLDAP-1.2.6-RELENG which a few days later became 1.2.7.
>This is on Solaris 2.6 with gdbm backend. Before upgrading to 1.2.8, we want
>to be sure if this would help us. I read here that 1.2.8 was worse in this
>respect.

I believe the enabling of DN substring indexing in 1.2.7 caused
some problems, it's disabled by default in 1.2.8.

>Would a move to sleepycat be wise?

I do not believe the problems are specific to gdbm nor bdb2.

----
Kurt D. Zeilenga <kurt@boolean.net>
Net Boolean Incorporated <http://www.boolean.net/>
On Thu, Dec 16, 1999 at 05:07:53PM +0000, kurt@boolean.net wrote:

> > - run ldif2index on the copy and feed it the ldif created by
> >   ldbm2ldif (with another slapd.conf, so it generates the index
>
> I suggest you run ldif2ldbm to rebuild the entire database.

That would be best. This wasn't an option until we recently discovered that
regenerating on Solaris /tmp is orders of magnitude faster than doing it on
a SCSI disk. Regenerating would've cost hours and hours (about 15, we
calculated).

> >to be sure if this would help us. I read here that 1.2.8 was worse in this
> >respect.
>
> I believe the enabling of DN substring indexing in 1.2.7 caused
> some problems, it's disabled by default in 1.2.8.

Ok, I'm now running an extra copy of slapd on another port, on another
dataset, and we've been doing tracing. The initial results indicate that
there is a problem with superblocks:

=> index_read( "maildrop" "*" "592" )
=> ldbm_cache_open( "/var/ldap2/maildrop.gdbm", 2, 600 )
<= ldbm_cache_open (cache 3)
<= index_read 166 candidates
=> index_read( "maildrop" "*" "92-" )
=> ldbm_cache_open( "/var/ldap2/maildrop.gdbm", 2, 600 )
<= ldbm_cache_open (cache 3)
<= index_read 364 candidates
=> index_read( "maildrop" "*" "2-1" )
=> ldbm_cache_open( "/var/ldap2/maildrop.gdbm", 2, 600 )
<= ldbm_cache_open (cache 3)
<= idl_fetch 3659 ids (3659 max)
<= index_read 3659 candidates
<= substring_comp_candidates 0
idl_free: called with NULL pointer
<= filter_candidates 0
<= list_candidates 0
<= filter_candidates 0
send_ldap_result 0::
ber_flush: 14 bytes to sd 5
0 0c 02 01 02 e 07 0a 01 00 04 00 04 00
listening for connections on 3, activity on: 5r
before select active_threads 0

Now, the number of entries returned by idl_fetch is correct: on a 'working'
copy of the database, a search on *2-1* does indeed return 3659 entries,
including the one we were searching for here. However, when idl_intersection
is called, it finds zero remaining objects.

Though this may be caused by errors earlier on, I find it suspicious that
the error occurs just after searching a superblock.

I'll continue debugging until I've found the problem; your input would be
appreciated. Any patches resulting from this will obviously be posted.

Regards,

bert.
On Thu, Dec 16, 1999 at 05:28:53PM +0000, ahu@casema.net wrote:

(I'm very sorry for flooding the list.. things aren't pretty out here and I
just hope that once I give people enough information, somebody's going to
say 'Ah!')

> => index_read( "maildrop" "*" "2-1" )
> => ldbm_cache_open( "/var/ldap2/maildrop.gdbm", 2, 600 )
> <= ldbm_cache_open (cache 3)
> <= idl_fetch 3659 ids (3659 max)
> <= index_read 3659 candidates
> <= substring_comp_candidates 0
> idl_free: called with NULL pointer
> <= filter_candidates 0
> <= list_candidates 0
> <= filter_candidates 0
> send_ldap_result 0::

When performing a search just on *2-1*, we get this:

=> index_read( "maildrop" "*" "2-1" )
=> ldbm_cache_open( "/var/ldap2/maildrop.gdbm", 2, 600 )
ldbm_cache_open (blksize 8192) (maxids 2046) (maxindirect 2)
<= ldbm_cache_open (opened 3)
<= idl_fetch 3659 ids (3659 max)
<= index_read 3659 candidates
<= substring_comp_candidates 3659
<= substring_candidates 3659
<= filter_candidates 3659
<= list_candidates 3659
<= filter_candidates 3659
=> id2entry_r( 5 )
=> ldbm_cache_open( "/var/ldap2/id2entry.gdbm", 2, 600 )
<= ldbm_cache_open (cache 1)
=> str2entry
<= str2entry 0x605c8
entry_rdwr_rlock: ID: 5
<= id2entry_r( 5 ) (disk)
=> test_filter
SUBSTRINGS
begin test_substring_filter

And now the funny part. It only outputs 2711 entries, instead of the 3659
it should.

Regards,

bert hubert.

>And now the funny part. It only outputs 2711 entries, instead of 3659 it
>should.

Apparently only 2711 of the 3659 candidates matched the filter.
How did the parameters of this search differ from those of the
search that returned 3659 entries?

----
Kurt D. Zeilenga <kurt@boolean.net>
Net Boolean Incorporated <http://www.boolean.net/>
On Thu, Dec 16, 1999 at 07:28:40PM +0000, kurt@boolean.net wrote:

> >And now the funny part. It only outputs 2711 entries, instead of 3659 it
> >should.
>
> Apparently only 2711 of the 3659 candidates matched the filter.
> What did the parameters of the search differ from that that
> returned 3659 entries?

Nothing I can make out. And I stared at it long enough :-) The filter was
identical to the key searched (i.e. (*2-1*)). It's impossible for the
entries to end in 2-1, nor can they begin with it. Each and every candidate
should therefore match.

Per your instructions, I am rebuilding the entire database now, with some
surprising results. Some index files are actually larger now than they are
in the production ldap! This means either that ldbmcat and ldif2ldbm mess
up, or that my ldap was seriously ill.

I upgraded from OpenLDAP 1.2.notmuch to OpenLDAP 1.2.6-RELENG without any
intervening reindexing. Was this wrong?

Once reindexing is finished, I'll report back here. Lots of thanks for the
help so far. I'll try and get people here enthusiastic again about
supporting the open source projects we depend upon. I hope my tiny patches
every now and then make up for some of the time I abuse from people here :-)

Regards,

bert hubert.
At 07:41 PM 12/16/99 GMT, ahu@casema.net wrote:
>I have upgraded from OpenLDAP 1.2.notmuch to OpenLDAP 1.2.6-RELENG without
>any intervening reindexing, was this wrong?

It's always a good idea to regenerate databases while upgrading...

----
Kurt D. Zeilenga <kurt@boolean.net>
Net Boolean Incorporated <http://www.boolean.net/>

On Thu, Dec 16, 1999 at 07:51:16PM +0000, kurt@boolean.net wrote:

> >I have upgraded from OpenLDAP 1.2.notmuch to OpenLDAP 1.2.6-RELENG without
> >any intervening reindexing, was this wrong?
>
> It's always a good idea to regenerate databases while upgrading...

It didn't help this time. I'm now trying openldap-1.2.8 on the current
database files to see if that helps. If it doesn't, I'll reindex using
ldif2ldbm from 1.2.8, and see if it works.

Regards,

bert hubert.

On Thu, Dec 16, 1999 at 07:53:50PM +0000, ahu@casema.net wrote:

> It didn't help this time. I'm now trying openldap-1.2.8 on the current
> database files to see if that helps. If it doesn't, I'll reindex using
> ldif2ldbm from 1.2.8, and see if it works.

:-(

I recreated the database with a scratch openldap-1.2.8 with ldif2ldbm from
the ldif created by the 1.2.7 ldbmcat, and the problem is still there.

Houston, we have a problem. I'll continue researching, but the cause is
still within openldap-1.2.8, either in ldif2ldbm (or ldif2index) or in
slapd....

Regards,

bert hubert.

At 09:30 PM 12/16/99 GMT, ahu@casema.net wrote:
>On Thu, Dec 16, 1999 at 07:53:50PM +0000, ahu@casema.net wrote:
>
>> It didn't help this time. I'm now trying openldap-1.2.8 on the current
>> database files to see if that helps. If it doesn't, I'll reindex using
>> ldif2ldbm from 1.2.8, and see if it works.
>
>:-(
>
>I recreated the database with a scratch openldap-1.2.8 with ldif2ldbm from
>the ldif created by the 1.2.7 ldbmcat, and the problem is still there.

You might try using 'ldbmcat -n' and then use ldapadd.

----
Kurt D. Zeilenga <kurt@boolean.net>
Net Boolean Incorporated <http://www.boolean.net/>
On Thu, Dec 16, 1999 at 04:20:05PM -0800, Kurt D. Zeilenga wrote:

> >I recreated the database with a scratch openldap-1.2.8 with ldif2ldbm from
> >the ldif created by the 1.2.7 ldbmcat, and the problem is still there.
>
> You might try using 'ldbmcat -n' and then use ldapadd.

Ok, I'm nearly there. An excerpt from the trace log (which has been expanded
a bit by additional logging). id 301 is the one we are looking for.

=> index_read( "maildrop" "*" "592" )
=> ldbm_cache_open( "/var/ldap2//maildrop.gdbm", 2, 600 )
<= ldbm_cache_open (cache 3)
<= index_read 166 candidates
ID_BLOCK_NIDS(a): 3
ID_BLOCK_NIDS(b): 166
[0]301 == [0]301
[102]31938 == [1]31876
[104]32765 == [2]32249
idl_intersection: 1 left
=> index_read( "maildrop" "*" "92-" )
=> ldbm_cache_open( "/var/ldap2//maildrop.gdbm", 2, 600 )
<= ldbm_cache_open (cache 3)
<= index_read 364 candidates
ID_BLOCK_NIDS(a): 1
ID_BLOCK_NIDS(b): 364
[4]301 == [0]301
idl_intersection: 1 left
=> index_read( "maildrop" "*" "2-1" )
=> ldbm_cache_open( "/var/ldap2//maildrop.gdbm", 2, 600 )
<= ldbm_cache_open (cache 3)
<= idl_fetch 3659 ids (3659 max) 2 blocks
=[0]5 =[1]14 =[2]48 =[3]57 =[4]71 =[5]76 =[6]77
(...)
=[2039]48606 =[2040]48625 =[2041]48626 =[2042]48628
=[2043]48633 =[2044]48644 =[2045]48664
=[2046]13 =[2047]16 =[2048]62 =[2049]85
=[2050]233 =[2051]249 =[2052]301 =[2053]360
(...)
=[3654]56210 =[3655]56223 =[3656]56234 =[3657]56235 =[3658]56260
<= index_read 3659 candidates
ID_BLOCK_NIDS(a): 1
ID_BLOCK_NIDS(b): 3659
[19]302 == [0]301
ni==0
<= substring_comp_candidates 0

This is wrong. Data has been inserted in the wrong block and because data is
now no longer in the right order, idl_intersection (which has been
optimized) fails.

I'm still investigating this further..

Regards,

bert hubert.
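The ordering break in the trace above (ID 48664 at index 2045 followed by ID 13 at index 2046) is the kind of thing that can be checked mechanically. A minimal sketch, assuming the IDs are available as a plain C array rather than slapd's ID_BLOCK structure; `first_unsorted` and `trace_excerpt` are hypothetical names for illustration, not OpenLDAP code:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of OpenLDAP): return the index of the first
 * ID that is not strictly greater than its predecessor, or -1 if the whole
 * list is in strictly ascending order. */
static long first_unsorted(const unsigned long *ids, size_t n)
{
    size_t i;

    assert(ids != NULL || n == 0);
    for (i = 1; i < n; i++) {
        if (ids[i] <= ids[i - 1]) {
            return (long)i;
        }
    }
    return -1;
}

/* Excerpt of the concatenated list from the trace: the tail of the first
 * block followed by the head of the second block. */
static const unsigned long trace_excerpt[] = {
    48633, 48644, 48664,                /* [2043]..[2045] */
    13, 16, 62, 85, 233, 249, 301, 360  /* [2046]..[2053] */
};
```

Run over the excerpt, the helper reports the position where the second block starts, which is where the list stops being ascending. Running the same check over a whole fetched list would show whether an index is affected without having to issue test searches.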
On Fri, Dec 17, 1999 at 12:45:53AM +0000, ahu@casema.net wrote:

> > You might try using 'ldbmcat -n' and then use ldapadd.
>
> Ok, I'm nearly there. An excerpt from the trace log (which has been expanded
> a bit by additional logging). id 301 is the one we are looking for.

I'm there :-)

If I understand correctly, something very silly is going on. If a direct idl
block is full, we split it, whereby every id smaller than the one we are
trying to insert goes to the 'lower' block, and everything above to the
'upper' block. We then insert an indirect block containing pointers to the
separate blocks.

Additional keys are always inserted 'in order' in a block. This way, when we
later concatenate all blocks mentioned in the indirect block, the keys are
in order, and idl_intersection can take advantage of this.

However, something else is also going on, see idl.c:

    /* insert the id */
    switch ( idl_insert( &tmp, id, db->dbc_maxids ) ) {
    case 0:     /* id inserted ok */
        if ( (rc = idl_store( be, db, k2, tmp )) != 0 ) {
            Debug( LDAP_DEBUG_ANY,
                "idl_store of (%s) returns %d\n", k2.dptr, rc, 0 );
        }
        break;

    case 1:     /* id inserted - first id in block has changed */
        /*
         * key for this block has changed, so we have to
         * write the block under the new key, delete the
         * old key block + update and write the indirect
         * header block.
         */

        rc = idl_change_first( be, db, key, idl, i, k2, tmp );
        break;

    case 2:     /* id not inserted - already there, do nothing */
        rc = 0;
        break;

    case 3:     /* id not inserted - block is full */
        /*
         * first, see if it will fit in the next block,
         * without splitting, unless we're trying to insert
         * into the beginning of the first block.
         */

        /* is there a next block? */

This shouldn't happen! This means that keys get inserted in blocks where
they don't belong, causing the straight concatenation of all blocks to be
unprocessable by idl_intersection:

    for ( ni = 0, ai = 0, bi = 0; ai < ID_BLOCK_NIDS(a); ai++ ) {
        for ( ;
            bi < ID_BLOCK_NIDS(b) && ID_BLOCK_ID(b, bi) < ID_BLOCK_ID(a, ai);
            bi++ )
        {
            ;   /* NULL */
        }

        if ( bi == ID_BLOCK_NIDS(b) ) {
            break;
        }

        if ( ID_BLOCK_ID(b, bi) == ID_BLOCK_ID(a, ai) ) {
            ID_BLOCK_ID(n, ni++) = ID_BLOCK_ID(a, ai);
        }
    }

(been mangled a bit by cutting & pasting from the screen)

This function assumes that all ids are in order so it can stop searching
very rapidly. This assumption is invalidated by the 'is there room in the
next block' optimization.

Now, this can be fixed by replacing the above snippet by:

    for ( ni = 0, ai = 0; ai < ID_BLOCK_NIDS(a); ai++ )
        for ( bi = 0; bi < ID_BLOCK_NIDS(b); bi++ )
        {
            if ( ID_BLOCK_ID(b, bi) == ID_BLOCK_ID(a, ai) ) {
                ID_BLOCK_ID(n, ni++) = ID_BLOCK_ID(a, ai);
            }
        }

This however is lots slower and doesn't scale very well eventually. It is
however the only way to fix the problem on an already indexed database. This
works instantaneously.

If you also apply the since merged fix to stop searching once
idl_intersection has cut down the selection to a reasonable number (instead
of continuing to intersect with single valued idls), this might be
manageable.

The true fix however is to disable the 'is there room in the next block'
optimization, or to make sure that it only places the id in the gap between
the end of the current block and the beginning of the next block.

This can be done by checking that the id that is going to be spliced into
the next block is larger than the largest id in the current block. We can
also be sure that the id will then be the first entry in the next block,
necessitating an idl_change_first() call.

I leave implementing this as an exercise for the reader. I've been debugging
for 16 hours straight and hope to get some sleep now :-)

With kind regards,

bert hubert.
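The difference between the two loops can be shown on a toy ID list. A minimal sketch, assuming plain arrays in place of slapd's ID_BLOCK macros; `intersect_sorted`, `intersect_any_order`, `demo_a` and `demo_b` are hypothetical names, and the data is made up to resemble the trace (a single candidate 301, and a list in which 301 appears after larger ids):

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long ID;

/* Merge-style intersection modelled on the quoted loop. It is only correct
 * when both a and b are in ascending order. out must have room for na IDs;
 * the return value is the number of IDs written. */
static size_t intersect_sorted(const ID *a, size_t na,
                               const ID *b, size_t nb, ID *out)
{
    size_t ai, bi = 0, ni = 0;

    assert(out != NULL || na == 0);
    for (ai = 0; ai < na; ai++) {
        while (bi < nb && b[bi] < a[ai]) {
            bi++;               /* skip IDs smaller than a[ai] */
        }
        if (bi == nb) {
            break;              /* b is exhausted */
        }
        if (b[bi] == a[ai]) {
            out[ni++] = a[ai];
        }
    }
    return ni;
}

/* Order-independent fallback: O(na * nb) comparisons. */
static size_t intersect_any_order(const ID *a, size_t na,
                                  const ID *b, size_t nb, ID *out)
{
    size_t ai, bi, ni = 0;

    assert(out != NULL || na == 0);
    for (ai = 0; ai < na; ai++) {
        for (bi = 0; bi < nb; bi++) {
            if (b[bi] == a[ai]) {
                out[ni++] = a[ai];
                break;          /* stop at the first match for a[ai] */
            }
        }
    }
    return ni;
}

/* Toy data shaped like the trace: b is two sorted runs concatenated out of
 * order, so 301 sits after larger IDs. */
static const ID demo_a[] = { 301 };
static const ID demo_b[] = { 5, 14, 302, 48664, 13, 301, 360 };
```

On `demo_b` the merge-style loop stops at 302, never reaches the 301 further on, and returns nothing, which mirrors the `[19]302 == [0]301` / `ni==0` lines in the trace. The nested loop finds it. The `break` in the fallback is an addition over the quoted replacement loop; it only avoids writing the same id twice if b happened to contain duplicates.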
On Fri, Dec 17, 1999 at 02:36:54AM +0000, ahu@casema.net wrote:

>     for ( ni = 0, ai = 0; ai < ID_BLOCK_NIDS(a); ai++ )
>         for ( bi = 0; bi < ID_BLOCK_NIDS(b); bi++ )
>         {
>             if ( ID_BLOCK_ID(b, bi) == ID_BLOCK_ID(a, ai) ) {
>                 ID_BLOCK_ID(n, ni++) = ID_BLOCK_ID(a, ai);
>             }
>         }
>
> This however is lots slower and doesn't scale very well eventually. It is
> however the only way to fix the problem on an already indexed database. This
> works instantaneously.

Ok, not quite true, someone could also write idl_sort which would take care
that concatenated blocks do get fetched in the right order.

Regards,

bert hubert.
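The idl_sort idea could look roughly like this. A minimal sketch, assuming the concatenated ids sit in a plain array; `idl_sort_sketch` and `id_cmp` are hypothetical names, no such function is shown anywhere in this thread, and the standard C `qsort` is used here simply because it is the shortest way to get the list back in order:

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned long ID;

/* qsort comparison callback: ascending order, written without subtraction
 * so that large unsigned values cannot overflow an int. */
static int id_cmp(const void *pa, const void *pb)
{
    ID a = *(const ID *)pa;
    ID b = *(const ID *)pb;

    return (a > b) - (a < b);
}

/* Hypothetical idl_sort: put a concatenated ID list back into ascending
 * order in place, so that a merge-style intersection works on it again. */
static void idl_sort_sketch(ID *ids, size_t n)
{
    assert(ids != NULL || n == 0);
    if (n > 1) {
        qsort(ids, n, sizeof ids[0], id_cmp);
    }
}
```

Since each block is already sorted internally, merging the blocks would do less work than a general sort; the sketch trades that for brevity. Either way the cost is paid on every fetch of an affected list, so this is a workaround for an already-built index rather than a substitute for fixing the insertion code.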
But all in all it would be best to fix the broken ID insertion code. If no
one else jumps up, I may have time to hit this in the next day or two.

  -- Howard Chu
  Chief Architect, Symas Corp.       Director, Highland Sun
  http://www.symas.com               http://highlandsun.com/hyc

> -----Original Message-----
> From: owner-openldap-bugs@OpenLDAP.org
> [mailto:owner-openldap-bugs@OpenLDAP.org]On Behalf Of ahu@casema.net
> Sent: Thursday, December 16, 1999 6:43 PM
> To: openldap-its@OpenLDAP.org
> Subject: Re: Trouble found: substring searches very broken (ITS#402)
>
> Ok, not quite true, someone could also write idl_sort which would
> take care that concatenated blocks do get fetched in the right order.
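For reference, the condition bert described earlier in the thread for a safe 'fit in the next block' step can be written down as a small predicate. This is only a sketch of that description, using plain arrays and a hypothetical name (`can_spill_to_next`); it is not the patch that was eventually applied to idl.c, which is not shown in this thread:

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long ID;

/* Hypothetical predicate: may id be placed at the front of the next block
 * without breaking the ascending order of the concatenated blocks?
 *   cur  - the full current block, ascending, ncur > 0
 *   next - the following block, ascending, possibly empty
 * Returns 1 if id falls strictly between the last id of cur and the first
 * id of next, 0 otherwise. */
static int can_spill_to_next(const ID *cur, size_t ncur,
                             const ID *next, size_t nnext, ID id)
{
    assert(cur != NULL && ncur > 0);

    if (id <= cur[ncur - 1]) {
        return 0;       /* belongs inside cur: spilling would misorder it */
    }
    if (nnext > 0 && id >= next[0]) {
        return 0;       /* would not become the first id of next */
    }
    return 1;
}
```

When the predicate holds, the id becomes the new first entry of the next block, which is the case bert noted would require an idl_change_first() call. When it does not hold, the full block would have to be split instead.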
On Fri, Dec 17, 1999 at 02:54:29AM +0000, hyc@highlandsun.com wrote:

> But all in all it would be best to fix the broken ID insertion code. If no
> one else jumps up, I may have time to hit this in the next day or two.

It's fine with me. I feel no urge to fix it currently, my 'itch has been
scratched'. I hope that I've been able to point people in the right
direction however.

This may also fix problems other people saw with failing queries. The
problem arises as soon as you have substring searches and indirect blocks.

Regards,

bert hubert.
OK, I have verified that my suggested patch actually fixes this problem. I was able to reproduce the original problem using unmodified 1.2.8 source, and the problem did not recur after patching idl.c. Thanks for your help in zeroing in on the source of the problem.
changed state Open to Test
moved from Incoming to Software Bugs
changed notes
On Mon, Dec 20, 1999 at 03:00:20PM +0000, Howard Chu wrote:

> OK, I have verified that my suggested patch actually fixes this problem. I
> was able to reproduce the original problem using unmodified 1.2.8 source,
> and the problem did not recur after patching idl.c. Thanks for your help
> in zeroing in on the source of the problem.

Any time. So far OpenLDAP has done more for me than the other way around :-)

I was wondering however if some kind of announcement should be made about
this. Many people will have problems with this without really knowing it,
while many queries fail. Not good.

The fix isn't simply to upgrade, you need to reindex before your problems
disappear, so there is no silent solution.

Regards,

bert hubert.
At 08:04 AM 12/21/99 GMT, ahu@casema.net wrote:
>I was wondering however if some kind of announcement should be made about
>this.

It's all over openldap-bugs mailing list.

>Many people will have problems with this without really knowing it,
>while many queries fail. Not good.

Many people do not subscribe to openldap-bugs. That's their problem.

>The fix isn't simply to upgrade, you need to reindex before your problems
>disappear, so there is no silent solution.

An appropriate upgrade notice will be placed in the release announcement.
changed notes
changed state Test to Release
changed notes
changed state Release to Closed
This bug affects attributes that have substring indexing turned on.

To reproduce the bug: regenerate the index for the specified attribute,
using ldif2ldbm, ldif2index, or slapindex.

The bug only occurs if there are too many IDs for a single ID_BLOCK, so
edit dbcache.c to set li->li_dbcache[i].dbc_maxids to a reasonably small
number before attempting to reproduce the problem. (I used 32, with a
6500-entry test database.) The bug can also only happen if the IDs being
indexed are not already in sorted order.

Both of these conditions must hold for the problem to occur, at which point
searches on the indexed attribute will not return the complete list of
expected results.