[Date Prev][Date Next] [Chronological] [Thread] [Top]

Re: (ITS#5439) back-bdb race condition



Rein Tollevik wrote:
> On Tue, 25 Mar 2008, hyc@symas.com wrote:
> It now looks to me as if the entire rs->sr_entry was released an reused,
> and that the bug probably is in back-bdb.  It just always happened when
> syncprov was running as this is a master server with mostly writes
> except from the reads syncprov does.  Which also means that the title on
> the bug report is probably misleading :-(
>
> The reason for my shift of focus is some highly suspicious variables
> found in the bdb_search() frame.

Look at the value of idflag in bdb_search. (see the comment around 672)

Could be a locking bug with ID_NOCACHE in cache.c...

> First, ei->bei_e seem to be NULL every time this happen, not the same as
> the "e" variable as I would except from looking at the code.  Second,
> e->e_id is either 0 or ei->bei_id+1, the content of *e is the (more or
> less completely initialized) entry following the *ei entry in the
> database, while I would expect that ei->bei_id==e->e_id and that
> ei->bei_e==e.  I don't think the consecutive ID numbers or entries has
> anything with the bug to do though, they are probably a result of the
> searches reading the entire database.
>
> I looks to me as if the EntryInfo referred to by ei was released somehow
> when slap_send_search_entry() was running, and that the ei->bei_e passed
> as rs->rs_entry was first released and then reused for the next entry.
> Alternatively that the structures wasn't properly initialized and
> thereby appeared free to the second thread.
>
> I have found a core file that adds to my suspicions, this time with two
> active threads.  Their stack frames are shown at the end.  In this core
> file the id variable in the bdb_search() frame is 9318, the second
> thread is loading id==9319.  The value of the "e" variable in
> bdb_search() for the first thread equals the "x" variable of the
> entry_decode() frame in the second thread.  Since entry_decode() has
> just allocated its "x" value it shouldn't have found the same value as
> the other thread has unless someone released its "e" entry while it was
> still using it.  Why this happened I haven't figured out yet.
>
> The cachsesize directive has its default value, which is way below the
> optimal value.  The result is that the cache is always full.  I don't
> know if that can have any influence on the bug?  I tried to lower its
> value to see if that should help reproducing the problem, but not so
> far.
>
>>> I'm currently trying to gather more information related to this bug, any
>>> pointers as to what I should look for is appreciated.  I'm posting this bug
>>> report now in the hope that the stack frame should enlighten someone with better
>>> knowledge of the code than what I have.
>> Check for stack overruns, compile without optimization and make sure it's not
>> a compiler optimization bug, etc.
>
> I'll try to run it with a version compiled with only -g and see if the
> same thing should happen then.
>


-- 
   -- Howard Chu
   Chief Architect, Symas Corp.  http://www.symas.com
   Director, Highland Sun        http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP     http://www.openldap.org/project/