[Date Prev][Date Next]
Re: (ITS#5439) back-bdb race condition
Rein Tollevik wrote:
> On Tue, 25 Mar 2008, firstname.lastname@example.org wrote:
> It now looks to me as if the entire rs->sr_entry was released an reused,
> and that the bug probably is in back-bdb. It just always happened when
> syncprov was running as this is a master server with mostly writes
> except from the reads syncprov does. Which also means that the title on
> the bug report is probably misleading :-(
> The reason for my shift of focus is some highly suspicious variables
> found in the bdb_search() frame.
Look at the value of idflag in bdb_search. (see the comment around 672)
Could be a locking bug with ID_NOCACHE in cache.c...
> First, ei->bei_e seem to be NULL every time this happen, not the same as
> the "e" variable as I would except from looking at the code. Second,
> e->e_id is either 0 or ei->bei_id+1, the content of *e is the (more or
> less completely initialized) entry following the *ei entry in the
> database, while I would expect that ei->bei_id==e->e_id and that
> ei->bei_e==e. I don't think the consecutive ID numbers or entries has
> anything with the bug to do though, they are probably a result of the
> searches reading the entire database.
> I looks to me as if the EntryInfo referred to by ei was released somehow
> when slap_send_search_entry() was running, and that the ei->bei_e passed
> as rs->rs_entry was first released and then reused for the next entry.
> Alternatively that the structures wasn't properly initialized and
> thereby appeared free to the second thread.
> I have found a core file that adds to my suspicions, this time with two
> active threads. Their stack frames are shown at the end. In this core
> file the id variable in the bdb_search() frame is 9318, the second
> thread is loading id==9319. The value of the "e" variable in
> bdb_search() for the first thread equals the "x" variable of the
> entry_decode() frame in the second thread. Since entry_decode() has
> just allocated its "x" value it shouldn't have found the same value as
> the other thread has unless someone released its "e" entry while it was
> still using it. Why this happened I haven't figured out yet.
> The cachsesize directive has its default value, which is way below the
> optimal value. The result is that the cache is always full. I don't
> know if that can have any influence on the bug? I tried to lower its
> value to see if that should help reproducing the problem, but not so
>>> I'm currently trying to gather more information related to this bug, any
>>> pointers as to what I should look for is appreciated. I'm posting this bug
>>> report now in the hope that the stack frame should enlighten someone with better
>>> knowledge of the code than what I have.
>> Check for stack overruns, compile without optimization and make sure it's not
>> a compiler optimization bug, etc.
> I'll try to run it with a version compiled with only -g and see if the
> same thing should happen then.
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/