
Re: Please test RE24



Rein Tollevik wrote:
To me it looks more as if the extended test050 has triggered race
conditions that were already there, and that the syncprov half of
ITS#5973 in particular has made them more likely to show up.

I have run the current test050 script with the 2.4.15 source (which
didn't include these patches), and with RE24 (as of two days ago)
without ITS#5973, and have seen the same type of failures there.  Also,
had the problems been triggered by the consumers receiving NEW_COOKIE
messages then I would have expected to see "too old" messages on the
consumers when they ignore entries.  Instead, I find no trace of the
missing entries ever being passed on from the provider.  But where the
update is lost I haven't found out yet.  The problem seems to occur when
the server where entries are missing receives its updates from one of
the other consumers (i.e., not directly from server1).  But whether it is
syncrepl on this intermediate server that fails to pass it on to
syncprov, or syncprov that loses them, I don't know.

Also, I now have around 30 core files similar to the one in ITS#5999,
and I have also had a number of cases where I had to kill -9 a slapd
running in a tight unlock, yield, lock loop at the same place in
syncprov_op_mod().  These loops have all happened when slapd should be
stopping, and the mt structure looks just as invalid as in the segfault
cases.  I have no idea whether this has anything to do with
the test050 failures or not.

Btw, all of the test050 failures I have seen due to missing replications
have taken place immediately after the initial loading of the consumers
from server1.  This could be a coincidence, but I have had enough of them
to start wondering...

Yes, I've seen the same. My suspicion now is that it's due to an update arriving at the consumer around the time it transitions from refresh to persist mode, but I haven't been able to isolate it. I also note that adding a SLEEP1 near the beginning of test050, after the consumers have been started but before the ldapadd to populate the provider, completely eliminated the problem. So there's definitely an issue there that needs to be tracked down.
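
Roughly, the SLEEP1 workaround amounts to something like the following. This is only a sketch, not the actual test050 change: the helper name wait_for_consumers is made up for illustration, and the 7-second default for SLEEP1 is an assumption.

```sh
#!/bin/sh
# Sketch only: a delay between starting the consumers and populating
# the provider. The default of 7 seconds is an assumption; the test
# environment may already define SLEEP1 differently.
SLEEP1=${SLEEP1-7}

# Hypothetical helper (not part of the real test script).
wait_for_consumers() {
    echo "Waiting $SLEEP1 seconds for consumers to finish their initial refresh..."
    sleep "$SLEEP1"
}
```

The helper would be called once, after the last consumer slapd has been started and before the ldapadd that populates the provider. Note that this only hides the race; it does not fix it.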


I've also seen the op_mod spin during shutdown. Unfortunately with the rest of the state already destroyed we can't identify what led to it. Seems we need to run the test a few times without restarting the servers to track that.

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/