[Date Prev][Date Next] [Chronological] [Thread] [Top]

Re: (ITS#5454) syncrepl refreshAndPersist stops receiving



Howard Chu skrev:
> rein@basefarm.no wrote:

>> Hm, I hope I have found the race condition that causes this :-)  I'm now
>> running with the patch at the end to see if that solves it, only time
>> will tell..

And we haven't seen any replication problems since this was implemented 
friday afternoon :-).  But it is still a bit too early to conclude that 
it has been fixed.

>> The race is that between the time selecting on the syncrepl socket is
>> enabled by the call to connection_client_enable() and the release of the
>> si_mutex a new message may arrive.  If so, the next call to do_syncrepl
>> may fail in its attempt to trylock the mutex and no-one will re-enable
>> selecting on it again.  My patch delays enabling of the socket until the
>> mutex has been released, which looks safe to me.  Or can the access to
>> si->si_conn without a lock be a problem?
> 
> How about just moving the enable to after the runqueue manipulation is
> done?

That would still leave a glitch in the window open, unless the initial 
trylock is changed to a regular lock as your checking message suggests.

But, doing that could result in all the threads ending up waiting for 
the lock if do_syncrepl of a refresh only replication is configured to 
run too often.  I assume the trylock is there to avoid this?  What about 
starting out with deferring the next scheduled call, as a scheduled call 
  is not wanted until the first has completed anyhow?  And a rescedule 
is already being done before do_syncrepl finishes.

Deferring should not be necessary if rtask->next_sched.tv_sec is zero, 
i.e it has already been deferred. That would remove these calls for all 
the cases when do_syncrepl is called as a result of a new message on a 
persistent syncrepl connection.

> Just need to be sure that do_syncrepl() isn't entered again before
> si->si_conn gets initialized.

I'm not quite sure what you mean here.  As I see it, the problem is that 
do_syncrepl() must not be called after si->si_conn is enabled and the 
lock is still held.

> It also occurs to me that we probably don't even need to manipulate the
> slapd runqueue in persist mode, when si->si_conn is already set. I.e.,
> in that case we can only have gotten here because of a listener event,
> and not because of a runqueue schedule.

That is probably true.

Rein