
Re: (ITS#5454) syncrepl refreshAndPersist stops receiving



On Thu, 3 Apr 2008, rein@basefarm.no wrote:

> Our persistent syncrepl consumers stop receiving data after a while, with no
> indication of why :-(  They don't recognize restarts of the producer, so the
> only way to get the replication running again is to restart the consumers.
>
> So far it looks to me as if the syncrepl thread has managed to return without
> adding the connection socket back to slap_daemon.sd_readers.  But whether that
> is correct or not, and how it has managed to do it if so, I cannot tell.

Hm, I hope I have found the race condition that causes this :-)  I'm now
running with the patch at the end of this message to see whether it solves
the problem; only time will tell.

The race is this: a new message may arrive between the call to
connection_client_enable(), which re-enables selecting on the syncrepl
socket, and the release of si_mutex.  If that happens, the next call to
do_syncrepl may fail in its attempt to trylock the mutex, and nothing will
re-enable selecting on the socket afterwards.  My patch delays enabling the
socket until the mutex has been released, which looks safe to me.  Or can
the access to si->si_conn without a lock be a problem?
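
To make the ordering problem concrete, here is a small single-threaded
model of the race.  It is only a sketch, not slapd code: struct model,
listener_poll() and socket_alive_after_race() are hypothetical names, a
plain flag stands in for si_mutex, and it assumes that the listener stops
selecting on the socket when it dispatches the handler and that a handler
which loses the trylock returns without re-enabling the socket.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the race; every name here is hypothetical, not slapd API. */
struct model {
    bool locked;        /* stands in for si_mutex */
    bool sock_enabled;  /* is the listener selecting on the socket? */
    bool msg_pending;   /* unread data waiting on the socket */
    int  handled;       /* number of messages processed */
};

static bool model_trylock(struct model *m) {
    if (m->locked) return false;    /* busy: caller must give up */
    m->locked = true;
    return true;
}

static void model_unlock(struct model *m) {
    assert(m->locked);
    m->locked = false;
}

static void handle_message(struct model *m) {
    m->msg_pending = false;
    m->handled++;
}

/* One pass of the listener, including the handler invocation it triggers. */
static void listener_poll(struct model *m) {
    if (!(m->sock_enabled && m->msg_pending)) return;
    m->sock_enabled = false;          /* assumed: disabled while handler runs */
    if (!model_trylock(m)) return;    /* handler lost the trylock and returns */
    handle_message(m);
    model_unlock(m);
    m->sock_enabled = true;
}

/* Replays one handler run with a message arriving just before the unlock.
 * Returns whether the socket is still being selected on afterwards. */
bool socket_alive_after_race(bool enable_after_unlock, int *handled) {
    struct model m = { false, false, true, 0 };
    bool got_lock = model_trylock(&m);     /* first handler takes the lock */
    assert(got_lock);
    handle_message(&m);
    if (!enable_after_unlock)
        m.sock_enabled = true;             /* old order: enable under lock */
    m.msg_pending = true;                  /* message arrives in the window */
    listener_poll(&m);                     /* listener reacts immediately */
    model_unlock(&m);
    if (enable_after_unlock)
        m.sock_enabled = true;             /* patched order: enable last */
    listener_poll(&m);                     /* listener's next pass */
    *handled = m.handled;
    return m.sock_enabled;
}
```

Calling socket_alive_after_race(false, &n) replays the old ordering: the
second message is dispatched while the lock is still held, the trylock
fails, and the socket stays disabled.  With true (the patched ordering)
the message simply waits until the socket is enabled after the unlock and
is then processed normally.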

Rein

Index: OpenLDAP/servers/slapd/syncrepl.c
diff -u OpenLDAP/servers/slapd/syncrepl.c:1.9 OpenLDAP/servers/slapd/syncrepl.c:1.10
--- OpenLDAP/servers/slapd/syncrepl.c:1.9	Fri Mar 28 14:25:55 2008
+++ OpenLDAP/servers/slapd/syncrepl.c	Fri Apr  4 16:54:05 2008
@@ -1166,6 +1166,7 @@
  	ber_socket_t s;
  	int i, defer = 1, fail = 0;
  	Backend *be;
+	int enable_conn = 0;

  	Debug( LDAP_DEBUG_TRACE, "=>do_syncrepl %s\n", si->si_ridtxt, 0, 0 );

@@ -1271,11 +1272,7 @@
  				 * If we failed, tear down the connection and reschedule.
  				 */
  				if ( rc == LDAP_SUCCESS ) {
-					if ( si->si_conn ) {
-						connection_client_enable( si->si_conn );
-					} else {
-						si->si_conn = connection_client_setup( s, do_syncrepl, arg );
-					} 
+					enable_conn = 1;
  				} else if ( si->si_conn ) {
  					dostop = 1;
  				}
@@ -1342,6 +1339,14 @@
  	ldap_pvt_thread_mutex_unlock( &slapd_rq.rq_mutex );
  	ldap_pvt_thread_mutex_unlock( &si->si_mutex );

+	if ( enable_conn ) {
+		if ( si->si_conn ) {
+			connection_client_enable( si->si_conn );
+		} else {
+			si->si_conn = connection_client_setup( s, do_syncrepl, arg );
+		}
+	}
+
  	if ( rc ) {
  		if ( fail == RETRYNUM_TAIL ) {
  			Debug( LDAP_DEBUG_ANY,