patch for running slurpd in oneshot mode with non-null status file



If you don't care for the grisly story, page down until you see the
context diff of the patch (or search for "Index:").

After disabling slurpd for a while, I decided I needed to have my replica
catch up, so I ran slurpd in oneshot mode.  Due to a network failure,
my connection to the remote host was lost, and no doubt slurpd was taken
down due to a HUP signal.

No problem, there's a slurpd.status file with about the right date on it
so I'm trying to restart slurpd to no avail.  Actually, I know I can
restart it by manually editing the slurpd.replog file and deleting
slurpd.status -- I have done this in the past.  But the whole point
of slurpd.status is to be able to restart it without that much fuss.

I have a single replica, and I get the following (with -d 65535, config
file parsing skipped):

    Retrieved state information for localhost:7389 (timestamp 956076526.0)
    begin replication thread for localhost:7389
    Replica localhost:7389, skip repl record for NEWSHOST=LOCALHOST,UID=7A8E,O=ECUNET.ORG (old)
    end replication thread for localhost:7389
    Processing in one-shot mode:
    308550 total replication records in file,
    0 replication records to process.
    slurpd: terminating normally

    real    3m49.057s
    user    0m39.361s
    sys     0m5.371s

Interestingly, I always get 0 replication records to process, even though
I know that all 308550 in fact apply (except for those already processed
due to time, which is a small fraction of this replog).

In OPENLDAP_REL_ENG_1_2, rq.c line 380, in function Rq_getcount(), can
anybody tell me why it does

            if ( type == RQ_COUNT_NZRC ) {
                if ( re->re_getrefcnt( re ) > 1 ) {
                    count++;
                }
            }

i.e., why compare > 1 instead of comparing > 0?  Changing it to > 0 gives
me an inaccurate (but at least non-zero) count.  It is inaccurate because
it does not take into account already processed records.
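
For illustration only, here is a stand-alone sketch (not slurpd code) of
how the threshold changes the count.  It assumes that a record's reference
count is the number of replicas that still reference it, which is consistent
with the single-replica output above but is not verified against the source.
The names entry, count_above and demo_single_replica are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a replog entry; NOT the real slurpd Re type. */
struct entry {
    int refcnt;             /* assumed: replicas still referencing this entry */
    struct entry *next;
};

/* Count the entries whose reference count is strictly above threshold. */
static int count_above(const struct entry *head, int threshold)
{
    int count = 0;

    assert(threshold >= 0);
    for (const struct entry *e = head; e != NULL; e = e->next) {
        if (e->refcnt > threshold) {
            count++;
        }
    }
    return count;
}

/* Three queued entries, each referenced by exactly one replica. */
static int demo_single_replica(int threshold)
{
    struct entry c = { 1, NULL };
    struct entry b = { 1, &c };
    struct entry a = { 1, &b };

    return count_above(&a, threshold);
}
```

Under that assumption, demo_single_replica(1) counts nothing and
demo_single_replica(0) counts all three entries, which mirrors the
0 versus 308550 figures in the two runs.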

More important is the question of why slurpd terminates after processing
one already-processed record of the replog.

Here's my gdb session on it:

60          while ( !sglob->slurpd_shutdown &&
(gdb) n
61                  (( re = rq->rq_gethead( rq )) == NULL )) {
(gdb) print re
$5 = (Re *) 0x0
(gdb) n
60          while ( !sglob->slurpd_shutdown &&
(gdb) n
70          rq->rq_unlock( rq );
(gdb) n
71          while ( !sglob->slurpd_shutdown ) {
(gdb) n
72              if ( re != NULL ) {
(gdb) list
67           * When we get here, there's work in the queue, and we have the
68           * queue locked.  re should be pointing to the head of the queue.
69           */
70          rq->rq_unlock( rq );
71          while ( !sglob->slurpd_shutdown ) {
72              if ( re != NULL ) {
73                  if ( !ismine( ri, re )) {
74                      /* The Re doesn't list my host:port */
75                      Debug( LDAP_DEBUG_TRACE,
76                              "Replica %s:%d, skip repl record for %s (not mine)\n",
(gdb) n
73                  if ( !ismine( ri, re )) {
(gdb) n
78                  } else if ( !isnew( ri, re )) {
(gdb) n
80                      Debug( LDAP_DEBUG_TRACE,
(gdb) n
Replica localhost:7389, skip repl record for NEWSHOST=LOCALHOST,UID=7A8E,O=ECUNET.ORG (old)
113             rq->rq_lock( rq );
(gdb) n
114             while ( !sglob->slurpd_shutdown &&
(gdb) n
115                     ((new_re = re->re_getnext( re )) == NULL )) {
(gdb) n
114             while ( !sglob->slurpd_shutdown &&
(gdb) n
116                 if ( sglob->one_shot_mode ) {
(gdb) n
117                     rq->rq_unlock( rq );
(gdb) n
118                     return 0;
(gdb) print new_re
$6 = (Re *) 0x0
(gdb) n
131     }
(gdb) n
replicate (ri_arg=0x140014280) at replica.c:42
42          Debug( LDAP_DEBUG_ARGS, "end replication thread for %s:%d\n",
(gdb) c
Continuing.
end replication thread for localhost:7389





And thus ends the replication thread.
After this, the process continues to run, apparently loads the replog,
because it grows to 678MB, and then:

Processing in one-shot mode:
308550 total replication records in file,
308550 replication records to process.
slurpd: terminating normally

Program exited normally.


Only problem, of course, is that no transactions get replicated to the
slave: the replication thread has already exited by the time the replog
is loaded.  I am testing the following changes as a fix to this.  Basically,
I create the fm thread first.  Then, if in oneshot mode, wait for this
thread to terminate.  Then create the replication threads.  Then, if not
in oneshot mode, wait for the fm thread to terminate.  Then proceed as
normal.
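
The ordering described above can be sketched in isolation as follows.  This
is a minimal illustration using plain POSIX threads rather than the
ldap_pvt_thread wrappers, and loader, worker, run_one_shot and queue_len are
hypothetical names, not slurpd identifiers.  The loader plays the role of
the fm thread and the worker plays the role of a replica thread that handles
whatever is queued and then exits.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static int queue_len = 0;   /* number of queued records, guarded by lock */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the fm thread: queues *arg records. */
static void *loader(void *arg)
{
    int n = *(int *)arg;

    pthread_mutex_lock(&lock);
    queue_len = n;
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Stand-in for a one-shot replica thread: records how much work it found. */
static void *worker(void *arg)
{
    int *seen = arg;

    pthread_mutex_lock(&lock);
    *seen = queue_len;
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Returns how many records the worker saw, or -1 if a thread failed to start. */
static int run_one_shot(int nrecords)
{
    pthread_t loader_tid, worker_tid;
    int seen = 0;

    assert(nrecords >= 0);
    pthread_mutex_lock(&lock);
    queue_len = 0;
    pthread_mutex_unlock(&lock);

    if (pthread_create(&loader_tid, NULL, loader, &nrecords) != 0) {
        return -1;
    }
    /* One-shot mode: wait for the loader before starting any worker. */
    pthread_join(loader_tid, NULL);

    if (pthread_create(&worker_tid, NULL, worker, &seen) != 0) {
        return -1;
    }
    pthread_join(worker_tid, NULL);
    return seen;
}
```

Because the loader is joined before the worker is created, the worker
always finds the full queue; starting the worker first would let it see an
empty queue and exit, which is the behaviour shown in the gdb session.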

Also, the call to ldap_pvt_thread_initialize() and its comment above
appear to be indented more than they should be, and I've unindented them
one level.

Here's a context diff.  If there are no objections, I'll commit it.

Randy

Index: main.c
===================================================================
RCS file: /repo/OpenLDAP/pkg/ldap/servers/slurpd/main.c,v
retrieving revision 1.4.2.5.2.2
diff -c -r1.4.2.5.2.2 main.c
*** main.c	2000/04/24 15:03:23	1.4.2.5.2.2
--- main.c	2000/04/24 19:58:46
***************
*** 101,117 ****
  #endif /* LDAP_DEBUG */
  	lutil_detach( 0, 0 );
  
! 	/* initialize thread package */
! 	ldap_pvt_thread_initialize();
  
      /*
-      * Start threads - one thread for each replica
-      */
-     for ( i = 0; sglob->replicas[ i ] != NULL; i++ ) {
- 	start_replica_thread( sglob->replicas[ i ]);
-     }
- 
-     /*
       * Start the main file manager thread (in fm.c).
       */
      if ( ldap_pvt_thread_create( &(sglob->fm_tid),
--- 101,110 ----
  #endif /* LDAP_DEBUG */
  	lutil_detach( 0, 0 );
  
!     /* initialize thread package */
!     ldap_pvt_thread_initialize();
  
      /*
       * Start the main file manager thread (in fm.c).
       */
      if ( ldap_pvt_thread_create( &(sglob->fm_tid),
***************
*** 124,132 ****
      }
  
      /*
       * Wait for the fm thread to finish.
       */
!     ldap_pvt_thread_join( sglob->fm_tid, (void *) NULL );
  
      /*
       * Wait for the replica threads to finish.
--- 117,141 ----
      }
  
      /*
+      * wait for fm to finish if in oneshot mode
+      */
+     if ( sglob->one_shot_mode ) {
+ 	ldap_pvt_thread_join( sglob->fm_tid, (void *) NULL );
+     }
+ 
+     /*
+      * Start threads - one thread for each replica
+      */
+     for ( i = 0; sglob->replicas[ i ] != NULL; i++ ) {
+ 	start_replica_thread( sglob->replicas[ i ]);
+     }
+ 
+     /*
       * Wait for the fm thread to finish.
       */
!     if ( !sglob->one_shot_mode ) {
! 	ldap_pvt_thread_join( sglob->fm_tid, (void *) NULL );
!     }
  
      /*
       * Wait for the replica threads to finish.
Index: rq.c
===================================================================
RCS file: /repo/OpenLDAP/pkg/ldap/servers/slurpd/rq.c,v
retrieving revision 1.5.2.2.2.2
diff -c -r1.5.2.2.2.2 rq.c
*** rq.c	2000/04/24 14:46:13	1.5.2.2.2.2
--- rq.c	2000/04/24 19:58:46
***************
*** 377,383 ****
  	for ( re = rq->rq_gethead( rq ); re != NULL;
  		re = rq->rq_getnext( re )) {
  	    if ( type == RQ_COUNT_NZRC ) {
! 		if ( re->re_getrefcnt( re ) > 1 ) {
  		    count++;
  		}
  	    }
--- 377,383 ----
  	for ( re = rq->rq_gethead( rq ); re != NULL;
  		re = rq->rq_getnext( re )) {
  	    if ( type == RQ_COUNT_NZRC ) {
! 		if ( re->re_getrefcnt( re ) > 0 ) {
  		    count++;
  		}
  	    }