
Re: Please test RE24

Howard Chu wrote:
Howard Chu wrote:
In at least one case I'm seeing a valid
update being rejected because the incoming cookie seems to have been confused
with another one. This happens when a NEW_COOKIE message is received. I'll
note that sending NEW_COOKIE messages is a recent change (ITS#5972), and there
is no valid case for them to be occurring in test050. I.e., NEW_COOKIE should
be sent in a partial replication situation, where an entry changed in the
naming context but it's not within the consumer's scope of interest. In
test050, the consumer's scope of interest is the entire naming context. So
this at least gives me one area to look for a fix.

I agree, in an MMR configuration NEW_COOKIE messages should not have been sent, except possibly when the entire CSN set is updated at the end of a refresh phase. But it looks more and more to me as if the fact that test050 does show these messages is a symptom of some entry updates being ignored by syncprov, or not passed to syncprov by syncrepl.

This piece of the ITS#5972 patch is part of the problem
--- syncprov.c    5 Mar 2009 16:53:01 -0000    1.266
+++ syncprov.c    12 Mar 2009 08:42:54 -0000
@@ -1245,7 +1245,7 @@
         } else if ( !saveit && found ) {
             /* send DELETE */
             syncprov_qresp( opc, ss, LDAP_SYNC_DELETE );
-        } else if ( !saveit ) {
+        } else if ( !saveit && !fc.fscope ) {
             syncprov_qresp( opc, ss, LDAP_SYNC_NEW_COOKIE );
         if ( !saveit && found ) {

My diff above is also not the correct fix, which is why I haven't committed it yet.

The current operation may not have been caught by the previous if conditions for 3 reasons:
1) the change is out of the consumer's scope
2) the change doesn't match the consumer's filter
3) the change is older than the consumer's cookie

The NEW_COOKIE message must only be sent for conditions 1 and 2, but it's currently also being sent for 3. Since the cookie comparison is tacked onto the consumer's filter, an additional comparison is needed to weed this out.
(Normally 3 can't be true, but this is MMR where the consumer might have already received this change from some other provider.)
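
The three cases above can be sketched as a small decision function. This is only an illustration of the rule as stated (NEW_COOKIE for cases 1 and 2, nothing for case 3); the struct, enum and function names are hypothetical and do not come from syncprov.c, and the real code derives these conditions from the search scope and filter rather than from ready-made flags.

```c
#include <stdbool.h>

/* Hypothetical names for illustration only; not taken from syncprov.c. */
enum sync_action { SEND_NOTHING, SEND_NEW_COOKIE };

struct change_check {
    bool in_scope;          /* change lies within the consumer's scope */
    bool matches_filter;    /* change matches the consumer's own filter */
    bool newer_than_cookie; /* change is newer than the consumer's cookie */
};

/* Decide what to do with a change that none of the earlier branches
 * (ADD/MODIFY/DELETE) picked up. */
static enum sync_action unmatched_change_action(const struct change_check *c)
{
    /* Case 3: the consumer already has this change, e.g. from another
     * provider in an MMR setup. No NEW_COOKIE under the rule above. */
    if (!c->newer_than_cookie)
        return SEND_NOTHING;

    /* Cases 1 and 2: out of scope or filtered out, but newer than the
     * cookie, so the consumer only needs the updated cookie. */
    if (!c->in_scope || !c->matches_filter)
        return SEND_NEW_COOKIE;

    /* In scope, matching and newer: assumed to have been handled by an
     * earlier branch, so nothing is sent from here. */
    return SEND_NOTHING;
}
```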

Syncprov generally doesn't know the exact state of its consumers in MMR configurations, since the consumers' CSNs could have been updated by one of the other providers. So the NEW_COOKIE messages should be sent in all three cases, leaving the job of filtering out the too-old CSNs to the one that has enough information to do so, namely the consumer.

I haven't looked yet, but I suspect there is a corresponding bug in the consumer where it acts on a NEW_COOKIE message whether it's valid or not.

No, the consumer silently ignores updates to CSN values older than (or equal to) the values it already knows about.
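
A minimal sketch of that consumer-side check, not the actual syncrepl code. It assumes both CSNs come from the same server ID and use the same fixed-width textual layout, so that a byte-wise string comparison orders them; the function name csn_is_new and the sample values in the usage note are made up for illustration.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: returns true only if 'incoming' is strictly newer
 * than 'known'. Assumes both strings are CSNs for the same server ID in
 * the same fixed-width format, so strcmp() gives the right ordering. */
static bool csn_is_new(const char *known, const char *incoming)
{
    if (known == NULL)
        return true;    /* nothing recorded yet for this server ID */

    /* Older or equal values are ignored, as described above. */
    return strcmp(incoming, known) > 0;
}
```

With a known value of "20090312084254.000000Z#000000#001#000000", an incoming value with the same text or an earlier timestamp is rejected, and only a later one is accepted.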

I'm also inclined to back out #5972 and its related patches (#5973, #6001) for
this release. We were looking for bug fixes and stability, and they've been quite destabilizing.

To me it looks more as if the extended test050 has triggered race conditions that were already there, and that the syncprov half of ITS#5973 in particular has made them more likely to show up.

I have run the current test050 script with the 2.4.15 source (which didn't include these patches), and with RE24 (as of two days ago) without ITS#5973, and have seen the same type of failures there. Also, had the problems been triggered by the consumers receiving NEW_COOKIE messages, I would have expected to see "too old" messages on the consumers when they ignore entries. Instead, I find no trace of the missing entries ever being passed on from the provider. But where the update is lost I haven't found out yet. The problem seems to occur when the server where entries are missing receives its updates from one of the other consumers (i.e., not directly from server1). But whether it is syncrepl on this intermediate server that fails to pass them on to syncprov, or syncprov that loses them, I don't know.

Also, I now have around 30 core files similar to the one in ITS#5999, and I have also had a number of cases where I had to kill -9 a slapd running in a tight unlock, yield, lock loop at the same place in syncprov_op_mod(). These loops have all happened when slapd should be stopping, and the mt structure looks just as invalid as in the segfault cases. I have no idea whether this has anything to do with the test050 failures or not.

Btw, all of the test050 failures I have seen due to missing replications have taken place immediately after the initial loading of the consumers from server1. This could be a coincidence, but I have had enough of them to start wondering...