Re: CSN Too Old potential Bug



Burton, Kris - Acision wrote:
All,

I want to ask the list about this before I try to open an ITS to make
sure that I am understanding everything correctly. We are running
OpenLDAP 2.4.11. I selectively backported ITS 5709 to our
source, because we were losing replications. Applying this seemed to
help and reduced the number of lost replications. We are running in
mirror mode using refreshAndPersist, and doing a high volume of adds to
the master, on the order of 100/s. We have run numerous iterations of
the same test with very aggressive NTP updates, which I saw recommended
as a possible solution in a previous message thread. These keep both the
master and consumer within 50 microseconds of one another. This
seemed to make little to no difference in the replication loss.

If you're actually using MirrorMode, with all writes going to only one server, then NTP doesn't really matter. The time synchronization is only important when reconciling concurrent updates that occurred on different servers. I.e., it's only important when you're running multimaster (as opposed to mirrormode), and for reconciling any updates that occurred while a MirrorMode failover was happening. From the sounds of it, your test doesn't trigger these criteria.
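
To illustrate why clock agreement only matters with more than one writer, here is a minimal sketch that compares CSNs field by field (timestamp, change count, server ID, mod number). The CSN values and the helper names parse_csn and newest_wins are made up for the example; this is not slapd's code.

```python
def parse_csn(csn):
    """Split a CSN string into (timestamp, change count, server ID, mod number)."""
    timestamp, count, sid, mod = csn.split("#")
    # The timestamp is fixed-width, so plain string comparison orders it correctly.
    return timestamp, int(count, 16), int(sid, 16), int(mod, 16)


def newest_wins(csn_a, csn_b):
    """Return whichever CSN sorts later; that write would win reconciliation."""
    return max(csn_a, csn_b, key=parse_csn)


# Hypothetical values. Single writer (SID 001): both stamps come from one
# clock, so their order is consistent no matter what the consumer's clock says.
first = "20090120101500.000000Z#000000#001#000000"
second = "20090120101500.000000Z#000001#001#000000"
print(newest_wins(first, second) == second)  # True

# Two writers: suppose server 002 wrote last in real time, but its clock runs
# half a second slow. Its CSN sorts earlier, so server 001's write wins.
from_001 = "20090120101500.000000Z#000000#001#000000"
from_002 = "20090120101459.500000Z#000000#002#000000"
print(newest_wins(from_001, from_002) == from_001)  # True
```

In the single-writer case the consumer never has to weigh its own clock against the provider's, which is why tightening NTP would not be expected to change the result of this test.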


From looking at the code, I was thinking that the lost replications
might be due to entries being queued on the master side in non-ascending
order, which I was seeing just before each replication that was
rejected on the consumer side. What I thought was happening is that the
logic that traverses the queue to mark committed CSNs and update the
contextCSN was getting out of sync because of this, and orphaning
replications that were still pending: they are rejected as too old, but
in reality they have never been added to the consumer.

Looking at your debug info, this sounds likely. Yes, please submit this info to the ITS.
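
The suspected failure can be sketched with a toy model. It assumes a consumer that keeps a single contextCSN and rejects anything not newer than it. CSNs are reduced to plain integers and the function name replay is made up; this only illustrates the hypothesis and is not the actual syncprov/syncrepl logic.

```python
def replay(arrival_order):
    """Apply changes in arrival order, rejecting any CSN that is not newer
    than the highest CSN applied so far (the toy contextCSN)."""
    context_csn = 0
    applied, rejected = [], []
    for csn in arrival_order:
        if csn <= context_csn:
            # The consumer would log "CSN too old" here, even though this
            # change was never applied.
            rejected.append(csn)
        else:
            applied.append(csn)
            context_csn = csn
    return applied, rejected


# Queued in ascending order: every change is applied.
print(replay([1, 2, 3, 4]))  # ([1, 2, 3, 4], [])

# CSN 3 is queued ahead of CSN 2: the contextCSN jumps to 3 and change 2
# is dropped as too old.
print(replay([1, 3, 2, 4]))  # ([1, 3, 4], [2])
```

If the hypothesis is right, each rejected change on the consumer should line up with a non-ascending pair in the master's queue, which is the correlation described above.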


I just pulled the latest code from RE24 and reran the test. The latest
code is better than 2.4.11 with just the backport of 5709,
but we are still losing a small percentage of the replications with the
“CSN too old” message. With the latest code I am still seeing a
correlation between the out-of-order queuing on the master and the
replications that are rejected on the consumer.

-- 
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/