
Sync replication failure during startup.



OpenLDAP v. 2.3.32
Berkeley DB 4.6
gcc 4.1.0


Replication doesn't work if the master server is started after
the replica servers and a large number of simultaneous updates
are performed while the master is starting up.

The entries that didn't get replicated to the replicas will not
be replicated even after a restart of both master and replicas.
The contextCSN is set to a value larger than the entryCSN of the
"lost" entries.

This is what I think happens during a master server startup with
simultaneous updates ongoing (and replicas trying to sync in the
initial phase).

Suppose that two clients (Client1 and Client2) are adding the entries
a and b respectively. If both adds happen between t1 and t2 (one
second apart), they will get the same entryCSN (same timestamp). If
entry a is committed at tc1 and entry b at tc2, any replica search in
between will only get entry a. Entry b will be lost.

Client1       entry=a, csn=x  

Client2          entry=b, csn=x

Timeline ------+----------+---------+----+------>
                          |         |
               t1         |         |     t2=t1+1
                          |         |
                     tc1=entry a  tc2=entry b
                     committed    committed


                        Replica search query between tc1 and tc2.
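
The suspected sequence can be sketched as a small toy model in Python.
This is not OpenLDAP code. It assumes that a replica remembers the
highest CSN it has seen and that its next search only returns entries
with a strictly greater CSN. The function names and the timestamp
values are made up for illustration.

```python
# Toy model of the suspected race. Not OpenLDAP code.
# Assumption: the replica keeps the highest CSN it has seen as a cookie,
# and a later search returns only entries with a strictly greater CSN.

def search_newer_than(committed, cookie):
    """Return committed entries whose CSN is strictly greater than cookie."""
    return {name: csn for name, csn in committed.items()
            if cookie is None or csn > cookie}

def simulate(csn_a, csn_b):
    """Replay the timeline and return the entries the replica never got."""
    committed = {}   # entries visible on the master
    replica = {}     # entries the replica has received
    cookie = None    # highest CSN the replica has seen so far

    committed["a"] = csn_a                       # tc1: entry a committed
    found = search_newer_than(committed, cookie)
    replica.update(found)                        # search between tc1 and tc2
    cookie = max(found.values())

    committed["b"] = csn_b                       # tc2: entry b committed
    found = search_newer_than(committed, cookie)
    replica.update(found)                        # any later search

    return sorted(set(committed) - set(replica))

# Same timestamp for both entries (hypothetical value): b is never sent.
print(simulate("20070101120000Z", "20070101120000Z"))
# Distinct CSNs: nothing is missed.
print(simulate("20070101120000Z#000000", "20070101120000Z#000001"))
```

In this model the second search can never return entry b, because its
CSN is not greater than the value the replica already holds.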


I don't know whether a finer timestamp granularity would prevent this
or, even better, some kind of counter so that every modification gets
a unique csn.
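
The counter idea could look roughly like the following Python sketch.
The class name, the "timestamp#counter" layout and the injectable clock
are assumptions made for illustration; the layout is not claimed to
match the CSN format that OpenLDAP actually uses.

```python
import threading
import time

class CsnGenerator:
    """Hypothetical generator: one-second timestamp plus a per-second counter."""

    def __init__(self, clock=time.time):
        self._clock = clock            # injectable so the sketch can be tested
        self._lock = threading.Lock()  # serialize concurrent writers
        self._last_second = None
        self._counter = 0

    def next_csn(self):
        with self._lock:
            second = int(self._clock())
            if self._last_second is not None and second <= self._last_second:
                # Same second (or clock stepped back): reuse the last
                # timestamp and bump the counter to stay unique.
                second = self._last_second
                self._counter += 1
            else:
                self._counter = 0
            self._last_second = second
            stamp = time.strftime("%Y%m%d%H%M%S", time.gmtime(second))
            # Fixed-width counter keeps string comparison consistent
            # with generation order.
            return f"{stamp}Z#{self._counter:06x}"

gen = CsnGenerator()
first, second = gen.next_csn(), gen.next_csn()
assert first < second  # two modifications in the same second still differ
```

Whether unique values are enough also depends on commits becoming
visible in CSN order, which this sketch does not address.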

Could you please comment on our analysis and let us know whether it is
correct or whether we have missed something important?

Any help or hints on how to avoid or fix this problem would be greatly
appreciated.

If I receive useful information directly by private email, I will post
a summary.

Regards

Stelios Grigoriadis