Syncrepl behavior when disk is full



I am using openldap-2.4.11 with syncrepl n-way multimaster replication and I am seeing some strange behavior if one of the masters runs out of disk space.  I would expect this to be handled much the same way as a master being offline, but it behaves differently.

We are doing some failure testing on our new ldap infrastructure to catch problems before they happen and to improve our monitoring.  One of the test cases I came up with is what happens on a master or slave if the disk the ldap database is stored on becomes full.  Granted, we normally monitor for this, but if it were somehow missed or the disk filled up very fast, we would like to know now what will happen.

So here's the setup:

RHEL 5 i386
VMWare Guest
openldap-2.4.11 (custom RPM, all backends, all overlays, monitor as module, back ldap as module)
BDB backend
overlay accesslog
overlay ppolicy
overlay syncprov
overlay unique
overlay dynlist
overlay refint
overlay memberof

master1 and master2 replicate to one another with mirror mode
slave2 uses master2 as its replica provider (load balancing later)

master1 <-- mirror mode on --> master2 --> slave2 (updates chained to master2)

Scenario 1:
  master2 runs out of disk space, ldap modify request is issued against master2
Result:
  master2 performs the ldapmodify but master1 and slave2 are not notified

Scenario 1a:
  disk space is freed up on master2
Result:
  change is not replicated to master1 and slave2

Scenario 1b:
  disk space is freed up on master2 and master2 is restarted
Result:
  change is still not replicated to master1 and slave2

Scenario 1c:
  disk space is freed up on master2 and master1 or slave2 restarted
Result:
  change is still not replicated to master1 and slave2

Scenario 2:
  master2 runs out of disk space, ldap modify request is issued against master1
Result:
  master1, master2 and slave2 are ALL updated

It gets even worse.  Additional changes against multiple objects on master2 do not get propagated to master1 and slave2.  The only thing that seems to bring master2 back into sync is to write a change to any objects modified during the time when the disk was full.

Scenario 3:
  master2 runs out of disk space, ldap modify request is issued against master2 for object1
Result:
  master2 performs modify but does not replicate it to master1 and slave2

Scenario 3a:
  disk space is freed on master2, ldap modify request is issued on master2 for object2
Result:
  object2 is modified but changes are not replicated to master1 and slave2

Scenario 3b:
  disk space is freed on master2, ldap modify is issued against master1 for object1
Result:
  master1 performs modify and replicates to master2, master2 replicates to slave2, changes to object1 on master2 are lost

Scenario 3c:
  disk space is freed on master2, ldap modify is issued against master1 for object2
Result:
  master1 performs modify and replicates to master2, master2 replicates to slave2, changes to object2 on master2 are lost

It's as if master2 is unable to replicate out any changes that originate on master2 until object1, whose write master2 committed while out of disk space, is modified on master1 and that change is replicated back to master2.
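To make "out of sync" a bit more concrete, here is a rough sketch of how the servers could be compared. It assumes (this is my understanding, not something I verified in these tests) that each server keeps one contextCSN value per serverID on the suffix entry, in the form timestamp#count#sid#mod with the last three fields in hex. The values could be collected with something like `ldapsearch -x -H ldap://master1 -s base -b "dc=example,dc=com" contextCSN`, where the hostname and suffix are placeholders. The function names and sample values below are made up for illustration. It only shows whether the servers disagree, not why.

```python
from typing import Dict, List, NamedTuple


class CSN(NamedTuple):
    timestamp: str  # fixed-width, e.g. "20090312150000.000000Z"
    count: int
    sid: int        # serverID of the server that originated the change
    mod: int


def parse_csn(value: str) -> CSN:
    """Split 'timestamp#count#sid#mod' (assumed layout, hex fields)."""
    timestamp, count, sid, mod = value.strip().split("#")
    return CSN(timestamp, int(count, 16), int(sid, 16), int(mod, 16))


def newest_per_sid(values: List[str]) -> Dict[int, CSN]:
    """Keep the newest CSN one server reports for each serverID."""
    newest: Dict[int, CSN] = {}
    for value in values:
        csn = parse_csn(value)
        if csn.sid not in newest or csn > newest[csn.sid]:
            newest[csn.sid] = csn
    return newest


def find_stale(servers: Dict[str, List[str]]) -> Dict[str, List[int]]:
    """Return {server: [serverIDs whose latest change it has not seen]}."""
    per_server = {name: newest_per_sid(vals) for name, vals in servers.items()}

    # Newest CSN known anywhere for each serverID.
    best: Dict[int, CSN] = {}
    for known in per_server.values():
        for sid, csn in known.items():
            if sid not in best or csn > best[sid]:
                best[sid] = csn

    stale: Dict[str, List[int]] = {}
    for name, known in per_server.items():
        behind = sorted(
            sid for sid, csn in best.items()
            if sid not in known or known[sid] < csn
        )
        if behind:
            stale[name] = behind
    return stale


if __name__ == "__main__":
    # Hypothetical values: master2 (serverID 2) holds a newer change
    # that master1 and slave2 never received.
    observed = {
        "master1": ["20090312150000.000000Z#000000#001#000000",
                    "20090312140000.000000Z#000000#002#000000"],
        "master2": ["20090312150000.000000Z#000000#001#000000",
                    "20090312151500.000000Z#000000#002#000000"],
        "slave2":  ["20090312150000.000000Z#000000#001#000000",
                    "20090312140000.000000Z#000000#002#000000"],
    }
    print(find_stale(observed))
```

With the made-up values above, master1 and slave2 would be reported as behind on serverID 2, which is the pattern I would expect to see in Scenario 1 if the contextCSN on master2 did advance while the disk was full.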

In the case of a slave running out of disk space, if it didn't fall into sync right away the solution would be to blow away the ldap database and let it do a full sync from the masters.

But in the case where one of the master servers runs out of disk space, what should be done to bring them back in sync without losing any changes?