
cn=config replication mistake



Hi,

First, please let me tell you the story of my adventure yesterday.  I'll
summarize my questions at the end.

I set up a simple master-slave replicated system some time ago (stock
Debian wheezy OpenLDAP, version 2.4.31-1+nmu2):

dn: olcDatabase={0}config,cn=config
olcSyncrepl: {0}rid=1 provider=ldap://elm.niif.hu [...]

dn: olcDatabase={1}mdb,cn=config
olcSyncrepl: {0}rid=2 provider=ldap://elm.niif.hu [...]

The slave opened two connections to the master, and everything worked
fine.  Then I enabled TLS and added a CNAME record, so that the master
became accessible as ldaps://ldap-master.niif.hu.  I decided to switch
the replication traffic over to the SSL channel as well, so I used
ldapmodify to change the above attributes to contain
provider=ldaps://ldap-master.niif.hu.  This pretty much broke the
system, because the master server suddenly started to replicate from
itself and thus became read-only.
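
A change of this kind can be sketched as the following LDIF input to
ldapmodify.  This is an illustration only, not the exact input used: the
elided [...] parameters are left as they appear above, and replacing the
whole olcSyncrepl attribute (rather than deleting and re-adding the
value) is an assumption.

dn: olcDatabase={0}config,cn=config
changetype: modify
replace: olcSyncrepl
olcSyncrepl: {0}rid=1 provider=ldaps://ldap-master.niif.hu [...]

dn: olcDatabase={1}mdb,cn=config
changetype: modify
replace: olcSyncrepl
olcSyncrepl: {0}rid=2 provider=ldaps://ldap-master.niif.hu [...]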

Finding no other option, I stopped the "master" slapd and changed the
providers back to their previous values (above) in the
olcDatabase={0}config.ldif and olcDatabase={1}mdb.ldif files under the
cn=config directory of my server configuration.  I know these files
should not be edited, but I found no other way.

This move made the master recognize itself again in the provider URI,
so it did not start replicating and became writable again.  My edits,
however, did not propagate to the slave, probably because I did not
change the internal attributes (entryCSN?), so this was expected.  Also,
slapcat started to report CRC warnings in some LDIF files while dumping
the databases, which was also understandable for the edited ones, but
not so much for cn=config.ldif (if I remember correctly).

I tried to fix these by making some dummy changes with ldapmodify to the
database entries.  For both, I added an extra olcAccess attribute, then
deleted it.  These operations made the slave switch its syncrepl
connections back from ldaps to the ldap port, but also instantly broke
the slave server, which stopped returning results and instead logged
lots of
slapd[27944]: => mdb_idl_fetch_key: cursor failed: Invalid argument (22)

lines.  Having no better idea, I restarted the slave server, which
fortunately returned it to normal working condition.

So, my questions:

1. How does the "self-recognition" (by which the master does not start
   replicating from itself) work, and why did it fail when I changed the
   provider URI to ldaps?  Did using a CNAME (instead of some FQDN of
   the server) confuse it?  Could this be fixed by adding an appropriate
   subjectAltName to the server TLS certificate?  Or by adding some
   olcServerID attributes?

2. How could I have handled the read-only situation, instead of editing
   forbidden LDIF files?  Would setting olcMirrorMode have been
   possible (without olcServerIDs around)?

3. Is my setup in a reliable and consistent state now, or should I
   expect sudden future failures?  I mean, were the "cursor failed"
   errors fixed for good by the slave server restart?

Please also feel free to educate me on any other points, as needed. :)
-- 
Thanks,
Feri.