Re: cn=config replication mistake



Ferenc Wagner wrote:
Hi,

First, please let me tell you the story of my adventure yesterday.  I'll
summarize my questions at the end.

I set up a simple master-slave replicated system some time ago (stock
Debian wheezy OpenLDAP, version 2.4.31-1+nmu2):

dn: olcDatabase={0}config,cn=config
olcSyncrepl: {0}rid=1 provider=ldap://elm.niif.hu [...]

dn: olcDatabase={1}mdb,cn=config
olcSyncrepl: {0}rid=2 provider=ldap://elm.niif.hu [...]

The slave opened two connections to the master, and everything worked
fine.  Then I enabled TLS and put in a CNAME record, so that the master
became accessible as ldaps://ldap-master.niif.hu.  I decided to also
switch over the replication traffic to the SSL channel, so ldapmodified
the above attributes to contain provider=ldaps://ldap-master.niif.hu.
This pretty much broke the system, because the master server suddenly
started to replicate from itself, thus became read-only.

Finding no other option, I stopped the "master" slapd and edited back
the providers to their previous values (above) in the
olcDatabase={0}config.ldif and olcDatabase={1}mdb.ldif files under the
cn=config directory of my server configuration.  I know these files
should not be edited, but I found no other way.

This move made the master recognize itself again in the provider URI,
so it did not start replicating and became writeable.  My edits,
however, did not propagate to the slave, probably because I did not
change the internal attributes (entryCSN?) so this was expected.  Also,
slapcat started to report CRC warnings in some LDIF files while dumping
the databases, which was also understandable for the edited ones, but
not so much for cn=config.ldif (if I remember correctly).

I tried to fix these by doing some dummy changes by ldapmodify to the
database entries.  For both, I added an extra olcAccess attribute, then
deleted it.  These operations made the slave switch back its syncrepl
connections to the ldap port from ldaps, but also instantly broke the
slave server, which stopped returning results and instead logged lots of

slapd[27944]: => mdb_idl_fetch_key: cursor failed: Invalid argument (22)

lines.  Having no better idea, I restarted the slave server, which
fortunately returned it to normal working condition.

So, my questions:

1. How does the "self-recognition" (by which the master does not start
    replicating from itself) work, why did it fail when I changed the
    provider URI to ldaps?

As noted here: http://www.openldap.org/doc/admin24/replication.html#N-Way%20Multi-Master

    Did using a CNAME (instead of some FQDN of
    the server) confuse it?  Could this be fixed by adding an appropriate
    subjectAltName to the server TLS certificate?  Or by adding some
    olcServerID attributes?
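
For illustration, the N-Way Multi-Master section of the Admin Guide referenced above pairs each server ID with a URL in olcServerID, and each slapd finds its own ID by matching those URLs against the listener URLs it was started with (the -h option). A minimal sketch of that form follows. The name ldap-slave.example.org is hypothetical and stands in for the second server, and whether server IDs are appropriate for a plain master-slave setup like this one is not something the sketch settles.

```ldif
# Assumed host names: elm.niif.hu is from the original post,
# ldap-slave.example.org is hypothetical.
dn: cn=config
changetype: modify
replace: olcServerID
olcServerID: 1 ldap://elm.niif.hu
olcServerID: 2 ldap://ldap-slave.example.org
```

With values like these, the slapd on elm.niif.hu would need a listener URL that matches its entry, for example -h "ldap://elm.niif.hu", so the URL used in the configuration and the one given to -h should use the same scheme and host name.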

2. How could I have handled the read-only situation, instead of editing
    forbidden LDIF files?  Would setting olcMirrorMode have been
    possible (without olcServerIDs around)?

At the moment, manually editing the LDIF files was probably your only course of action. In OpenLDAP 2.5, the slapmodify tool should be used to make changes while slapd is shut down.
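
As a rough sketch of what that could look like for the change described in the original post, the following LDIF restores the original provider on the config database. The "[...]" is the elision from the original post and stands for the rest of the olcSyncrepl value, which has to be supplied in full. The file name revert.ldif and the Debian-style configuration directory /etc/ldap/slapd.d in the comment are assumptions.

```ldif
# Apply offline, with slapd stopped, e.g. (paths are assumptions):
#   slapmodify -F /etc/ldap/slapd.d -n 0 -l revert.ldif
dn: olcDatabase={0}config,cn=config
changetype: modify
replace: olcSyncrepl
olcSyncrepl: {0}rid=1 provider=ldap://elm.niif.hu [...]
```

The olcDatabase={1}mdb entry would need the same treatment with its own rid=2 value.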

3. Is my setup in a reliable and consistent state now, or should I
    expect sudden future failures?  I mean, were the "cursor failed"
    errors fixed for good by the slave server restart?

Don't know. You're using 2.4.31 and the current release is 2.4.39, so possibly you saw a bug that has since been fixed. It doesn't sound familiar, though.

Please also feel free to educate me on any other points, as needed. :)



--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/