
(ITS#9015) Replication goes haywire querying promoted master



Full_Name: Quanah Gibson-Mount
Version: 2.4.47
OS: N/A
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (47.208.144.40)


In testing a particular use case/setup scenario, I found that it's possible to
cause a replica to slam a provider with unending requests.  In this specific
case, I was setting up delta-syncrepl MMR, but I believe the issue applies to
standard syncrepl, and is not MMR specific.  The scenario looks like this:

Initially we have a standalone server, with no overlays in place.  The
configuration is done via cn=config, which allows us to update the
configuration without a server restart.

The configuration is modified to load the syncprov and accesslog overlays,
create a new accesslog database, and to send all change data to the accesslog
db.

After that is done, a secondary server is brought online with the same
configuration other than the serverID being different and the syncrepl statement
adjusted.

When the secondary server is started, it pummels the initial provider with
queries like:

Apr 23 06:39:06 anvil4 slapd[28967]: conn=1003 op=361868131 SRCH base="dc=example,dc=com" scope=2 deref=0 filter="(objectClass=*)"
Apr 23 06:39:06 anvil4 slapd[28967]: conn=1003 op=361868131 SRCH attr=* +
Apr 23 06:39:06 anvil4 slapd[28967]: conn=1003 op=361868131 SEARCH RESULT tag=101 err=0 nentries=0 text=

(Averaging around 2000 queries/second on my server per syncrepl client).
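
For anyone wanting to measure the same rate on their own provider, here is a
rough Python sketch that counts SRCH operations per second per connection from
a stats-level log.  It assumes each log record sits on a single line in the
syslog layout shown above (the excerpt above may be wrapped by the mailer); the
function name search_rates is my own and not part of any OpenLDAP tool.

```python
import re
from collections import Counter

# Matches the line that opens a search operation in slapd "stats" output.
# The follow-up "SRCH attr=..." line is deliberately not matched, so each
# search is counted once.
SRCH_LINE = re.compile(
    r'^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+slapd\[\d+\]:\s+'
    r'conn=(?P<conn>\d+)\s+op=\d+\s+SRCH\s+base='
)


def search_rates(lines):
    """Count SRCH operations per (timestamp second, connection id)."""
    counts = Counter()
    for line in lines:
        m = SRCH_LINE.match(line)
        if m:
            counts[(m.group("ts"), int(m.group("conn")))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (hypothetical log path): python search_rates.py < /var/log/slapd.log
    for (ts, conn), n in sorted(search_rates(sys.stdin).items()):
        print(f"{ts} conn={conn} searches={n}")
```

A connection that shows thousands of searches in one second, all with
nentries=0, would match the pattern described here.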

I believe the problem is that the root entry for the database contains no
contextCSN.  This is likely due to the fact that:

a) There was never a syncprov overlay present until I loaded this one in
b) The serverID was set prior to the syncprov overlay being loaded (So it went
from "0" to "1", with no changes ever recorded for "1").
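
One way to check for this condition before starting a replica is to read the
suffix entry with something like

  ldapsearch -x -LLL -s base -b "dc=example,dc=com" contextCSN

(bind options will vary by site) and look at the output.  The Python sketch
below parses that LDIF text and reports which serverIDs have a contextCSN
recorded.  It assumes the usual four-field CSN layout
(timestamp#count#sid#mod, with the SID in hex); the helper names context_csns
and has_csn_for are hypothetical.

```python
def context_csns(ldif_text):
    """Return {serverID: CSN} for each contextCSN value found in LDIF text."""
    # Unfold LDIF continuation lines: a line starting with a single space
    # continues the previous line.
    unfolded = ldif_text.replace("\r\n", "\n").replace("\n ", "")
    csns = {}
    for line in unfolded.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "contextcsn":
            value = value.strip()
            fields = value.split("#")
            # Assumed layout: timestamp#count#sid#mod, SID in hexadecimal.
            if len(fields) == 4:
                csns[int(fields[2], 16)] = value
    return csns


def has_csn_for(ldif_text, server_id):
    """True if the entry carries a contextCSN stamped with server_id."""
    return server_id in context_csns(ldif_text)
```

In the scenario described here, context_csns() would be expected to return an
empty dict for the provider's root entry until a first write is made.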

Now there is a trivial way to handle this, by making a change on the provider
prior to starting up the other servers.

However, I think the overall behavior is undesirable.  If there is no contextCSN
present, it should not lead to replication clients executing a potential DoS on
the provider.  It also generated ~60GB of logs at loglevel stats in 1 day.