
Re: (ITS#9015) Replication goes haywire querying promoted master



On Tue, Apr 23, 2019 at 04:48:28PM +0000, quanah@openldap.org wrote:
> In testing a particular use case/setup scenario, I found that it's possible to
> cause a replica to slam a provider with unending requests.  In this specific
> case, I was setting up delta-syncrepl MMR, but I believe the issue applies to
> standard syncrepl, and is not MMR specific.  The scenario looks like this:
> 
> Initially we have a stand alone server, which no overlays in place.  The
> configuration is done via cn=config, which allows for us to update the
> configuration without a server restart.
> 
> [...]
> 
> I believe the problem is that the root entry for the database contains no
> contextCSN.  This is likely due to the fact that:
> 
> [...]
> 
> However, I think the overall behavior is undesirable.  If there is no contextCSN
> present, it should not lead to replication clients executing a potential DoS on
> the provider.  It also generated ~60GB of logs at loglevel stats in 1 day.

Ok, I think this is the consumer's fault and limited to refreshAndPersist
delta-syncrepl (with or without MMR).

Going by what I remember of what the consumer code did:
- on set up, it finds out there's no cookie to go by so it goes into
  refresh on the main DB
- main DB responds with success but no/empty cookie
- consumer starts over but again finds itself with no cookie, so it goes
  to step one

But the consumer is actually up to date at that point as the search
suggested, so it should just go ahead and do the refreshAndPersist on
accesslog as it planned to originally. And as operations hit the main
DB, they will replicate accordingly, even if that were to happen after
the original search and before this one[0].

With that fix, the behaviour would be as follows:
- on set up, it finds out there's no cookie to go by so it goes into
  refresh on the main DB
- main DB responds with success but no/empty cookie
- consumer starts over, remembering that its cookie (albeit empty) is
  valid, so sends a refreshAndPersist search on accesslog DB
- that yields no traffic, but will give it the right data once anything
  replication-worthy happens, job done
- if the connection is actually severed (restarts, ...) before anything
  needs replicating, we start over from step one, but no overhead has
  been incurred, so we're still fine

[0]. Unless so much time has elapsed between the two searches (which
happen on the same connection, BTW) that some accesslog ops have already
expired. Expiration is usually configured in days, not seconds, and an
admin who doesn't notice a consumer going AWOL for that long probably
deserves that.

-- 
Ondřej Kuzník
Senior Software Engineer
Symas Corporation                       http://www.symas.com
Packaged, certified, and supported LDAP solutions powered by OpenLDAP