Full_Name: Ryan Tandy
Version: 2.4.46
OS: Debian
URL: ftp://ftp.openldap.org/incoming/20180511_rtandy_syncrepl-memory-consumer.tgz
Submission from: (NULL) (70.66.128.207)
Submitted by: ryan

When running object-based syncrepl and making changes to groups, the provider slapd uses more and more memory, apparently without bound. We've discussed this issue before, but there was no ITS tracking it specifically.

Original Debian bug: https://bugs.debian.org/725091

A possibly related openldap-technical post:
https://www.openldap.org/lists/openldap-technical/201503/msg00206.html

Reproducer: ftp://ftp.openldap.org/incoming/20180511_rtandy_syncrepl-memory-consumer.tgz

./prepare
./runslapd  (backgrounds a provider slapd and a consumer slapd)
./modify    (makes a number of modifications on the provider)
./clean     (kills both slapds and cleans databases)

Run top in another terminal and watch the memory growth. On my system, the provider grows to over 3 GB resident and does not shrink even after replication completes. With delta-syncrepl enabled, the provider's RSS is only around 10 MB.

Reproduced on Debian unstable with 2.4.46, with both glibc malloc (glibc 2.27-3) and tcmalloc_minimal (2.7-1).
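For watching the growth without keeping top open, a small helper along these lines can sample the provider's resident set from /proc (a sketch, Linux-specific; the PID and polling interval are placeholders, not part of the reproducer tarball):

```python
import re
import time

def parse_vmrss(status_text):
    """Extract the VmRSS value (KiB) from the text of /proc/<pid>/status."""
    m = re.search(r"^VmRSS:\s*(\d+)\s*kB", status_text, re.MULTILINE)
    return int(m.group(1)) if m else None

def rss_kib(pid):
    """Return the resident set size of `pid` in KiB (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vmrss(f.read())

def watch(pid, interval=5):
    """Print the process's RSS every `interval` seconds until interrupted."""
    while True:
        print(f"RSS: {rss_kib(pid)} KiB")
        time.sleep(interval)
```

Pointing `watch()` at the provider slapd's PID while ./modify runs should show the same unbounded growth that top reports.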
bisect identifies c365ac359e9c9b483b934c2a1f0bc552645c32fa as the commit that introduced this behaviour.

003dfbda574f37bbf1a2240f530ff9fa35ab0801 on RE24 (2.4.20)

commit c365ac359e9c9b483b934c2a1f0bc552645c32fa
Author: Howard Chu <hyc@openldap.org>
Date:   Sun Nov 22 04:42:00 2009 +0000

    ITS#6368 use dup'd entries in response queue
ryan@nardis.ca wrote:
> bisect identifies c365ac359e9c9b483b934c2a1f0bc552645c32fa as the commit
> that introduced this behaviour.
>
> 003dfbda574f37bbf1a2240f530ff9fa35ab0801 on RE24 (2.4.20)
>
> commit c365ac359e9c9b483b934c2a1f0bc552645c32fa
> Author: Howard Chu <hyc@openldap.org>
> Date:   Sun Nov 22 04:42:00 2009 +0000
>
>     ITS#6368 use dup'd entries in response queue

I've run your reproducer and see no memory leak. The response queue will indeed grow without bound if the consumer runs slower than the provider and doesn't read responses fast enough, but in the case of this test script the client eventually finishes and the consumer catches up. The provider's process size may not decrease, but that just means the malloc implementation isn't returning freed memory to the kernel - it's not a leak.

This can be verified using mleak, using SIGPROF to snapshot the provider's memory usage. The simplest way to force the memory use to grow is to first suspend the consumer with SIGSTOP, then let the modify client run to completion. mleak / SIGPROF will show a large amount of memory in use. Resume the consumer with SIGCONT, let it run to completion, and then check the provider with SIGPROF again - all of the response queue memory is freed.

So, conclusively, there is no actual leak. But there is a problem with sustained client modifications when the consumer is too slow. Our options here are to configure a size limit on the response queue, then either hang the client when the limit is hit or return LDAP_BUSY to the client. Neither of these is a very attractive option. Doing batched commits would speed up the consumer, but that feature is only in 2.5.

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
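The suspend/resume experiment described above can be scripted. A rough sketch follows; the PIDs are placeholders, and the `send`/`wait` parameters are only there so the sequence of steps can be exercised without live slapds (by default they are os.kill and time.sleep):

```python
import os
import signal
import time

def snapshot_heap(provider_pid, send=os.kill):
    # A slapd built with mleak dumps its current memory usage on
    # SIGPROF, so before/after snapshots can be compared.
    send(provider_pid, signal.SIGPROF)

def run_experiment(provider_pid, consumer_pid, run_modify,
                   send=os.kill, wait=time.sleep):
    # 1. Suspend the consumer so the provider's response queue can only grow.
    send(consumer_pid, signal.SIGSTOP)
    # 2. Drive modifications against the provider (e.g. the ./modify script).
    run_modify()
    # 3. Snapshot: expect a large amount of memory in use (queued responses).
    snapshot_heap(provider_pid, send)
    # 4. Resume the consumer so it can drain the queue.
    send(consumer_pid, signal.SIGCONT)
    wait(60)  # placeholder; really: wait until the consumer has caught up
    # 5. Snapshot again: the response queue memory should all be freed.
    snapshot_heap(provider_pid, send)
```

The point of the injected `send` is just testability of the step ordering; in real use the defaults apply and the PIDs come from the running slapds.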
(In reply to Howard Chu from comment #2)
> Doing batched commits will speed up the consumer, but that feature is only
> in 2.5.

Batched commits for replication were reverted in 79ced664b8597c8c08afcb9d1fd48ca4201fe5f7 and 12dbcc0eb3fd534ba02e3c8ed8fb1e55c964d6af due to issues uncovered in ITS#8752.
Similarly, when I used AWS it was necessary to provision the consumers at 4k IOPS while the providers ran at 3k IOPS. That is, consumers generally need to be faster than providers when processing large sequences of write updates.
It may be possible to improve the diff code for standard syncrepl to improve performance on the consumer side when the attribute is sorted via sortvals; needs investigation.
attr_cmp should check whether the attribute is subject to sortvals and, if so, diff without falling back to a double loop.
Making attr_cmp do a linear sweep for sortvals attributes (instead of the quadratic match it has to do right now) makes the consumer only 7-8x slower than the provider across the board in the environment provided. I might have expected something closer to 3-4x, but that's out of scope for this particular ITS.
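The linear sweep in question can be illustrated in miniature. This is a hedged sketch in Python rather than slapd's C, and `old`/`new` merely stand in for the two sorted value arrays a sortvals attribute maintains:

```python
def diff_sorted_values(old, new):
    """Single merge-style pass over two sorted value lists, returning
    (deleted, added).  O(m+n) comparisons, versus the O(m*n) double
    loop needed when the value lists are unsorted."""
    deleted, added = [], []
    i = j = 0
    while i < len(old) and j < len(new):
        if old[i] == new[j]:
            i += 1              # value unchanged, skip in both lists
            j += 1
        elif old[i] < new[j]:
            deleted.append(old[i])   # present before, gone now
            i += 1
        else:
            added.append(new[j])     # newly introduced value
            j += 1
    deleted.extend(old[i:])     # trailing values only in the old list
    added.extend(new[j:])       # trailing values only in the new list
    return deleted, added
```

The sweep relies entirely on both lists being kept in sorted order, which is exactly what sortvals guarantees; without that invariant, each old value would have to be searched for among all new values.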
For comparison, using deltasync (and sortvals!) makes the consumer take a similar amount of CPU time (about 50-90% more than the provider's) to process the 10k value additions, just like Ryan noted earlier. As for the other idea, no clue on whether we can somehow limit the amount of data queued up without severely impairing replication progress.
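As a strawman for limiting the queued data, the two options Howard mentioned earlier (hang the client at the limit, or return LDAP_BUSY) map naturally onto a bounded queue. A toy sketch, not slapd code; the limit value and class names are invented for illustration:

```python
import queue

class BusyError(Exception):
    """Stand-in for returning LDAP_BUSY to the modifying client."""

class ResponseQueue:
    def __init__(self, limit, block_writer=True):
        self._q = queue.Queue(maxsize=limit)
        self._block_writer = block_writer

    def push(self, response):
        if self._block_writer:
            # Option 1: hang the writer until the consumer drains the queue.
            self._q.put(response)
        else:
            # Option 2: refuse immediately, analogous to LDAP_BUSY.
            try:
                self._q.put_nowait(response)
            except queue.Full:
                raise BusyError("response queue full")

    def pop(self):
        # The consumer side: drain one queued response.
        return self._q.get()
```

Either policy caps the provider's memory at roughly limit x entry size, at the cost of stalling or failing writers whenever the consumer lags, which is exactly the "severely impairing replication progress" concern.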
• 8986f99d by Ondřej Kuzník at 2023-11-14T18:09:10+00:00

  ITS#8852 Optimise attr_cmp for sortval attributes