7841 – high disk utilization

Status:	VERIFIED FIXED

Alias:	None

Product:	OpenLDAP
Classification:	Unclassified
Component:	slapd
Version:	2.4.38
Hardware:	All All

Importance:	--- normal
Target Milestone:	---
Assignee:	OpenLDAP project

URL:
Keywords:

Depends on:
Blocks:

Reported:	2014-04-21 07:07 UTC by dmitrii.fonariuk@gmail.com
Modified:	2017-09-11 16:11 UTC
CC List:	0 users

See Also:

Attachments
0001-TRIVIA-lmdb-clean-testdb-dir-while-make-test.patch (663 bytes, patch) 2014-10-02 22:04 UTC, Leonid Yuriev
0002-EXTENSION-lmdb-more-usefull-info-from-mdb_stat-tool.patch (4.04 KB, patch) 2014-10-02 22:04 UTC, Leonid Yuriev
0003-CHANGE-lmdb-backend-checkpoint-interval-in-seconds-i.patch (4.62 KB, patch) 2014-10-02 22:04 UTC, Leonid Yuriev
0004-FEATURE-lmdb-implementation-of-checkpoint-kbytes.patch (5.86 KB, patch) 2014-10-02 22:04 UTC, Leonid Yuriev
0005-FEATURE-lmdb-backend-support-for-checkpoint-kbytes-c.patch (2.13 KB, patch) 2014-10-02 22:04 UTC, Leonid Yuriev
0006-BUGFIX-lmdb-properly-sync-meta-pages-in-mdb_sync_env.patch (3.46 KB, patch) 2014-10-02 22:04 UTC, Leonid Yuriev
0007-FEATURE-lmdb-MDB_LIFORECLAIM-MDB_COALESCE-modes.patch (19.10 KB, patch) 2014-10-02 22:04 UTC, Leonid Yuriev
0008-FEATURE-lmdb-backend-support-config-for-lifo-and-coa.patch (877 bytes, patch) 2014-10-02 22:04 UTC, Leonid Yuriev
0009-BUGFIX-lmdb-volatile-to-important-fields-which.patch (1.07 KB, patch) 2014-10-02 22:04 UTC, Leonid Yuriev
0010-BUGFIX-lmdb-reordering-of-instructions-which-update-.patch (1.56 KB, patch) 2014-10-02 22:04 UTC, Leonid Yuriev
0011-BUGFIX-lmdb-lock-meta-pages-in-writemap-mode-to-avoi.patch (1.17 KB, patch) 2014-10-02 22:04 UTC, Leonid Yuriev

Description dmitrii.fonariuk@gmail.com 2014-04-21 07:07:25 UTC

Full_Name: Dmitrii Fonariuk
Version: 2.4.38
OS: rhEL6.x86_64
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (91.210.4.1)


The DISK WRITE value reported by the iotop utility is very high when the MDB
database contains a lot of free pages (Freelist Status). We suspect this comes
from the page allocation algorithm: blocks of pages are taken from the
free-page pool in FIFO order, so every modifying transaction touches and
dirties different pages, which are then flushed to disk by the system flush
process. Perhaps it would be better to use LIFO order, so that different
transactions dirty the same pages, which would reduce the load on the disk?
We use MDB with the envflags writemap and mapasync.
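
The effect described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not LMDB code: the function name, the enum, and the rule that freed pages become reusable by the very next transaction (no lagging readers) are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model, not LMDB code. Each transaction takes pages_per_txn pages from
 * the free pool, dirties them, and releases the pages it replaced. */
enum reclaim_order { RECLAIM_FIFO, RECLAIM_LIFO };

/* Returns how many distinct page numbers were dirtied after txn_count
 * transactions. Freed pages become reusable by the next transaction. */
size_t distinct_dirty_pages(size_t total_pages, size_t pages_per_txn,
                            size_t txn_count, enum reclaim_order order)
{
    assert(pages_per_txn > 0 && total_pages >= 2 * pages_per_txn);

    size_t *pool = malloc(total_pages * sizeof *pool);      /* ring buffer of free pages */
    size_t *live = malloc(pages_per_txn * sizeof *live);    /* current page versions */
    size_t *fresh = malloc(pages_per_txn * sizeof *fresh);  /* pages taken by this txn */
    bool *dirtied = calloc(total_pages, sizeof *dirtied);
    assert(pool && live && fresh && dirtied);

    size_t head = 0, count = 0, distinct = 0;
    for (size_t p = 0; p < total_pages; p++) {
        if (p < pages_per_txn)
            live[p] = p;
        else
            pool[count++] = p;
    }

    for (size_t t = 0; t < txn_count; t++) {
        for (size_t i = 0; i < pages_per_txn; i++) {
            size_t page;
            if (order == RECLAIM_FIFO) {          /* oldest free page first */
                page = pool[head];
                head = (head + 1) % total_pages;
                count--;
            } else {                              /* most recently freed first */
                count--;
                page = pool[(head + count) % total_pages];
            }
            if (!dirtied[page]) {
                dirtied[page] = true;
                distinct++;
            }
            fresh[i] = page;
        }
        for (size_t i = 0; i < pages_per_txn; i++) {
            pool[(head + count) % total_pages] = live[i];  /* old version is now free */
            count++;
            live[i] = fresh[i];
        }
    }

    free(pool);
    free(live);
    free(fresh);
    free(dirtied);
    return distinct;
}
```

In this model, with 1000 pages, 10 pages per transaction and 500 transactions, FIFO order ends up dirtying all 1000 pages, while LIFO order keeps reusing the same 20. Real workloads with long-lived readers will not behave this cleanly.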

Comment 1 Leonid Yuriev 2014-10-02 22:04:37 UTC

The attached patch file is derived from OpenLDAP Software. All of the
modifications to OpenLDAP Software represented in the following patch(es)
were developed by Leonid Yuriev <leo@yuriev.ru>. I have not assigned rights
and/or interest in this work to any party.

The attached modifications to OpenLDAP Software are subject to the
following notice:

Copyright 2014 Leonid Yuriev.
Copyright 2014 Peter-Service LLC, Moscow, Russia.
Redistribution and use in source and binary forms, with or without
modification, are permitted only as authorized by the OpenLDAP Public
License.

Comment 2 Leonid Yuriev 2014-10-02 22:20:42 UTC

The attached patch file is derived from OpenLDAP Software. All of the
modifications to OpenLDAP Software represented in the following
patch(es) were developed by Leonid Yuriev <leo@yuriev.ru>. I have not
assigned rights and/or interest in this work to any party.

The attached modifications to OpenLDAP Software are subject to the
following notice:

Copyright 2014 Leonid Yuriev.
Copyright 2014 Peter-Service LLC, Moscow, Russia.
Redistribution and use in source and binary forms, with or without
modification, are permitted only as authorized by the OpenLDAP Public
License.

https://github.com/leo-yuriev/openldap-lmdb-challenge/pull/1
or
https://github.com/leo-yuriev/openldap-lmdb-challenge/ branch master-devel

commit 841059330fd44769e93eb4b937c3ce42654fad6f
Author: Leo Yuriev <leo@yuriev.ru>
Date:   2014-09-20 07:16:15 +0400

     BUGFIX - lmdb: lock meta-pages in writemap-mode to avoid unexpected write,
               before the data pages would be synchronized.

     Without locking the meta-pages may be written by OS before other data,
     in this case database would be inconsistent.

commit 6240c3350e8bd86337c7e41722cf6a38881f15e7
Author: Leo Yuriev <leo@yuriev.ru>
Date:   2014-09-12 01:32:13 +0400

     BUGFIX - lmdb: reordering of instructions which update the txn in a meta-page.

     Without "volatile" or memory-barrier compiler may reorder instructions
     for update the "mm_txnid" field in meta-page in "writemap" mode.

     From the reader's point of view this causes a short
     time interval when the transaction is corrupted.
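
The ordering problem this commit describes can be sketched as follows. Note that this uses C11 release/acquire atomics rather than the "volatile" qualifier the patch mentions, and that the struct and function names are hypothetical stand-ins, not LMDB's own meta-page layout (real LMDB also alternates between two meta-pages, which this toy ignores).

```c
#include <stdatomic.h>

/* Hypothetical stand-in for a meta-page; not LMDB's MDB_meta. */
struct toy_meta {
    unsigned long root_page;       /* plain fields, written before publication */
    unsigned long last_page;
    _Atomic unsigned long txnid;   /* published last */
};

/* Writer side: fill in the plain fields, then publish the txn id. */
void toy_meta_publish(struct toy_meta *m, unsigned long root,
                      unsigned long last, unsigned long txnid)
{
    m->root_page = root;
    m->last_page = last;
    /* Release store: neither the compiler nor the CPU may move the two
     * plain stores above past this point. */
    atomic_store_explicit(&m->txnid, txnid, memory_order_release);
}

/* Reader side: load the txn id first, then the fields it guards. */
unsigned long toy_meta_read(struct toy_meta *m, unsigned long *root,
                            unsigned long *last)
{
    /* Acquire load: pairs with the release store in toy_meta_publish(). */
    unsigned long id = atomic_load_explicit(&m->txnid, memory_order_acquire);
    *root = m->root_page;
    *last = m->last_page;
    return id;
}
```

A reader that observes the new txn id through the acquire load is then guaranteed to see the field values stored before the matching release store.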

commit accef62de7fe5660f870f4c5da319a2a8098b2fb
Author: Leo Yuriev <leo@yuriev.ru>
Date:   2014-09-21 02:29:50 +0400

     BUGFIX - lmdb: 'volatile' to important fields, which
               may be updated by readers asynchronously.

     Without 'volatile' the compiler may eliminate mdb_find_oldest() calls.

commit bb83e03cf1b8bceee64550229c3becbdd5400680
Author: Leo Yuriev <leo@yuriev.ru>
Date:   2014-09-19 20:18:17 +0400

     FEATURE - lmdb-backend: support config for 'lifo' and 'coalesce' envflags.

commit 0c168d0e63ed78d13df3fc8a42f3667335678639
Author: Leo Yuriev <leo@yuriev.ru>
Date:   2014-09-20 10:13:28 +0400

     FEATURE - lmdb: MDB_LIFORECLAIM & MDB_COALESCE modes.

     Reclaim FreeDB in LIFO order - this is a main feature.
     Also aim to coalesce small FreeDB records.

commit 8ddd63161aeb2689822d1a8d27385d62e4e341ae
Author: Leo Yuriev <leo@yuriev.ru>
Date:   2014-09-19 22:47:19 +0400

     BUGFIX - lmdb: properly sync meta-pages in mdb_sync_env().

     Meta-pages may be updated during data-syncing in mdb_sync_env(),
     in this case database would be inconsistent.

     Check-and-retry if lead txn-id changed during flushing data in mdb_sync_env().

commit 908677f989588d06b9f00620576dea3c5c8675d7
Author: Leo Yuriev <leo@yuriev.ru>
Date:   2014-09-04 16:10:05 +0400

     FEATURE - lmdb-backend: support for "checkpoint kbytes" config-option.

commit 147f41a8110f28456bc32123bde86d47183f9c0a
Author: Leo Yuriev <leo@yuriev.ru>
Date:   2014-09-04 16:01:15 +0400

     FEATURE - lmdb: implementation of "checkpoint kbytes".

     Force flush when volume of the changes reached a configurable threshold.

commit fb82a0b688f4c31313d0790415feda8aaa18651c
Author: Leo Yuriev <leo@yuriev.ru>
Date:   2014-09-04 15:18:16 +0400

     CHANGE - lmdb-backend: checkpoint-interval in seconds instead of minutes.

commit fc409d89e0d9dde20f612e34c2a463c8a81ea000
Author: Leo Yuriev <leo@yuriev.ru>
Date:   2014-09-20 06:51:04 +0400

     EXTENSION - lmdb: more useful info from mdb_stat tool.

commit ccc7da690ffbff440643295b945fdf7886f48c97
Author: Leo Yuriev <leo@yuriev.ru>
Date:   2014-09-05 00:19:16 +0400

     TRIVIA - lmdb: clean testdb-dir while "make test".

Comment 3 Howard Chu 2014-10-02 23:13:47 UTC

leo@yuriev.ru wrote:
> The attached patch file is derived from OpenLDAP Software. All of the
> modifications to OpenLDAP Software represented in the following
> patch(es) were developed by Leonid Yuriev <leo@yuriev.ru>. I have not
> assigned rights and/or interest in this work to any party.
>
> The attached modifications to OpenLDAP Software are subject to the
> following notice:
>
> Copyright 2014 Leonid Yuriev.
> Copyright 2014 Peter-Service LLC, Moscow, Russia.
> Redistribution and use in source and binary forms, with or without
> modification, are permitted only as authorized by the OpenLDAP Public
> License.
>
> https://github.com/leo-yuriev/openldap-lmdb-challenge/pull/1
> or
> https://github.com/leo-yuriev/openldap-lmdb-challenge/ branch master-devel
>
> commit 841059330fd44769e93eb4b937c3ce42654fad6f
> Author: Leo Yuriev <leo@yuriev.ru>
> Date:   2014-09-20 07:16:15 +0400
>
>       BUGFIX - lmdb: lock meta-pages in writemap-mode to avoid unexpected write,
>                 before the data pages would be synchronized.
>
>       Without locking the meta-pages may be written by OS before other data,
>       in this case database would be inconsistent.

Seems unnecessary. Won't happen by default; could happen with MDB_NOSYNC but 
that risk is already documented.
>
> commit 6240c3350e8bd86337c7e41722cf6a38881f15e7
> Author: Leo Yuriev <leo@yuriev.ru>
> Date:   2014-09-12 01:32:13 +0400
>
>       BUGFIX - lmdb: reordering of instructions which update the txn in
> a meta-page.
>
>       Without "volatile" or memory-barrier compiler may reorder instructions
>       for update the "mm_txnid" field in meta-page in "writemap" mode.
>
>       From the reader's point of view this causes a short
>       time interval when the transaction is corrupted.

OK.
>
> commit accef62de7fe5660f870f4c5da319a2a8098b2fb
> Author: Leo Yuriev <leo@yuriev.ru>
> Date:   2014-09-21 02:29:50 +0400
>
>       BUGFIX - lmdb: 'volatile' to important fields, which
>                 may be updated by readers asynchronously.
>
>       Without 'volatile' the compiler may eliminate mdb_find_oldest() calls.

OK.
>
> commit bb83e03cf1b8bceee64550229c3becbdd5400680
> Author: Leo Yuriev <leo@yuriev.ru>
> Date:   2014-09-19 20:18:17 +0400
>
>       FEATURE - lmdb-backend: support config for 'lifo' and 'coalesce' envflags.
>
> commit 0c168d0e63ed78d13df3fc8a42f3667335678639
> Author: Leo Yuriev <leo@yuriev.ru>
> Date:   2014-09-20 10:13:28 +0400
>
>       FEATURE - lmdb: MDB_LIFORECLAIM & MDB_COALESCE modes.
>
>       Reclaim FreeDB in LIFO order - this is a main feature.
>       Also aim to coalesce small FreeDB records.

Will spend more time looking at this closer.
>
> commit 8ddd63161aeb2689822d1a8d27385d62e4e341ae
> Author: Leo Yuriev <leo@yuriev.ru>
> Date:   2014-09-19 22:47:19 +0400
>
>       BUGFIX - lmdb: properly sync meta-pages in mdb_sync_env().
>
>       Meta-pages may be updated during data-syncing in mdb_sync_env(),
>       in this case database would be inconsistent.
>
>       Check-and-retry if lead txn-id changed during flushing data in
> mdb_sync_env().

Probably could simplify this, just obtain the write mutex unconditionally, 
then there's no need to loop or retry. But also, this depends on MDB_NOLOCK - 
if that's set, then do no locking at all.

> commit 908677f989588d06b9f00620576dea3c5c8675d7
> Author: Leo Yuriev <leo@yuriev.ru>
> Date:   2014-09-04 16:10:05 +0400
>
>       FEATURE - lmdb-backend: support for "checkpoint kbytes" config-option.

OK if the lmdb implementation is OK.
>
> commit 147f41a8110f28456bc32123bde86d47183f9c0a
> Author: Leo Yuriev <leo@yuriev.ru>
> Date:   2014-09-04 16:01:15 +0400
>
>       FEATURE - lmdb: implementation of "checkpoint kbytes".
>
>       Force flush when volume of the changes reached a configurable threshold.

Probably OK. Needs some typographical cleanup. Not sure "syncbytes" is a good 
name.
>
> commit fb82a0b688f4c31313d0790415feda8aaa18651c
> Author: Leo Yuriev <leo@yuriev.ru>
> Date:   2014-09-04 15:18:16 +0400
>
>       CHANGE - lmdb-backend: checkpoint-interval in seconds instead of minutes.

Gratuitous change. We used minutes since the BDB backend uses minutes, and the 
intention was to maintain parallel functionality. What's the justification for 
this change?
>
> commit fc409d89e0d9dde20f612e34c2a463c8a81ea000
> Author: Leo Yuriev <leo@yuriev.ru>
> Date:   2014-09-20 06:51:04 +0400
>
>       EXTENSION - lmdb: more useful info from mdb_stat tool.

A bit ambiguous. me_tail_txnid is actually the ID of the oldest reader, not 
the "last" reader. I'm not convinced of the value of this patch, since you can 
already view the readers list.

> commit ccc7da690ffbff440643295b945fdf7886f48c97
> Author: Leo Yuriev <leo@yuriev.ru>
> Date:   2014-09-05 00:19:16 +0400
>
>       TRIVIA - lmdb: clean testdb-dir while "make test".

OK.


-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Comment 4 Leonid Yuriev 2014-10-03 00:55:17 UTC

2014-10-03 3:13 GMT+04:00 Howard Chu <hyc@symas.com>:
>> commit 841059330fd44769e93eb4b937c3ce42654fad6f
>> Author: Leo Yuriev <leo@yuriev.ru>
>> Date:   2014-09-20 07:16:15 +0400
>>
>>       BUGFIX - lmdb: lock meta-pages in writemap-mode to avoid unexpected
>> write,
>>                 before the data pages would be synchronized.
>>
>>       Without locking the meta-pages may be written by OS before other
>> data,
>>       in this case database would be inconsistent.
>
>
> Seems unnecessary. Won't happen by default; could happen with MDB_NOSYNC but
> that risk is already documented.

We are using the combination:
  envflags writemap nosync lifo
  checkpoint 0 1

With the checkpoint interval set in seconds, this gives us assurance that
the database on disk is in a consistent state.
However, without this patch the meta-pages can be written by the kernel
before the data.

In fact, for a full guarantee in case the slapd process dies,
the meta-page should be written explicitly.
But that requires a lot of changes, so I have not done it.

>> commit 0c168d0e63ed78d13df3fc8a42f3667335678639
>> Author: Leo Yuriev <leo@yuriev.ru>
>> Date:   2014-09-20 10:13:28 +0400
>>
>>       FEATURE - lmdb: MDB_LIFORECLAIM & MDB_COALESCE modes.
>>
>>       Reclaim FreeDB in LIFO order - this is a main feature.
>>       Also aim to coalesce small FreeDB records.
>
> Will spend more time looking at this closer.

I would suggest, though I do not insist, reviewing this patch on GitHub.

>> commit 8ddd63161aeb2689822d1a8d27385d62e4e341ae
>> Author: Leo Yuriev <leo@yuriev.ru>
>> Date:   2014-09-19 22:47:19 +0400
>>
>>       BUGFIX - lmdb: properly sync meta-pages in mdb_sync_env().
>>
>>       Meta-pages may be updated during data-syncing in mdb_sync_env(),
>>       in this case database would be inconsistent.
>>
>>       Check-and-retry if lead txn-id changed during flushing data in
>> mdb_sync_env().
>
> Probably could simplify this, just obtain the write mutex unconditionally,
> then there's no need to loop or retry. But also, this depends on MDB_NOLOCK
> - if that's set, then do no locking at all.

I did it this way for performance and to shorten the time the lock is held.

Retries happen when there is an intensive flow of changes.
In that case there will be a lot of updated pages, and writing them out
will take some time.

However, on subsequent iterations (if transactions were committed
while the write was in progress),
there will be far fewer modified pages, and the sync will be quick.

Thus (and this was seen in tests), even with a substantial volume of
transactions,
the loop usually runs only two iterations,
without locking, and the flow of changes is not suspended.
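
The check-and-retry idea discussed here can be sketched roughly as below. This is not the code from the patch: the struct, the callbacks, the TOY_SYNC_BUSY result and the pass limit are all hypothetical, and the two stub flush functions exist only so the loop can be exercised without real I/O.

```c
#include <stdatomic.h>

#define TOY_SYNC_BUSY (-1)  /* hypothetical: caller should fall back to locking */

/* Hypothetical stand-in for an environment; not LMDB's MDB_env. */
struct toy_env {
    _Atomic unsigned long last_txnid;        /* bumped by each committing writer */
    int (*flush_data)(struct toy_env *env);  /* e.g. msync() over the data pages */
    int (*flush_meta)(struct toy_env *env);  /* e.g. msync() over the meta-pages */
    unsigned flush_calls;                    /* bookkeeping for the stub below */
};

/* Flush data without holding the writer lock; if a commit slipped in while
 * flushing, flush again. Meta is flushed only after a quiet data pass. */
int toy_sync_env(struct toy_env *env, unsigned max_passes, unsigned *passes_out)
{
    unsigned passes = 0;
    int rc = TOY_SYNC_BUSY;

    while (passes < max_passes) {
        unsigned long before = atomic_load(&env->last_txnid);
        rc = env->flush_data(env);
        passes++;
        if (rc != 0)
            break;                           /* I/O error: give up */
        if (atomic_load(&env->last_txnid) == before) {
            rc = env->flush_meta(env);       /* no commit during the data flush */
            break;
        }
        rc = TOY_SYNC_BUSY;                  /* a commit happened: try again */
    }
    if (passes_out)
        *passes_out = passes;
    return rc;
}

/* Stub: pretends one writer commits during the first data flush only. */
int toy_flush_one_commit(struct toy_env *env)
{
    if (env->flush_calls++ == 0)
        atomic_fetch_add(&env->last_txnid, 1);
    return 0;
}

/* Stub: a flush that always succeeds. */
int toy_flush_ok(struct toy_env *env)
{
    (void)env;
    return 0;
}
```

With the one-commit stub the loop finishes in two passes, which matches the behaviour reported above. The sketch still leaves a window between the final txn-id check and the meta flush, which is one reason taking the write mutex unconditionally, as suggested in the review, is the simpler option.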

>> commit 147f41a8110f28456bc32123bde86d47183f9c0a
>> Author: Leo Yuriev <leo@yuriev.ru>
>> Date:   2014-09-04 16:01:15 +0400
>>
>>       FEATURE - lmdb: implementation of "checkpoint kbytes".
>>
>>       Force flush when volume of the changes reached a configurable
>> threshold.
>
>
> Probably OK. Needs some typographical cleanup. Not sure "syncbytes" is a
> good name.

Agreed.
I just took the first option and tried to retain the existing style.
Ideas?

>> commit fb82a0b688f4c31313d0790415feda8aaa18651c
>> Author: Leo Yuriev <leo@yuriev.ru>
>> Date:   2014-09-04 15:18:16 +0400
>>
>>       CHANGE - lmdb-backend: checkpoint-interval in seconds instead of
>> minutes.
>
>
> Gratuitous change. We used minutes since the BDB backend uses minutes, and
> the intention was to maintain parallel functionality. What's the
> justification for this change?

As I wrote above, we are using the combination:
  envflags writemap nosync lifo
  checkpoint 0 1

If the interval is specified in minutes, it cannot be set to less
than one minute.
But that is too long a window in which updates could be lost.

However, by setting the synchronization interval to one second,
we reduce the losses in the event of a crash to an acceptable level,
while the load on the storage system remains acceptable even for a large
flow of updates.

As a result, I have not found a better solution than simply replacing
minutes with seconds.
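
A minimal sketch of the checkpoint decision being discussed, assuming the two "checkpoint" arguments are a kbytes threshold and an interval in seconds (as in the patched backend) and that zero means "ignore this threshold". The struct and function names are hypothetical and not taken from back-mdb.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of "checkpoint <kbytes> <seconds>"; 0 disables a field. */
struct checkpoint_policy {
    uint64_t kbytes;   /* force a sync after this much data has been written */
    uint32_t seconds;  /* force a sync after this much time has passed */
};

/* True when either enabled threshold has been reached since the last sync. */
bool checkpoint_due(const struct checkpoint_policy *policy,
                    uint64_t kbytes_since_sync, uint32_t seconds_since_sync)
{
    if (policy->kbytes != 0 && kbytes_since_sync >= policy->kbytes)
        return true;
    if (policy->seconds != 0 && seconds_since_sync >= policy->seconds)
        return true;
    return false;
}
```

With a policy of {0, 1}, corresponding to "checkpoint 0 1", the volume of writes never triggers a sync but one elapsed second does, so the window of updates at risk under nosync is about one second rather than the one-minute minimum imposed by minute granularity.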

>> commit fc409d89e0d9dde20f612e34c2a463c8a81ea000
>> Author: Leo Yuriev <leo@yuriev.ru>
>> Date:   2014-09-20 06:51:04 +0400
>>
>>       EXTENSION - lmdb: more useful info from mdb_stat tool.
>
>
> A bit ambiguous. me_tail_txnid is actually the ID of the oldest reader, not
> the "last" reader. I'm not convinced of the value of this patch, since you
> can already view the readers list.

I agree, so "tail" is the best choice.
But the main value of this patch is not showing the txn of the oldest
reader, but showing information about page usage,
especially the number of pages that are "blocked" by the oldest (laggard)
reader, and how many pages are actually available.
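
The kind of summary described here could be computed along these lines. This is a simplified sketch, not the mdb_stat patch: the types and the function are hypothetical, and it assumes that a free-list record becomes reclaimable only when the txn id that freed it is older than the oldest reader's txn id.

```c
#include <stddef.h>

/* Hypothetical free-list record: pages released by one transaction. */
struct free_record {
    unsigned long txnid;  /* transaction that freed the pages */
    size_t pages;         /* number of pages in the record */
};

struct free_usage {
    size_t reclaimable;   /* pages a writer could reuse right now */
    size_t blocked;       /* pages still pinned by the oldest reader */
};

/* Split the free pages into reclaimable and blocked counts. */
struct free_usage summarize_free_pages(const struct free_record *records,
                                       size_t record_count,
                                       unsigned long oldest_reader_txnid)
{
    struct free_usage usage = { 0, 0 };
    for (size_t i = 0; i < record_count; i++) {
        if (records[i].txnid < oldest_reader_txnid)
            usage.reclaimable += records[i].pages;
        else
            usage.blocked += records[i].pages;
    }
    return usage;
}
```

For example, records of 10, 4 and 7 pages freed by txns 5, 8 and 12, with the oldest reader at txn 8, give 10 reclaimable and 11 blocked pages in this model.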

> --
>   -- Howard Chu
>   CTO, Symas Corp.           http://www.symas.com
>   Director, Highland Sun     http://highlandsun.com/hyc/
>   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Thank you in advance.
BR.
Leonid Yuriev.

Comment 5 Leonid Yuriev 2014-10-03 20:13:16 UTC

As directed by Kurt Zeilenga (Executive Director, Kurt@openldap.org), I
have re-submitted this as the new ITS#7958 with an updated IPR statement.
http://www.openldap.org/its/index.cgi/Incoming?id=7958;selectid=7958

Best regards,
Leonid.

Comment 6 OpenLDAP project 2017-09-11 16:11:23 UTC

See ITS#7958

Comment 7 Quanah Gibson-Mount 2017-09-11 16:11:23 UTC

changed notes
changed state Open to Closed