Issue 7958 - LMDB: LIFO-reclaiming, write-performance improvement & bugfixes
Summary: LMDB: LIFO-reclaiming, write-performance improvement & bugfixes
Status: VERIFIED WONTFIX
Alias: None
Product: LMDB
Classification: Unclassified
Component: liblmdb
Version: unspecified
Hardware: All All
Importance: --- normal
Target Milestone: ---
Assignee: OpenLDAP project
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-10-03 19:41 UTC by Leonid Yuriev
Modified: 2020-06-08 17:38 UTC

See Also:


Attachments

Description Leonid Yuriev 2014-10-03 19:41:47 UTC
Full_Name: Leonid Yuriev
Version: 2.4.40
OS: RHEL7
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (31.130.36.33)


Solution for: ITS#7841 and "OpenLDAP + LMDB Back-End - request 300719-14-EXO"

When LMDB is used as a backend under heavy load with add/modify/delete
transactions, a huge number of disk writes is generated.
In general, this patch set improves write performance by a factor of 10-100 at
the cost of on-disk consistency lagging by up to one second.

1. Adds a configurable LIFO policy for reclaiming FreeDB records.

Thus, only a small subset of pages is updated and rewritten on disk
repeatedly. This allows the storage subsystem to combine such disk writes
effectively. As a result, write performance grows by up to 100 times with a
write-back cache or in "writemap" mode.

2. Checkpoints with consistency and one-second granularity.

The following settings, for example, are possible and very useful:
  envflags writemap nosync lifo
  checkpoint 0 1

3. Related bugfixes and minor extensions.
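
The effect described in point 1 can be illustrated with a toy model. This is
only a sketch under stated assumptions, not LMDB's FreeDB code: the free list
is a plain ring buffer, every transaction rewrites a fixed number of pages,
and freed pages become reusable immediately (real LMDB must wait for readers
and for the meta pages). All names and constants are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FREE_PAGES  64  /* pages initially on the free list (assumed) */
#define TXN_PAGES    4  /* pages each transaction rewrites (assumed) */
#define TOTAL_PAGES (FREE_PAGES + TXN_PAGES)

/* Count the distinct pages written by `txns` copy-on-write transactions
 * when free pages are reused in FIFO order (lifo == false) or in LIFO
 * order (lifo == true). */
size_t distinct_pages_written(size_t txns, bool lifo)
{
    size_t ring[TOTAL_PAGES];              /* free list as a ring buffer */
    size_t head = 0, count = 0;            /* oldest entry, entry count */
    size_t live[TXN_PAGES];                /* pages holding current data */
    bool written[TOTAL_PAGES] = { false };
    size_t distinct = 0;

    for (size_t i = 0; i < FREE_PAGES; i++)
        ring[count++] = i;
    for (size_t i = 0; i < TXN_PAGES; i++)
        live[i] = FREE_PAGES + i;

    for (size_t t = 0; t < txns; t++) {
        size_t fresh[TXN_PAGES];

        /* Allocate pages for the new copies. */
        for (size_t i = 0; i < TXN_PAGES; i++) {
            assert(count > 0);
            if (lifo) {
                /* Newest free page: taken from the tail. */
                fresh[i] = ring[(head + count - 1) % TOTAL_PAGES];
            } else {
                /* Oldest free page: taken from the head. */
                fresh[i] = ring[head];
                head = (head + 1) % TOTAL_PAGES;
            }
            count--;
            if (!written[fresh[i]]) {
                written[fresh[i]] = true;
                distinct++;
            }
        }

        /* The old copies become free; the new copies become live. */
        for (size_t i = 0; i < TXN_PAGES; i++) {
            ring[(head + count) % TOTAL_PAGES] = live[i];
            count++;
            live[i] = fresh[i];
        }
    }
    return distinct;
}
```

In this model FIFO reuse eventually cycles through every page of the file,
while LIFO reuse keeps alternating between two small sets of pages, which is
the pattern a write-back cache can absorb.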

--

The attached files are derived from OpenLDAP Software. All of the modifications
to OpenLDAP Software represented in the following patch(es) were developed by
Peter-Service LLC, Moscow, Russia. Peter-Service LLC has not assigned rights
and/or interest in this work to any party. I, Leonid Yuriev am authorized by
Peter-Service LLC, my employer, to release this work under the following terms.

Peter-Service LLC hereby places the following modifications to OpenLDAP Software
(and only these modifications) into the public domain. Hence, these
modifications may be freely used and/or redistributed for any purpose with or
without attribution and/or other notice.

commit 841059330fd44769e93eb4b937c3ce42654fad6f
Author: Leo Yuriev <leo@yuriev.ru>
Date:   2014-09-20 07:16:15 +0400

     BUGFIX - lmdb: lock meta-pages in writemap-mode to avoid unexpected write,
               before the data pages would be synchronized.
   
     Without locking, the meta-pages may be written by the OS before other data;
     in this case the database would be inconsistent.

commit 6240c3350e8bd86337c7e41722cf6a38881f15e7
Author: Leo Yuriev <leo@yuriev.ru>
Date:   2014-09-12 01:32:13 +0400

     BUGFIX - lmdb: reordering of instructions which update the txn in a
meta-page.
   
     Without "volatile" or a memory barrier, the compiler may reorder the
     instructions that update the "mm_txnid" field of the meta-page in
     "writemap" mode.
   
     From the reader's point of view, this causes a short
     time interval during which the transaction is corrupted.

commit accef62de7fe5660f870f4c5da319a2a8098b2fb
Author: Leo Yuriev <leo@yuriev.ru>
Date:   2014-09-21 02:29:50 +0400

     BUGFIX - lmdb: 'volatile' to important fields, which
               may be updated by readers asynchronously.
   
     Without 'volatile', the compiler may eliminate mdb_find_oldest() calls.

commit bb83e03cf1b8bceee64550229c3becbdd5400680
Author: Leo Yuriev <leo@yuriev.ru>
Date:   2014-09-19 20:18:17 +0400

     FEATURE - lmdb-backend: support config for 'lifo' and 'coalesce' envflags.

commit 0c168d0e63ed78d13df3fc8a42f3667335678639
Author: Leo Yuriev <leo@yuriev.ru>
Date:   2014-09-20 10:13:28 +0400

     FEATURE - lmdb: MDB_LIFORECLAIM & MDB_COALESCE modes.
   
     Reclaim FreeDB in LIFO order - this is the main feature.
     Also aims to coalesce small FreeDB records.

commit 8ddd63161aeb2689822d1a8d27385d62e4e341ae
Author: Leo Yuriev <leo@yuriev.ru>
Date:   2014-09-19 22:47:19 +0400

     BUGFIX - lmdb: properly sync meta-pages in mdb_sync_env().
   
     Meta-pages may be updated during data-syncing in mdb_sync_env(),
     in this case database would be inconsistent.
   
     Check-and-retry if lead txn-id changed during flushing data in
mdb_sync_env().

commit 908677f989588d06b9f00620576dea3c5c8675d7
Author: Leo Yuriev <leo@yuriev.ru>
Date:   2014-09-04 16:10:05 +0400

     FEATURE - lmdb-backend: support for "checkpoint kbytes" config-option.

commit 147f41a8110f28456bc32123bde86d47183f9c0a
Author: Leo Yuriev <leo@yuriev.ru>
Date:   2014-09-04 16:01:15 +0400

     FEATURE - lmdb: implementation of "checkpoint kbytes".
   
     Force flush when volume of the changes reached a configurable
threshold.

commit fb82a0b688f4c31313d0790415feda8aaa18651c
Author: Leo Yuriev <leo@yuriev.ru>
Date:   2014-09-04 15:18:16 +0400

     CHANGE - lmdb-backend: checkpoint-interval in seconds instead of minutes.

commit fc409d89e0d9dde20f612e34c2a463c8a81ea000
Author: Leo Yuriev <leo@yuriev.ru>
Date:   2014-09-20 06:51:04 +0400

     EXTENSION - lmdb: more useful info from mdb_stat tool.

commit ccc7da690ffbff440643295b945fdf7886f48c97
Author: Leo Yuriev <leo@yuriev.ru>
Date:   2014-09-05 00:19:16 +0400

     TRIVIA - lmdb: clean testdb-dir while "make test".
Comment 1 Leonid Yuriev 2014-10-03 20:04:13 UTC
Fwd: (ITS#7841) high disk utilization

2014-10-03 3:13 GMT+04:00 Howard Chu <hyc@symas.com>:
>> commit 841059330fd44769e93eb4b937c3ce42654fad6f
>> Author: Leo Yuriev <leo@yuriev.ru>
>> Date:   2014-09-20 07:16:15 +0400
>>
>>       BUGFIX - lmdb: lock meta-pages in writemap-mode to avoid unexpected
>> write,
>>                 before the data pages would be synchronized.
>>
>>       Without locking the meta-pages may be writen by OS before other
>> data,
>>       in this case database would be inconsistent.
>
>
> Seems unnecessary. Won't happen by default; could happen with MDB_NOSYNC but
> that risk is already documented.

We are using the combination:
  envflags writemap nosync lifo
  checkpoint 0 1

With the checkpoint interval set in seconds, this gives us assurance of a
consistent database state on disk.
However, without this patch the meta-pages can be written by the kernel
before the data.

In fact, for a full guarantee in case of the death of the slapd process,
the meta-page should be written explicitly.
But that requires a lot of changes, and I have not done it.

>> commit 0c168d0e63ed78d13df3fc8a42f3667335678639
>> Author: Leo Yuriev <leo@yuriev.ru>
>> Date:   2014-09-20 10:13:28 +0400
>>
>>       FEATURE - lmdb: MDB_LIFORECLAIM & MDB_COALESCE modes.
>>
>>       Reclaim FreeDB in LIFO order - this is a main feature.
>>       Also aim to coalesce small FreeDFB records.
>
> Will spend more time looking at this closer.

I would suggest, though I do not insist, reviewing this patch on GitHub.

>> commit 8ddd63161aeb2689822d1a8d27385d62e4e341ae
>> Author: Leo Yuriev <leo@yuriev.ru>
>> Date:   2014-09-19 22:47:19 +0400
>>
>>       BUGFIX - lmdb: properly sync meta-pages in mdb_sync_env().
>>
>>       Meta-pages may be updated during data-syncing in mdb_sync_env(),
>>       in this case database would be inconsistent.
>>
>>       Check-and-retry if lead txn-id changed during flushing data in
>> mdb_sync_env().
>
> Probably could simplify this, just obtain the write mutex unconditionally,
> then there's no need to loop or retry. But also, this depends on MDB_NOLOCK
> - if that's set, then do no locking at all.

I did it this way for performance reasons and to reduce the lock hold time.

Retries happen only under an intensive flow of changes.
In that case there are many updated pages, and writing them
takes some time.

However, in subsequent iterations (if transactions were committed
while the write was in progress)
there are far fewer modified pages, and the sync is quick.

Thus (and this was seen in tests), even with a substantial number of
transactions,
the loop usually runs only two iterations,
without locking, and the flow of changes is not suspended.
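
The check-and-retry idea described above can be sketched roughly as follows.
This is a simplified, single-threaded simulation and not the code from the
patch: struct toy_env, its fields and toy_flush_data() are hypothetical
stand-ins, and the concurrent writers are modelled by a counter that bumps
the lead txn-id while a flush is "in progress".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the environment; not LMDB's MDB_env. */
struct toy_env {
    size_t lead_txnid;           /* id of the last committed txn */
    size_t synced_txnid;         /* id covered by the last complete sync */
    size_t commits_during_flush; /* simulation knob: flushes that get
                                    overtaken by a concurrent commit */
};

/* Simulated data flush: a writer may commit while it is running. */
static void toy_flush_data(struct toy_env *env)
{
    if (env->commits_during_flush > 0) {
        env->commits_during_flush--;
        env->lead_txnid++;
    }
}

/* Flush data without holding the write lock and retry if the lead
 * txn-id moved during the flush. Returns the number of iterations used,
 * or -1 if max_retries was exhausted; in that case a caller could fall
 * back to taking the write mutex, as suggested in the review. */
int toy_sync(struct toy_env *env, int max_retries)
{
    assert(env != NULL && max_retries > 0);
    for (int attempt = 1; attempt <= max_retries; attempt++) {
        size_t before = env->lead_txnid;

        toy_flush_data(env);
        if (env->lead_txnid == before) {
            /* Nothing was committed meanwhile: the flushed data
             * matches txn `before`, so it can be recorded as synced. */
            env->synced_txnid = before;
            return attempt;
        }
    }
    return -1;
}
```

With one commit arriving during the first flush, the second iteration finds
the txn-id unchanged and completes, which matches the "usually only two
iterations" observation.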

>> commit 147f41a8110f28456bc32123bde86d47183f9c0a
>> Author: Leo Yuriev <leo@yuriev.ru>
>> Date:   2014-09-04 16:01:15 +0400
>>
>>       FEATURE - lmdb: implementation of "checkpoint kbytes".
>>
>>       Force flush when volume of the changes reached a configurable
>> threshold.
>
>
> Probably OK. Needs some typographical cleanup. Not sure "syncbytes" is a
> good name.

Agreed.
I just took the first choice and tried to retain the style.
Any ideas?

>> commit fb82a0b688f4c31313d0790415feda8aaa18651c
>> Author: Leo Yuriev <leo@yuriev.ru>
>> Date:   2014-09-04 15:18:16 +0400
>>
>>       CHANGE - lmdb-backend: checkpoint-interval in seconds instead of
>> minutes.
>
>
> Gratuitous change. We used minutes since the BDB backend uses minutes, and
> the intention was to maintain parallel functionality. What's the
> justification for this change?

As I wrote above, we are using the combination:
  envflags writemap nosync lifo
  checkpoint 0 1

If the interval is specified in minutes, it cannot be set to less
than one minute.
But that is too long a window in which updates may be lost.

By setting the synchronization interval to one second,
we reduce the losses in the event of a failure to an
acceptable level,
while the load on the storage system remains acceptable even for a large
flow of updates.

As a result, I have not found a better solution than simply replacing
minutes with seconds.

>> commit fc409d89e0d9dde20f612e34c2a463c8a81ea000
>> Author: Leo Yuriev <leo@yuriev.ru>
>> Date:   2014-09-20 06:51:04 +0400
>>
>>       EXTENSION - lmdb: more usefull info from mdb_stat tool.
>
>
> A bit ambiguous. me_tail_txnid is actually the ID of the oldest reader, not
> the "last" reader. I'm not convinced of the value of this patch, since you
> can already view the readers list.

I agree that "tail" is not the best choice.
But the main value of this patch is not showing the txn of the oldest
reader, but showing information about page usage,
especially the number of pages that are "blocked" by the oldest (laggard)
reader, and how many pages are actually available.

> --
>   -- Howard Chu
>   CTO, Symas Corp.           http://www.symas.com
>   Director, Highland Sun     http://highlandsun.com/hyc/
>   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Thank you in advance.
BR.
Leonid Yuriev.

Comment 2 Leonid Yuriev 2014-10-03 20:17:54 UTC
2014-10-03 3:13 GMT+04:00 Howard Chu <hyc@symas.com>:
>> commit fc409d89e0d9dde20f612e34c2a463c8a81ea000
>> Author: Leo Yuriev <leo@yuriev.ru>
>> Date:   2014-09-20 06:51:04 +0400
>>
>>       EXTENSION - lmdb: more usefull info from mdb_stat tool.
>
>
> A bit ambiguous. me_tail_txnid is actually the ID of the oldest reader, not
> the "last" reader. I'm not convinced of the value of this patch, since you
> can already view the readers list.

I agree that "tail" is NOT the best choice.
But the main value of this patch is not showing the txn of the oldest
reader, but showing information about page usage,
especially the number of pages that are "blocked" by the oldest (laggard)
reader, and how many pages are actually available.


Comment 3 Leonid Yuriev 2014-10-06 08:51:30 UTC
https://github.com/leo-yuriev/openldap-lmdb-challenge/pull/1

Best regards.
Leonid Yuriev.

Comment 4 Howard Chu 2014-10-19 05:09:23 UTC
Леонид Юрьев wrote:
> 2014-10-03 3:13 GMT+04:00 Howard Chu <hyc@symas.com>:
> 2014-10-04 0:04 GMT+04:00 Леонид Юрьев <leo@yuriev.ru>:

>> 2014-10-03 3:13 GMT+04:00 Howard Chu <hyc@symas.com>:
>>>> commit 841059330fd44769e93eb4b937c3ce42654fad6f
>>>> Author: Leo Yuriev <leo@yuriev.ru>
>>>> Date:   2014-09-20 07:16:15 +0400
>>>>
>>>>        BUGFIX - lmdb: lock meta-pages in writemap-mode to avoid unexpected
>>>> write,
>>>>                  before the data pages would be synchronized.
>>>>
>>>>        Without locking the meta-pages may be writen by OS before other
>>>> data,
>>>>        in this case database would be inconsistent.
>>>
>>>
>>> Seems unnecessary. Won't happen by default; could happen with MDB_NOSYNC but
>>> that risk is already documented.
>>
>> We are using the combination:
>>    envflags writemap nosync lifo
>>    checkpoint 0 1
>>
>> If the checkpoint is set in seconds, it gives us the assurance
>> consistent state database on disk.
>> However, without this patch meta-pages can be written by the kernel
>> before the data.
>>
>> In fact, for a full guarantee in case of death slapd process,
>> meta-page should be written explicitly.

No, the DB can never go inconsistent due to a process crashing - the pages in 
OS cache are always correct. It can only go inconsistent if the OS crashes and 
a proper sync has not occurred.

>> But it requires a lot of changes and I do not do that.

>>>> commit 8ddd63161aeb2689822d1a8d27385d62e4e341ae
>>>> Author: Leo Yuriev <leo@yuriev.ru>
>>>> Date:   2014-09-19 22:47:19 +0400
>>>>
>>>>        BUGFIX - lmdb: properly sync meta-pages in mdb_sync_env().
>>>>
>>>>        Meta-pages may be updated during data-syncing in mdb_sync_env(),
>>>>        in this case database would be inconsistent.
>>>>
>>>>        Check-and-retry if lead txn-id changed during flushing data in
>>>> mdb_sync_env().

Fundamentally, you are trying to make an inherently unsafe configuration 
"safer", but it's impossible. Assume you have mlock'd the meta pages into 
memory, so the OS never flushes them itself any more, and you're running with 
NOSYNC. That means, within 3 transactions, the data pages on disk will be out 
of sync with the meta pages on disk. If the OS crashes at that point, the 
entire DB will be lost.

The only way to make this mode of operation somewhat safe is to defer 
reclaiming pages for even longer. E.g., instead of halting at current_txnid - 
3, halt at current_txnid - 22, in which case the data pointed to by the 
on-disk meta pages cannot get obsolete until 20 transactions have occurred.

Note that in combination with your LIFO patch, it's pretty much guaranteed 
that the on-disk meta pages will be useless after only 2 un-sync'd transactions.

>>> Probably could simplify this, just obtain the write mutex unconditionally,
>>> then there's no need to loop or retry. But also, this depends on MDB_NOLOCK
>>> - if that's set, then do no locking at all.
>>
>> I did so for reasons of performance and less a lock retention time.
>>
>> Retries will be if there an intensive flow of changes.
>> In this case it will be a lot of updated pages, the record which will
>> take some time.
>>
>> However, in subsequent iterations (if a transactions had committed
>> while there was a record),
>> the modified pages will be much fewer, and the sync will be quick.
>>
>> Thus (and it was seen in tests) even when a substantial amount of the
>> transactions,
>> usually only two iterations of the cycle,
>> without locking and flow of changes are not suspended.

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Comment 5 Leonid Yuriev 2014-10-19 21:27:46 UTC
>>> We are using the combination:
>>>    envflags writemap nosync lifo
>>>    checkpoint 0 1
>>>
>>> If the checkpoint is set in seconds, it gives us the assurance
>>> consistent state database on disk.
>>> However, without this patch meta-pages can be written by the kernel
>>> before the data.
>>>
>>> In fact, for a full guarantee in case of death slapd process,
>>> meta-page should be written explicitly.
>
>
> No, the DB can never go inconsistent due to a process crashing - the pages
> in OS cache are always correct. It can only go inconsistent if the OS
> crashes and a proper sync has not occurred.

Yes, Howard, you are right.
But apparently I need to be more precise.
By the "death" of slapd I meant all possible causes, including power loss.

For example, a power-off case:
- The main power is turned off and the system switches to the UPS.
- On this notification, the OS starts an emergency shutdown of processes.
- For some reason (it does not have enough time to stop) slapd receives SIGKILL.
- The OS tries to write out the mmap region of the DB file, beginning with the
lower addresses.
- Suppose the meta-pages have been written completely, but there is not enough
battery power to write the rest of the data.
- Now the DB on disk is completely destroyed.

To avoid this, the meta-pages should not be included in the rw-mapped
region, and should be written explicitly after the data pages.
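
The ordering proposed here (data pages first, meta-page last) might look
roughly like the sketch below. It is not LMDB code: it uses only plain POSIX
calls on an ordinary file, the file layout is reduced to two meta slots
followed by data pages, and all toy_* names and TOY_* constants are
hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define TOY_PAGE_SIZE  4096
#define TOY_META_PAGES 2    /* two alternating meta slots (assumed layout) */

/* Write all of buf at offset off; returns 0 on success, -1 on error. */
static int write_all(int fd, const void *buf, size_t len, off_t off)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n <= 0)
            return -1;
        p += n;
        off += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Make `npages` data pages durable, and only then publish a meta page
 * carrying `txnid`. Returns 0 on success, -1 on error. */
int toy_checkpoint(const char *path, const void *pages, size_t npages,
                   uint64_t txnid)
{
    unsigned char meta[TOY_PAGE_SIZE] = { 0 };
    off_t data_off = (off_t)TOY_META_PAGES * TOY_PAGE_SIZE;
    off_t meta_off = (off_t)(txnid % TOY_META_PAGES) * TOY_PAGE_SIZE;
    int fd, rc;

    assert(pages != NULL && npages > 0);
    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    memcpy(meta, &txnid, sizeof txnid);

    /* Step 1: write the data pages and wait until they are on disk. */
    rc = write_all(fd, pages, npages * TOY_PAGE_SIZE, data_off);
    if (rc == 0)
        rc = fsync(fd);
    /* Step 2: only now write the meta page that refers to them. */
    if (rc == 0)
        rc = write_all(fd, meta, sizeof meta, meta_off);
    if (rc == 0)
        rc = fsync(fd);

    close(fd);
    return rc == 0 ? 0 : -1;
}

/* Read the txnid stored in a meta slot; returns 0 on success. */
int toy_read_meta(const char *path, unsigned slot, uint64_t *txnid_out)
{
    int fd = open(path, O_RDONLY);
    ssize_t n;

    if (fd < 0)
        return -1;
    n = pread(fd, txnid_out, sizeof *txnid_out,
              (off_t)slot * TOY_PAGE_SIZE);
    close(fd);
    return n == (ssize_t)sizeof *txnid_out ? 0 : -1;
}
```

If power is lost between step 1 and step 2, the previous meta slot still
describes a complete older snapshot, provided its pages were not reclaimed
in the meantime.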


>>>>> commit 8ddd63161aeb2689822d1a8d27385d62e4e341ae
>>>>> Author: Leo Yuriev <leo@yuriev.ru>
>>>>> Date:   2014-09-19 22:47:19 +0400
>>>>>
>>>>>        BUGFIX - lmdb: properly sync meta-pages in mdb_sync_env().
>>>>>
>>>>>        Meta-pages may be updated during data-syncing in mdb_sync_env(),
>>>>>        in this case database would be inconsistent.
>>>>>
>>>>>        Check-and-retry if lead txn-id changed during flushing data in
>>>>> mdb_sync_env().
>
> Fundamentally, you are trying to make an inherently unsafe configuration
> "safer", but it's impossible. Assume you have mlock'd the meta pages into
> memory, so the OS never flushes them itself any more, and you're running
> with NOSYNC. That means, within 3 transactions, the data pages on disk will
> be out of sync with the meta pages on disk. If the OS crashes at that point,
> the entire DB will be lost.

That is not a problem.
As I explained above, we should write the meta-pages explicitly after
the data sync.
But we should also not reclaim pages ahead of the last checkpoint.

> The only way to make this mode of operation somewhat safe is to defer
> reclaiming pages for even longer. E.g., instead of halting at current_txnid
> - 3, halt at current_txnid - 22, in which case the data pointed to by the
> on-disk meta pages cannot get obsolete until 20 transactions have occurred.
>
> Note that in combination with your LIFO patch, it's pretty much guaranteed
> that the on-disk meta pages will be useless after only 2 un-sync'd
> transactions.

Yes, Howard, you are right.
But I think the discussion is confused by mixing the
LIFO feature with the changes for checkpoint consistency in the NOSYNC and
WRITEMAP+NOSYNC modes.
For the "NOSYNC + checkpoints" topic I will submit a separate ITS (like
the 'volatile'-related 7969, 7970, 7971).
In my opinion it is a flaw, and there is no reason not to fix it.

Continuing the conversation about checkpoints in the LIFO context:
I saw the problem that you described and have been thinking over a solution,
but have not yet found the "golden mean".
And since we are having a serious problem with syncrepl, I have put
off this task with the excuse "the LIFO patch does not make things worse
than they were."

In general, we should not reclaim anything ahead of the txn that
is synced to disk (let this be called the R-rule).
To do so we need a second field like mti_txnid, but one that is
updated only at the end of mdb_env_write_meta().
Finally, we should start the search in mdb_find_oldest() from the value of this
new field instead of the current txn number.
This seems like it will work fine.
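
A minimal sketch of the R-rule, under the assumption that the reclaim bound
is simply the minimum of the last synced txn and all reader snapshots. The
struct and function names are hypothetical and only mimic the roles of
mti_txnid and mdb_find_oldest(); this is not LMDB's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t txnid_t;

/* Hypothetical simplified state; names are illustrative only. */
struct toy_txninfo {
    txnid_t last_txnid;     /* last committed txn (role of mti_txnid) */
    txnid_t synced_txnid;   /* proposed second field: last txn whose
                               meta page is known to be on disk */
    const txnid_t *readers; /* snapshot txn ids of active readers */
    size_t nreaders;
};

/* Upper bound for reclaiming: the search starts from the last synced
 * txn rather than from the current txn, then is lowered further by any
 * reader holding an older snapshot. */
txnid_t toy_find_oldest(const struct toy_txninfo *ti)
{
    txnid_t oldest;

    assert(ti->synced_txnid <= ti->last_txnid);
    oldest = ti->synced_txnid;
    for (size_t i = 0; i < ti->nreaders; i++) {
        if (ti->readers[i] < oldest)
            oldest = ti->readers[i];
    }
    return oldest;
}
```

With no readers the bound is the synced txn itself, so pages still referenced
by the on-disk meta pages are never handed out again before the next
checkpoint.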

However, I paused to reason about the purpose of
checkpoints, about the design of LMDB as a product, about the expectations of
the user, and about the necessary configuration parameters:
- checkpoints are needed ONLY in nosync modes;
- if the user does NOT activate checkpoints, he does not care about
consistency;
- but if they are turned on, we MUST provide consistency at the checkpoints;
- otherwise the checkpoint feature is pointless and should be REMOVED.
Therefore the implementation of checkpoints & reclaiming should be updated
to conform to the "R-rule" noted above.

From this point of view the LIFO feature should also be refined, but
it can nevertheless be very useful.
- SYNC mode = benefits from storage with a write-back cache
(assumed to be battery-backed).
- ASYNC/NOSYNC without checkpoints = significant reduction of
committed/dirty pages and thereby far fewer write IOPS.
- ASYNC/NOSYNC with checkpoints = seems to be the same as the SYNC case.

To sum up all of the above: I think we first need to fix reclaiming or
remove checkpoints, and then I will complete LIFO.

>   -- Howard Chu
>   CTO, Symas Corp.           http://www.symas.com
>   Director, Highland Sun     http://highlandsun.com/hyc/
>   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Thanks for the conversation.
Leonid.

Comment 6 Leonid Yuriev 2015-01-12 16:46:21 UTC
For information only: the 'lifo' and 'coalesce' features are now
implemented in the ReOpenLDAP fork.
1) lmdb: MDB_LIFORECLAIM & MDB_COALESCE modes.
https://github.com/ReOpen/ReOpenLDAP/commit/829c2063b602238b5c93ea36a981de3d0d7994bc
2) lmdb-backend: support config for 'lifo' and 'coalesce' envflags.
https://github.com/ReOpen/ReOpenLDAP/commit/08b4a41b5b837548444ef0fef7614940c41d882a

With a couple of open issues:
1) lmdb in 'writemap' mode may be inconsistent even with checkpoints
https://github.com/ReOpen/ReOpenLDAP/issues/1
2) the lifo feature should be synchronized with checkpoints
https://github.com/ReOpen/ReOpenLDAP/issues/2

However, it currently gives a reasonable boost (5-10 times) in
write performance in our use case.

Leonid.


Comment 7 Leonid Yuriev 2020-06-07 20:39:55 UTC
MDBX_LIFORECLAIM implemented & checked in the libmdbx project.