
Re: (ITS#7958) LMDB: LIFO-reclaiming, write-performance improvement & bugfixes



For information only: the 'lifo' and 'coalesce' features are now
implemented in the ReOpenLDAP fork.
1) lmdb: MDB_LIFORECLAIM & MDB_COALESCE modes.
https://github.com/ReOpen/ReOpenLDAP/commit/829c2063b602238b5c93ea36a981de3d0d7994bc
2) lmdb-backend: support config for 'lifo' and 'coalesce' envflags.
https://github.com/ReOpen/ReOpenLDAP/commit/08b4a41b5b837548444ef0fef7614940c41d882a

There are a couple of open issues:
1) lmdb in 'writemap' mode may become inconsistent even with checkpoints
https://github.com/ReOpen/ReOpenLDAP/issues/1
2) the lifo feature should be synchronized with checkpoints
https://github.com/ReOpen/ReOpenLDAP/issues/2

However, it currently gives a reasonable boost (5-10 times) in
write performance for our use case.

Leonid.

2014-10-20 0:27 GMT+03:00 Леонид Юрьев <leo@yuriev.ru>:
>>>> We are using the combination:
>>>>    envflags writemap nosync lifo
>>>>    checkpoint 0 1
>>>>
>>>> If the checkpoint is set in seconds, it assures us of a consistent
>>>> database state on disk.
>>>> However, without this patch meta-pages can be written by the kernel
>>>> before the data.
>>>>
>>>> In fact, for a full guarantee in case the slapd process dies,
>>>> the meta-page should be written explicitly.
>>
>>
>> No, the DB can never go inconsistent due to a process crashing - the pages
>> in OS cache are always correct. It can only go inconsistent if the OS
>> crashes and a proper sync has not occurred.
>
> Yes, Howard, you are right.
> But apparently I need to be more precise.
> By the "death" of slapd I meant all possible causes, including power-off.
>
> For example, a power-off case:
> - The main power is turned off and the system switches to the UPS.
> - On receiving the notice, the OS starts an emergency shutdown of processes.
> - For some reason (e.g. not enough time to stop) slapd receives SIGKILL.
> - The OS starts writing out the mmap region of the DB file, beginning with
> the lowest address.
> - Suppose the meta-pages are written completely, but the battery runs out
> before the rest of the data is written.
> - Now the DB on disk is completely destroyed.
>
> To avoid this, the meta-pages should not be included in the rw-mapped
> region, and should be written explicitly after the data pages.
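
To illustrate the ordering described above, here is a minimal sketch in C.
It assumes the meta page sits at file offset 0, outside the read-write
mapping, and that the data pages start at the next page boundary.
checkpoint_data_then_meta() and demo_checkpoint() are hypothetical names
used only for illustration; they are not LMDB functions.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700  /* expose pread/pwrite/mkstemp under strict -std=c11 */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not an LMDB function.
 * `map` is a MAP_SHARED mapping that covers the data pages only; the meta
 * page at file offset `meta_off` is written through the descriptor.
 */
static int checkpoint_data_then_meta(int fd, void *map, size_t map_len,
                                     const void *meta, size_t meta_len,
                                     off_t meta_off)
{
    assert(map != NULL && meta != NULL);

    /* 1. Push every dirty data page to the file and wait for completion. */
    if (msync(map, map_len, MS_SYNC) != 0)
        return -1;

    /* 2. Only now publish the meta page that points at that data. */
    if (pwrite(fd, meta, meta_len, meta_off) != (ssize_t)meta_len)
        return -1;

    /* 3. Make the meta page itself durable. */
    return fsync(fd);
}

/* Demo on a scratch file: page 0 plays the meta page, page 1 holds data. */
static int demo_checkpoint(void)
{
    char path[] = "/tmp/ckpt-demo-XXXXXX";
    long psz = sysconf(_SC_PAGESIZE);
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);  /* the file disappears once fd is closed */
    if (ftruncate(fd, 2 * psz) != 0) { close(fd); return -1; }

    /* Map only the data page; the meta page stays outside the mapping. */
    char *data = mmap(NULL, (size_t)psz, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, psz);
    if (data == MAP_FAILED) { close(fd); return -1; }
    memcpy(data, "payload", 8);

    const char meta[] = "txn=1";
    int rc = checkpoint_data_then_meta(fd, data, (size_t)psz,
                                       meta, sizeof meta, 0);

    /* Read the meta page back through the descriptor to check it landed. */
    char back[sizeof meta] = {0};
    if (rc == 0 && (pread(fd, back, sizeof back, 0) != (ssize_t)sizeof back
                    || memcmp(back, meta, sizeof meta) != 0))
        rc = -1;

    munmap(data, (size_t)psz);
    close(fd);
    return rc;
}
```

Because the kernel never sees the meta page as a dirty mapped page in this
layout, it cannot write it out ahead of the data on its own.
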
>
>
>>>>>> commit 8ddd63161aeb2689822d1a8d27385d62e4e341ae
>>>>>> Author: Leo Yuriev <leo@yuriev.ru>
>>>>>> Date:   2014-09-19 22:47:19 +0400
>>>>>>
>>>>>>        BUGFIX - lmdb: properly sync meta-pages in mdb_sync_env().
>>>>>>
>>>>>>        Meta-pages may be updated during data-syncing in mdb_sync_env(),
>>>>>>        in this case database would be inconsistent.
>>>>>>
>>>>>>        Check-and-retry if lead txn-id changed during flushing data in
>>>>>>        mdb_sync_env().
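
The check-and-retry idea in that commit message can be modelled roughly as
follows. This is only a single-threaded sketch, not the actual patch:
toy_env, sync_with_retry() and racing_flush() are made-up names, and real
code would need proper synchronization when reading the txn id.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the environment; not LMDB's MDB_env. */
typedef struct {
    uint64_t last_txnid;  /* id of the newest committed txn */
    int flush_calls;      /* bookkeeping for the demo only */
} toy_env;

typedef int (*flush_fn)(toy_env *env);

/*
 * Flush data, then confirm that no commit landed in the meantime.
 * On success, *synced holds the txn id the flushed data is consistent with.
 */
static int sync_with_retry(toy_env *env, flush_fn flush_data, uint64_t *synced)
{
    assert(env && flush_data && synced);
    for (;;) {
        uint64_t before = env->last_txnid;
        int rc = flush_data(env);
        if (rc != 0)
            return rc;
        if (env->last_txnid == before) {
            *synced = before;
            return 0;
        }
        /* A writer committed during the flush: go around again. */
    }
}

/* Demo flush: pretends a writer commits during the first flush only. */
static int racing_flush(toy_env *env)
{
    if (env->flush_calls++ == 0)
        env->last_txnid++;
    return 0;
}
```

With racing_flush() the first pass sees the txn id move and the loop flushes
a second time before reporting success.
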
>>
>> Fundamentally, you are trying to make an inherently unsafe configuration
>> "safer", but it's impossible. Assume you have mlock'd the meta pages into
>> memory, so the OS never flushes them itself any more, and you're running
>> with NOSYNC. That means, within 3 transactions, the data pages on disk will
>> be out of sync with the meta pages on disk. If the OS crashes at that point,
>> the entire DB will be lost.
>
> Not a problem.
> As I explained above, we should write the meta-pages explicitly after
> the data sync.
> But we should also not reclaim pages ahead of the last checkpoint.
>
>> The only way to make this mode of operation somewhat safe is to defer
>> reclaiming pages for even longer. E.g., instead of halting at current_txnid
>> - 3, halt at current_txnid - 22, in which case the data pointed to by the
>> on-disk meta pages cannot get obsolete until 20 transactions have occurred.
>>
>> Note that in combination with your LIFO patch, it's pretty much guaranteed
>> that the on-disk meta pages will be useless after only 2 un-sync'd
>> transactions.
>
> Yes, Howard, you are right.
> But I think the discussion got confused because it mixes the
> LIFO feature with the changes for checkpoint consistency in the NOSYNC and
> WRITEMAP+NOSYNC modes.
> For the "NOSYNC + checkpoints" topic I will submit a separate ITS (like
> the 'volatile'-related 7969, 7970, 7971).
> In my opinion it is a flaw, and there is no reason not to fix it.
>
> Continuing the conversation about checkpoints in the LIFO context:
> I saw the problem you described and have been thinking about a solution,
> but have not yet found the "golden ratio".
> And since we are having a serious problem with syncrepl, I put
> this task off with the excuse "the LIFO patch does not make things worse
> than they were."
>
> In general, we should not reclaim anything ahead of the txn that
> is synced to disk (let us call this the R-rule).
> To do so we need a second field like mti_txnid, but one that is
> updated only at the end of mdb_env_write_meta().
> Finally, mdb_find_oldest() should start its search from the value of this
> new field instead of the current txn number.
> This looks like it will work fine.
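
A simplified model of the R-rule, assuming a second field (called
synced_txnid here) next to the current txn id. All names below are
hypothetical, and LMDB's real mdb_find_oldest() differs in detail; the
sketch only shows where the search would start.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t txnid_t;

/* Hypothetical model of the shared txn info; field names are illustrative. */
typedef struct {
    txnid_t last_txnid;    /* newest committed txn (the role of mti_txnid) */
    txnid_t synced_txnid;  /* newest txn whose meta page is known to be on disk */
} toy_txninfo;

/* To be called at the very end of the meta write, after the sync succeeded. */
static void mark_synced(toy_txninfo *ti, txnid_t txnid)
{
    assert(txnid <= ti->last_txnid);
    if (txnid > ti->synced_txnid)
        ti->synced_txnid = txnid;
}

/*
 * Reclaim bound: start from the last synced txn rather than the current
 * one, then lower it to the oldest active reader.
 */
static txnid_t find_oldest(const toy_txninfo *ti,
                           const txnid_t *readers, size_t nreaders)
{
    txnid_t oldest = ti->synced_txnid;
    for (size_t i = 0; i < nreaders; i++)
        if (readers[i] < oldest)
            oldest = readers[i];
    return oldest;
}

/* Pages freed by txn `freed_by` may be reused only below the bound. */
static int may_reclaim(const toy_txninfo *ti, const txnid_t *readers,
                       size_t nreaders, txnid_t freed_by)
{
    return freed_by < find_oldest(ti, readers, nreaders);
}
```

Until mark_synced() advances the field, pages freed after the last synced
txn stay untouched, however far the writer has moved ahead.
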
>
> However, I stopped to reason about the purpose of
> checkpoints, about the design of LMDB as a product, about the expectations
> of the user and the necessary configuration parameters:
> - checkpoints are needed ONLY in nosync modes;
> - if the user does NOT activate checkpoints, he does not care about
> consistency;
> - but if they are turned on, we MUST provide consistency at the checkpoints;
> - otherwise the checkpoint feature is pointless and should be REMOVED.
> Therefore the implementation of checkpoints & reclaiming should be updated
> to conform to the "R-rule" noted above.
>
> From this point of view the LIFO feature should also be refined, but
> it can nevertheless be very useful.
> - SYNC mode = benefits from storage with a write-back cache
> (assumed to be battery-backed).
> - ASYNC/NOSYNC without checkpoint = significant reduction of
> committed/dirty pages and thereby far fewer write IOPS.
> - ASYNC/NOSYNC with checkpoint = seems to be the same as the SYNC case.
>
> To sum up all of the above: I think we first need to fix reclaiming or
> remove checkpoints, and then I will complete LIFO.
>
>>   -- Howard Chu
>>   CTO, Symas Corp.           http://www.symas.com
>>   Director, Highland Sun     http://highlandsun.com/hyc/
>>   Chief Architect, OpenLDAP  http://www.openldap.org/project/
>
> Thanks for the conversation.
> Leonid.