
RE: (ITS#9017) Improving performance of commit sync in Windows




For the sake of putting this in the email thread (other code discussion in
GitHub), here is the latest squashed commit of the proposed patch (with the
on-demand, retained overlapped array to reduce re-malloc and opening event
handles):
https://github.com/kriszyp/node-lmdb/commit/726a9156662c703bf3d453aab75ee222072b990f

Thanks,
Kris

From: Kris Zyp
Sent: April 30, 2019 12:43 PM
To: Howard Chu; openldap-its@OpenLDAP.org
Subject: RE: (ITS#9017) Improving performance of commit sync in Windows

> What is the point of using writemap mode if you still need to use
> WriteFile on every individual page?

As I understood from the documentation, and have observed, using writemap
mode is faster (and uses less temporary memory) because it doesn't require
mallocs to allocate pages (docs: "This is faster and uses fewer mallocs").
To be clear though, LMDB is so fast and efficient that, in sync mode, it
takes enormous transactions before the time spent allocating and building
the dirty pages for the updated b-tree comes anywhere close to the time
spent waiting for the disk flush, even with an SSD. But the more pertinent
question is efficiency, measured in CPU cycles rather than elapsed time.
When I ran my tests this morning of 100 (sync) transactions with 100 puts
per transaction, times varied quite a bit, but running with writemap
enabled typically averaged about 500ms of CPU, and with writemap disabled
it typically averaged around 600ms. Not a huge difference, but still
definitely worthwhile, I think.

Caveat emptor: measuring LMDB performance with sync interactions on
Windows is one of the most frustratingly erratic things to measure. It is
sunny outside right now; times could be different when it starts raining
later, but this is what I saw this morning...

> What is the performance difference between your patch using writemap,
> and just not using writemap in the first place?

Running 1000 sync transactions on a 3GB db with a single put per
transaction, without writemap mode and without the patch, took about 60
seconds. It took about 1 second with the patch and writemap mode enabled!
(There is no significant difference in sync times with writemap enabled or
disabled once the patch is applied.) So the difference was huge in my test.
And not only that: without the patch, the CPU usage was actually _higher_
during those 60 seconds (close to 100% of a core) than during the
one-second run with the patch (close to 50%). There are certainly tests I
have run where the differences are not as large (doing small commits on
large dbs accentuates the differences), but the patch always seems to win.
It could also be that my particular configuration produces bigger
differences (an SSD drive, and maybe a more fragmented file?).

Anyway, I added error handling for the malloc and fixed/changed the other
things you suggested. I'd be happy to make any other changes you want. The
updated patch is here:
https://github.com/kriszyp/node-lmdb/commit/25366dea9453749cf6637f43ec17b9b62094acde

> OVERLAPPED* ov = malloc((pagecount - keep) * sizeof(OVERLAPPED));
> Probably this ought to just be pre-allocated based on the maximum number
> of dirty pages a txn allows.

I wasn't sure I understood this comment. Are you suggesting we
malloc(MDB_IDL_UM_MAX * sizeof(OVERLAPPED)) for each environment and
retain it for the life of the environment? I think that is 4MB, if my math
is right, which seems like a lot of memory to keep allocated (we usually
have a lot of open environments). If the goal is to reduce the number of
mallocs, how about we retain the OVERLAPPED array and only free and
re-malloc it if the previous allocation wasn't large enough? Then there is
no unnecessary allocation, and we only malloc when a transaction is bigger
than any previous one. I put this together in a separate commit, as I
wasn't sure if this is what you wanted (I can squash if you prefer):
https://github.com/kriszyp/node-lmdb/commit/2fe68fb5269c843e2e789746a17a4b2adefaac40
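
In code, the grow-on-demand reuse I have in mind is roughly the following
sketch (the OvCache struct and the ov_reserve name are only illustrative
here, standing in for wherever the retained pointer and capacity actually
live):

#include <stdlib.h>
#include <string.h>
#include <windows.h>

/* Sketch only: reuse a retained OVERLAPPED array, growing it only when a
 * transaction dirties more pages than any previous one.  OvCache stands in
 * for the fields that would be kept on the environment. */
typedef struct OvCache {
	OVERLAPPED *ov;     /* retained array, survives across commits */
	size_t      cap;    /* number of entries currently allocated   */
} OvCache;

static OVERLAPPED *ov_reserve(OvCache *c, size_t need)
{
	if (need > c->cap) {
		OVERLAPPED *bigger = malloc(need * sizeof(OVERLAPPED));
		if (!bigger)
			return NULL;     /* caller turns this into ENOMEM */
		free(c->ov);             /* old contents are not reused   */
		c->ov = bigger;
		c->cap = need;
	}
	memset(c->ov, 0, need * sizeof(OVERLAPPED));
	return c->ov;
}

The array would only be freed when the environment closes, so steady-state
commits do no allocation at all.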

Thank you for the review!

Thanks,
Kris

From: Howard Chu
Sent: April 30, 2019 7:12 AM
To: kriszyp@gmail.com; openldap-its@OpenLDAP.org
Subject: Re: (ITS#9017) Improving performance of commit sync in Windows

kriszyp@gmail.com wrote:
> Full_Name: Kristopher William Zyp
> Version: LMDB 0.9.23
> OS: Windows
> URL: https://github.com/kriszyp/node-lmdb/commit/7ff525ae57684a163d32af74a0ab9332b7fc4ce9
> Submission from: (NULL) (71.199.6.148)
>
> We have seen very poor performance on the sync of commits on large
> databases in Windows. On databases with 2GB of data, in writemap mode,
> the sync of even small commits is consistently well over 100ms (without
> writemap it is faster, but still slow). It is expected that a sync should
> take some time while waiting for disk confirmation of the writes, but
> more concerning is that these sync operations (in writemap mode) are
> instead dominated by nearly 100% system CPU utilization, so operations
> that require only sub-millisecond b-tree updates are then dominated by
> very large amounts of system CPU cycles during the sync phase.
>
> I think that the fundamental problem is that FlushViewOfFile seems to be
> an O(n) operation where n is the size of the file (or map). I presume
> that Windows is scanning the entire map/file for dirty pages to flush,
> I'm guessing because it doesn't have an internal index of all the dirty
> pages for every file/map-view in the OS disk cache. Therefore, this turns
> into an extremely expensive, CPU-bound operation to find the dirty pages
> of a large file and initiate their writes, which, of course, is contrary
> to the whole goal of a scalable database system. And FlushFileBuffers is
> relatively slow as well. We have attempted to batch as many operations
> into a single transaction as possible, but this is still a very large
> overhead.
>
> The Windows docs for FlushFileBuffers warn about the inefficiencies of
> the function itself
> (https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-flushfilebuffers).
> They also point to the solution: it is much faster to write out the dirty
> pages with WriteFile through a write-through file handle
> (FILE_FLAG_WRITE_THROUGH).
>
> The associated patch
> (https://github.com/kriszyp/node-lmdb/commit/7ff525ae57684a163d32af74a0ab9332b7fc4ce9)
> is my attempt at implementing this solution for Windows. Fortunately,
> with the design of LMDB, this is relatively straightforward. LMDB already
> supports writing out dirty pages with WriteFile calls. I added a
> write-through handle for sending these writes directly to disk. I then
> made that file handle overlapped/asynchronous, so all the writes for a
> commit can be started in overlapped mode and (at least theoretically)
> transfer to the drive in parallel, and then used GetOverlappedResult to
> wait for their completion. So basically mdb_page_flush becomes the sync.
> I extended the writing of dirty pages through WriteFile to writemap mode
> as well (for writing meta too), so that WriteFile with write-through can
> be used to flush the data without ever needing to call FlushViewOfFile or
> FlushFileBuffers. I also implemented support for write gathering in
> writemap mode, where contiguous file positions imply contiguous memory
> (by tracking the starting position with wdp and writing contiguous pages
> in single operations). Sorting of the dirty list is maintained even in
> writemap mode for this purpose.
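
Concretely, the pattern described above boils down to something like the
following sketch (purely illustrative, not the patch itself: the helper
name, the flat pages/offsets arrays, and one write per page are
simplifications, since the real code gathers contiguous pages into single
writes):

#include <stdlib.h>
#include <windows.h>

/* Sketch: start one overlapped WriteFile per dirty page on a handle opened
 * with FILE_FLAG_WRITE_THROUGH | FILE_FLAG_OVERLAPPED, then wait for every
 * request with GetOverlappedResult.  That final wait takes the place of
 * FlushViewOfFile()/FlushFileBuffers(). */
static DWORD write_dirty_pages(HANDLE fd_wt, char *const *pages,
                               const ULONGLONG *offsets,
                               DWORD page_size, DWORD npages)
{
	OVERLAPPED *ov;
	DWORD i, started, written, rc = ERROR_SUCCESS;

	ov = calloc(npages, sizeof(OVERLAPPED));
	if (!ov)
		return ERROR_NOT_ENOUGH_MEMORY;

	for (started = 0; started < npages; started++) {
		OVERLAPPED *o = &ov[started];
		o->Offset     = (DWORD)(offsets[started] & 0xffffffffu);
		o->OffsetHigh = (DWORD)(offsets[started] >> 32);
		o->hEvent     = CreateEvent(NULL, TRUE, FALSE, NULL);
		if (!o->hEvent) {
			rc = GetLastError();
			break;
		}
		/* Overlapped handle: WriteFile only queues the request here. */
		if (!WriteFile(fd_wt, pages[started], page_size, NULL, o) &&
		    GetLastError() != ERROR_IO_PENDING) {
			rc = GetLastError();
			CloseHandle(o->hEvent);
			break;
		}
	}
	/* Wait for every request that was started; this wait is the sync. */
	for (i = 0; i < started; i++) {
		if (!GetOverlappedResult(fd_wt, &ov[i], &written, TRUE) &&
		    rc == ERROR_SUCCESS)
			rc = GetLastError();
		CloseHandle(ov[i].hEvent);
	}
	free(ov);
	return rc;
}

The write-through handle here is the one opened with
CreateFile(..., FILE_FLAG_WRITE_THROUGH | FILE_FLAG_OVERLAPPED, ...); once
all the GetOverlappedResult waits complete, the data is on disk and no
FlushViewOfFile or FlushFileBuffers call is needed afterwards.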

What is the point of using writemap mode if you still need to use WriteFile
on every individual page?

> The performance benefits of this patch, in my testing, are considerable.
> Writing out/syncing transactions is typically over 5x faster in writemap
> mode, and 2x faster in standard mode. And perhaps more importantly
> (especially in environments with many threads/processes), the efficiency
> benefits are even larger, particularly in writemap mode, where there can
> be a 50-100x reduction in system CPU usage with this patch. This brings
> Windows performance with sync'ed transactions in LMDB back into the range
> of "lightning" performance :).

What is the performance difference between your patch using writemap, and
just not using writemap in the first place?

-- 
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/


