Full_Name: Hallvard Breien Furuseth Version: 2.4.47, master OS: Linux x86_64 URL: ftp://ftp.openldap.org/incoming/Hallvard-Furuseth-190119.tgz Submission from: (NULL) (193.90.50.50) Use cn=config to add an index to a subordinate database, then make a 2nd cn=config change. 2nd change waits while (I assume) indexing, and blocks other slapd operations. So ldapwhoami during 2nd change takes 5 seconds. To reproduce: ./bug.sh To debug, attach gdb in another terminal during 2nd MOD. gdb -q ../servers/slapd/slapd `cat slapd.pid` trace.txt shows the 2nd config_back_modify() waiting in slap_pause_server().
h.b.furuseth@usit.uio.no wrote: > Full_Name: Hallvard Breien Furuseth > Version: 2.4.47, master > OS: Linux x86_64 > URL: ftp://ftp.openldap.org/incoming/Hallvard-Furuseth-190119.tgz > Submission from: (NULL) (193.90.50.50) > > > Use cn=config to add an index to a subordinate database, > then make a 2nd cn=config change. 2nd change waits while > (I assume) indexing, and blocks other slapd operations. > So ldapwhoami during 2nd change takes 5 seconds. > > To reproduce: ./bug.sh > To debug, attach gdb in another terminal during 2nd MOD. > gdb -q ../servers/slapd/slapd `cat slapd.pid` > > trace.txt shows the 2nd config_back_modify() waiting > in slap_pause_server(). This is not really surprising. Are you suggesting the indexing task should stop if another config mod comes in? -- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/
Testing again 'subordinate' looks irrelevant. I wonder what I did in my original and bigger testcase to make it seem relevant. On 1/19/19 6:30 PM, Howard Chu wrote: > This is not really surprising. I don't know these interactions well enough to be unsurprised. It's not hard to guess that modify(olcDbIndex) might be heavy and had better be tested. A simple test says it returns quickly. I don't know why the ongoing indexing keeps cn=config locked after the operation succeeded, instead of keeping something in the indexed DB locked. I wouldn't expect others to guess that either. > Are you suggesting the indexing task should stop if another config mod comes in? I'm suggesting to fail the 2nd operation if that's what it takes. Or only hang if the user passes some LDAP control. I'm suggesting that config pauses should be brief. Server hangs are bad. Even a crash can be better when other load-balanced servers can take over. Not that I'm suggesting that "fix":-)
Hallvard Breien Furuseth wrote: > Testing again 'subordinate' looks irrelevant. I wonder what I > did in my original and bigger testcase to make it seem relevant. > > On 1/19/19 6:30 PM, Howard Chu wrote: >> This is not really surprising. > > I don't know these interactions well enough to be unsurprised. > > It's not hard to guess that modify(olcDbIndex) might be heavy and > had better be tested. A simple test says it returns quickly. I > don't know why the ongoing indexing keeps cn=config locked after > the operation succeeded, instead of keeping something in the > indexed DB locked. I wouldn't expect others to guess that either. modifications to cn=config require no other threads to be active. Once the background indexer starts running, its activity will of course block further cn=config modifications until it completes. >> Are you suggesting the indexing task should stop if another config mod comes in? > > I'm suggesting to fail the 2nd operation if that's what it takes. > Or only hang if the user passes some LDAP control. > > I'm suggesting that config pauses should be brief. Server hangs > are bad. Even a crash can be better when other load-balanced > servers can take over. Not that I'm suggesting that "fix":-) The *server* is not hung. Only modifications to the cn=config DB are blocked. You might make a case that background indexing on a backend shouldn't block writes to unrelated parts of the cn=config tree. The current pause design wouldn't easily support that, but it might be worth working on. Also it's not clear which parts of the tree are actually unrelated. E.g., changing schema elements could affect all backends at once. -- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/
On 1/19/19 8:04 PM, Howard Chu wrote: > The *server* is not hung. Only modifications to the cn=config DB are blocked. Yes it is. See the ldapwhoami in my example.
On 1/19/19 9:55 PM, h.b.furuseth@usit.uio.no wrote: > On 1/19/19 8:04 PM, Howard Chu wrote: >> The *server* is not hung. Only modifications to the cn=config DB are blocked. > > Yes it is. That is, the 2nd mod to cn=config blocks, and that's fine by me, except _then_ the server is hung. > See the ldapwhoami in my example. And the 'time ldapwhoami' result in output.txt. Sorry, unreadable scripting on my part. Notice the single '&' in ldapmodify & { sleep; ldapwhoami; } That is, ldapmodify in the background and then ldapwhoami. I was pasting to/from the command line, and didn't want the ldapmodify to start until I pressed Return for the ldapwhoami. Thus the weird concatenation.
Hallvard Breien Furuseth wrote: > On 1/19/19 9:55 PM, h.b.furuseth@usit.uio.no wrote: >> On 1/19/19 8:04 PM, Howard Chu wrote: >>> The *server* is not hung. Only modifications to the cn=config DB are blocked. >> >> Yes it is. > > That is, the 2nd mod to cn=config blocks, and that's fine by me, except > _then_ the server is hung. Ah, the 2nd cn=config mod has requested a pause, which can't succeed until all background threads complete. And no new threads will be run while a pause is being requested. This was the safest approach, considering that a config change could alter/remove context out from under running threads otherwise. -- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/
On 1/20/19 2:14 AM, Howard Chu wrote: > Ah, the 2nd cn=config mod has requested a pause, which can't succeed until all > background threads complete. And no new threads will be run while a pause is being requested. Exactly. > This was the safest approach, considering that a config change could alter/remove context out > from under running threads otherwise. No, the safest approach is to fail the 2nd cn=config mod unless the user says it's OK to block the server.
On 1/20/19 4:48 AM, Hallvard Breien Furuseth wrote: > On 1/20/19 2:14 AM, Howard Chu wrote: >> This was the safest approach, considering that a config change could >> alter/remove context out >> from under running threads otherwise. > > No, the safest approach is to fail the 2nd cn=config mod > unless the user says it's OK to block the server. Or fail the 1st cn=config mod too, if that's easier.
Hallvard Breien Furuseth wrote: > On 1/20/19 4:48 AM, Hallvard Breien Furuseth wrote: >> On 1/20/19 2:14 AM, Howard Chu wrote: >>> This was the safest approach, considering that a config change could alter/remove context out >>> from under running threads otherwise. >> >> No, the safest approach is to fail the 2nd cn=config mod >> unless the user says it's OK to block the server. How would the user say so? > Or fail the 1st cn=config mod too, if that's easier. Huh? That op has already completed. The indexer is started as a background task after the mod op completes. -- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/
On 1/20/19 5:26 AM, Howard Chu wrote: > Hallvard Breien Furuseth wrote: >> On 1/20/19 4:48 AM, Hallvard Breien Furuseth wrote: >>> On 1/20/19 2:14 AM, Howard Chu wrote: >>>> This was the safest approach, considering that a config change could alter/remove context out >>>> from under running threads otherwise. >>> >>> No, the safest approach is to fail the 2nd cn=config mod >>> unless the user says it's OK to block the server. > > How would the user say so? An LDAP control. Or the modify operation could modify an operational attribute which which doesn't exist in LDAP, it just changes the operation's behavior. Or only accept cn=config mods after the connection has added an ephemeral control entry which says how cn=config mods should behave in that connection. Maybe that's the least annoying way, since there won't be any special cases for the use to cope with. Either way, it sounds like there's no right answer for RE24's default behavior, and the only fix there would involve a config option which says which behavior to pick. (It'd be nice to be able to say things like "it's OK to hang if ldapi:// is the only listener", but I suppose then people have a hundred other conditions which would be nice too.) I don't like any of my own answers to this though. I just dislike server hangs more. >> Or fail the 1st cn=config mod too, if that's easier. > > Huh? That op has already completed. The indexer is started as a background task after > the mod op completes. I meant fail when the operation is issued, but never mind. That was silly, and cumbersome for the user to find out when he can safely issue the next operation. Either way, diagnosticMessage "RTFM: back-mdb(5) about olcDbIndex and slapd-config(5) about possible server hangs". And we should document when cn=config mods can be dangerous and how to use it safely without the users noticing. There should in any case be a method to ask the server "is cn=config very busy now, or is another cn=config mod safe?". Like failing the operation if it is not, like I said first.
On 1/19/19 8:04 PM, Howard Chu wrote: > modifications to cn=config require no other threads to be active. Once the background indexer > starts running, its activity will of course block further cn=config modifications until it > completes. Nothing "of course" about it. I don't know why running the background indexer is considered a change to cn=config. The ldapmodify did the change to cn=config and completed. I can guess that it might block cn=config simply because that's the safe way to have both an indexer and dynamic config. Or I can read the code. But "read the code" != "of course". So... back to start, I guess I was suggesting that the indexing task should stop if another config mod comes in, if it can resume afterwards. And the config change need only fail if it might prevent the indexer from resuming. I suppose that gets complicated in a hurry though.
Hallvard Breien Furuseth wrote: > On 1/19/19 8:04 PM, Howard Chu wrote: >> modifications to cn=config require no other threads to be active. Once the background indexer >> starts running, its activity will of course block further cn=config modifications until it >> completes. > > Nothing "of course" about it. I don't know why running > the background indexer is considered a change to cn=config. It is not. But the indexing rules can be changed by further config mods, and we don't want the config to change out from under it. > The ldapmodify did the change to cn=config and completed. > > I can guess that it might block cn=config simply because that's > the safe way to have both an indexer and dynamic config. Or > I can read the code. But "read the code" != "of course". This was all discussed on the -devel mailing list when the feature was first implemented. http://www.openldap.org/lists/openldap-devel/200504/msg00090.html > So... back to start, I guess I was suggesting that the > indexing task should stop if another config mod comes in, > if it can resume afterwards. And the config change need > only fail if it might prevent the indexer from resuming. > I suppose that gets complicated in a hurry though. Stopping and resuming is problematic, especially if other config changes to the DB occur that would invalidate the in-progress indexing. We could arrange to return LDAP_BUSY to config mods if any background indexers are running. Or as I already mentioned, we could try to only do this for relevant config mods, e.g. to the active backend, or to global schema. -- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/
Can cn=config have 2 lock levels? That fixes this ITS (my ldapwhoami hang) without introducing cn=config failures. The outer lock serializes cn=config ops including the indexer. Other ops do not lock it. The inner lock blocks slapd like cn=config does now, and should only be held briefly. Always hold the outer lock when taking the inner lock. Also, the inner lock could be taken and released several times during a config op when feasible. That might improve non-config op latency, for the price of slower config ops. On 1/20/19 4:42 PM, Howard Chu wrote: > Hallvard Breien Furuseth wrote: >> On 1/19/19 8:04 PM, Howard Chu wrote: >>> modifications to cn=config require no other threads to be active. Once the background indexer >>> starts running, its activity will of course block further cn=config modifications until it >>> completes. >> >> Nothing "of course" about it. I don't know why running >> the background indexer is considered a change to cn=config. > > It is not. But the indexing rules can be changed by further config mods, and we > don't want the config to change out from under it. OK. I was just parsing your worlds too literally. Also, I'm trying to think like an admin, not an OpenLDAP insider: >> The ldapmodify did the change to cn=config and completed. >> >> I can guess that it might block cn=config simply because that's >> the safe way to have both an indexer and dynamic config. Or >> I can read the code. But "read the code" != "of course". > > This was all discussed on the -devel mailing list when the feature was first implemented. > http://www.openldap.org/lists/openldap-devel/200504/msg00090.html Thanks. >> So... back to start, I guess I was suggesting that the >> indexing task should stop if another config mod comes in, >> if it can resume afterwards. And the config change need >> only fail if it might prevent the indexer from resuming. >> I suppose that gets complicated in a hurry though. > > Stopping and resuming is problematic, especially if other config changes to the DB occur > that would invalidate the in-progress indexing. OK. > We could arrange to return LDAP_BUSY to config mods if any background indexers are running. > Or as I already mentioned, we could try to only do this for relevant config mods, e.g. to > the active backend, or to global schema. I'm mostly interested in the worst expected case during normal cn=config use - whether hang or error. Limiting the problems beyond that is a bonus. Or maybe not. A cn=config which fails in some obscure cases is flaky. A cn=config which fails more frequently but in predictable ways can be annoying, but is easy to figure out how to operate. I suppose a simple test is how simple the documentation can be: If I want non-config operations to not hang, what do I do and what do I avoid? And same thing if want to write a config script which won't need to cope with expected failures.
I wrote: > Limiting the problems beyond that is a bonus. Or maybe not. > A cn=config which fails in some obscure cases is flaky. (...) Correction: Obscure _situations_, not obscure cases. As in, operation X normally succeds, but can fail if you just did the somewhat-related operation Y.
On 20.01.2019 19:52, Hallvard Breien Furuseth wrote: > Can cn=config have 2 lock levels? That fixes this ITS (my > ldapwhoami hang) without introducing cn=config failures. > > The outer lock serializes cn=config ops including the indexer. > Other ops do not lock it. The inner lock blocks slapd like > cn=config does now, and should only be held briefly. Always > hold the outer lock when taking the inner lock. Correction again: while taking and holding the inner lock. Except non-config ops. > Also, the inner lock could be taken and released several > times during a config op when feasible. That might improve > non-config op latency, for the price of slower config ops.
On Mon, Jan 21, 2019 at 03:23:55PM +0000, h.b.furuseth@usit.uio.no wrote: > On 20.01.2019 19:52, Hallvard Breien Furuseth wrote: > > Can cn=config have 2 lock levels? That fixes this ITS (my > > ldapwhoami hang) without introducing cn=config failures. > > > > The outer lock serializes cn=config ops including the indexer. > > Other ops do not lock it. The inner lock blocks slapd like > > cn=config does now, and should only be held briefly. Always > > hold the outer lock when taking the inner lock. > > Correction again: while taking and holding the inner lock. > Except non-config ops. Except there are no locks as you know being the author of parts of code that deals with what I'm about to outline anyway: Whenever a cn=config op is about to be processed, a pause is requested. That tells worker threads to stop picking up new work and waits until they're all quiet. The indexing task is run by one of these worker threads. Some long-running processes (syncrepl) will regularly check to see if they should pause as well: https://www.openldap.org/devel/gitweb.cgi?p=openldap.git;a=blob;f=servers/slapd/syncrepl.c;hb=HEAD#l1572 Handling things in a similar way might be an option if we could survive anything changing in cn=config while we yield. That might be a big "if" to implement, though. -- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com Packaged, certified, and supported LDAP solutions powered by OpenLDAP
On 1/21/19 7:49 PM, Ondřej Kuzník wrote: > Except there are no locks as you know being the author of parts of > code that deals with what I'm about to outline anyway: > > Whenever a cn=config op is about to be processed, a pause is requested. > That tells worker threads to stop picking up new work and waits until > they're all quiet. The indexing task is run by one of these worker > threads. Duuh, right. I got stuck looking for what's special about the indexing task and couldn't find it:-( I need to make it special. So, let tasks declare their expected speed until finish or between pausechecks. At FAST=1 (default) and SLOW=0. A pause only stops tasks with speed < ltp_pause. In thread_pool_pause(), replace the WANT_PAUSE stage with while (++ltp_pause <= max speed) { wait until no more tasks with speed < ltp_pause; } Then fast tasks should breeze past slow ones when preparing to pause. Until all threads have slow tasks, anyway. To mitigate that, we'd need to predeclare the speed when submitting a task, and limit the number of parallel slow tasks. pool_submit() could stash the rest in a "slow queue" instead of submitting. But I don't want to go there yet.
On Tue, Jan 22, 2019 at 05:49:12PM +0000, h.b.furuseth@usit.uio.no wrote: > On 1/21/19 7:49 PM, Ondřej Kuzník wrote: >> Except there are no locks as you know being the author of parts of >> code that deals with what I'm about to outline anyway: >> >> Whenever a cn=config op is about to be processed, a pause is requested. >> That tells worker threads to stop picking up new work and waits until >> they're all quiet. The indexing task is run by one of these worker >> threads. > > Duuh, right. I got stuck looking for what's special about the > indexing task and couldn't find it:-( I need to make it special. > > So, let tasks declare their expected speed until finish or > between pausechecks. At FAST=1 (default) and SLOW=0. > A pause only stops tasks with speed < ltp_pause. > In thread_pool_pause(), replace the WANT_PAUSE stage with > > while (++ltp_pause <= max speed) { > wait until no more tasks with speed < ltp_pause; > } > > Then fast tasks should breeze past slow ones when preparing > to pause. Until all threads have slow tasks, anyway. > > To mitigate that, we'd need to predeclare the speed when > submitting a task, and limit the number of parallel slow > tasks. pool_submit() could stash the rest in a "slow queue" > instead of submitting. But I don't want to go there yet. The problem is not that it's still scheduled when the pause is requested, scheduled tasks don't prevent a pause (and I think some of them need to be a bit more aware of that fact), but that it's been picked up by a worker thread and is running while cn=config waits for them to pause. What I was suggesting was that it could check every so often whether it might be holding up a pause and join it just like some other tasks do. But that's only a good idea if it's able to handle the changes that happen during the pause (indexes reconfigured again, the database going away, ...) as well as being able to release any locks temporarily. -- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com Packaged, certified, and supported LDAP solutions powered by OpenLDAP
On 22.01.2019 19:55, Ondřej Kuzník wrote: > The problem is not that it's still scheduled when the pause is > requested, scheduled tasks don't prevent a pause (and I think some > of them need to be a bit more aware of that fact), but that it's been > picked up by a worker thread and is running while cn=config waits for > them to pause. _And_ that a pause request is all-or-nothing. This demo code of graduated wait-for-pause works for me. For RE24, it doesn't combine easily with Git master's multiple tpool queues:-( ftp://ftp.openldap.org/incoming/Hallvard-Furuseth-190124.patch.gz But I'm not volunteering to write one for master anytime soon, unless someone has less complicated ideas for that than I do. A real version looks like more work than I expected anyway. > What I was suggesting was that it could check every so often whether it > might be holding up a pause and join it just like some other tasks do. > But that's only a good idea if it's able to handle the changes that > happen during the pause (indexes reconfigured again, the database going > away, ...) as well as being able to release any locks temporarily. Yes. I should have answered that, but I still cannot decide what I think of it. I see no locks to worry about in that code, but the rest does make me nervous. Possibly we can find some times where we can pause safely, but not a general case. For example, in the case of back-mdb we could register scheduled config operations the database itself, in a "(TODO)" LMDB DB. Doesn't help with schema changes, but does help renumbering slapd DBs. -- Hallvard
My bug report's bug.sh script is broken. It needs "mpid=$!" before "sleep 0.5", and "wait $mpid" before "kill $pid". -- Hallvard
The reason I like handling the problem in tpool is that tpools is fairly isolated, no worries about which bconfig parts are safe from each other. But by all means attack the problem in bconfig too:-) I've pushed a better version to <git://git.uio.no/u/hbf/openldap.git> branch "its8958-re24-task-speeds-v2". And I see how to port this to master after all, if anyone wants this included in OpenLDAP. RE24 tpool is simpler when discusissing the basics, though. -- Hallvard
Created attachment 690 [details] Hallvard-Furuseth-2019-01-19.tgz
Created attachment 799 [details] proposed fix
(In reply to Quanah Gibson-Mount from comment #23) > Created attachment 799 [details] > proposed fix In patch #5 + ldap_pvt_thread_pool_setspeed( &connection_pool, ctx, 0 ); Shouldn't the minimum speed be 1, not 0? Since you have +enum { NOT_PAUSED = 0, WANT_PAUSE = LDAP_PVT_THREAD_POOL_SPEED_MAX+1, PAUSED }; A speed of 0 would mean no pause at all, wouldn't it?
(In reply to Hallvard Furuseth from comment #17) > Duuh, right. I got stuck looking for what's special about the > indexing task and couldn't find it:-( I need to make it special. > > So, let tasks declare their expected speed until finish or > between pausechecks. At FAST=1 (default) and SLOW=0. > A pause only stops tasks with speed < ltp_pause. > In thread_pool_pause(), replace the WANT_PAUSE stage with > > while (++ltp_pause <= max speed) { > wait until no more tasks with speed < ltp_pause; > } > > Then fast tasks should breeze past slow ones when preparing > to pause. Until all threads have slow tasks, anyway. I don't understand how this solves anything. If a slow indexing task is currently running, and a fast config mod comes in, it's still the case that the config change could pull the DB out from under the indexer task. So there's nothing safe about letting the fast task progress while the slow task is still running.
(In reply to Howard Chu from comment #25) > (In reply to Hallvard Furuseth from comment #17) > > > Duuh, right. I got stuck looking for what's special about the > > indexing task and couldn't find it:-( I need to make it special. > > > > So, let tasks declare their expected speed until finish or > > between pausechecks. At FAST=1 (default) and SLOW=0. > > A pause only stops tasks with speed < ltp_pause. > > In thread_pool_pause(), replace the WANT_PAUSE stage with > > > > while (++ltp_pause <= max speed) { > > wait until no more tasks with speed < ltp_pause; > > } > > > > Then fast tasks should breeze past slow ones when preparing > > to pause. Until all threads have slow tasks, anyway. > > I don't understand how this solves anything. If a slow indexing > task is currently running, and a fast config mod comes in, it's > still the case that the config change could pull the DB out from > under the indexer task. So there's nothing safe about letting the > fast task progress while the slow task is still running. I don't think we should be changing anything else about how tpool handles pauses. We should just be fixing this specific case of the indexer being a slow task, by implementing checkpointing into the indexer. I.e., when it detects a pause request it should save its current progress and pause itself. If it gets resumed it can pick up where it left off, or if a config change affects it it can abort or or start over. A checkpointing mechanism is needed anyway, for the case of a (clean) shutdown while the indexer is running.
On Tue, Aug 03, 2021 at 12:42:01PM +0000, openldap-its@openldap.org wrote: > I don't think we should be changing anything else about how tpool > handles pauses. We should just be fixing this specific case of the > indexer being a slow task, by implementing checkpointing into the > indexer. I.e., when it detects a pause request it should save its > current progress and pause itself. If it gets resumed it can pick up > where it left off, or if a config change affects it it can abort or > or start over. A checkpointing mechanism is needed anyway, for the > case of a (clean) shutdown while the indexer is running. I'll put a suggestion here that we discussed previously: to support this checkpointing for pauses/shutdowns, the indexer could be writing to a "scratch" database (whatever that means for each backend) along with resume data and move them into place when finished. You mentioned that for liblmdb, this would need support for a database to be renamed.
On 03.08.2021 14:36, openldap-its@openldap.org wrote: > https://bugs.openldap.org/show_bug.cgi?id=8958 > > --- Comment #25 from Howard Chu <hyc@openldap.org> --- > (In reply to Hallvard Furuseth from comment #17) > >> (...) A pause only stops tasks with speed < ltp_pause. >> In thread_pool_pause(), replace the WANT_PAUSE stage with >> >> while (++ltp_pause <= max speed) { >> wait until no more tasks with speed < ltp_pause; >> } >> >> Then fast tasks should breeze past slow ones when preparing >> to pause. Until all threads have slow tasks, anyway. > > I don't understand how this solves anything. If a slow indexing > task is currently running, and a fast config mod comes in, it's > still the case that the config change could pull the DB out from > under the indexer task. So there's nothing safe about letting the > fast task progress while the slow task is still running Fast tasks still wait for *running* slow tasks. And when there is no pause involved, slow tasks get scheduled normally. This is only about scheduling when something wants a pause. setspeed() does CHECK_PAUSE, standing aside for faster tasks. Then, a fast task which wants a pause (cn=config change #2) won't block other fast tasks while a slower task (indexer) is running. So normal tasks will keep getting scheduled, instead of slapd locking up for them. This all depends on there being only a few config changes/slow tasks at any time, since they do occupy a thread.
On 03.08.2021 14:25, openldap-its@openldap.org wrote: > --- Comment #24 from Howard Chu <hyc@openldap.org> --- > (In reply to Quanah Gibson-Mount from comment #23) >> Created attachment 799 [details] >> proposed fix > > In patch #5 > + ldap_pvt_thread_pool_setspeed( &connection_pool, ctx, 0 ); > > Shouldn't the minimum speed be 1, not 0? That's just the API. 0 = "slowest". I didn't want to export details of the tpool implementation, which might get replaced. Could use 0.0 so it looks different, if floating point numbers are OK in libldap. > Since you have > +enum { NOT_PAUSED = 0, WANT_PAUSE = LDAP_PVT_THREAD_POOL_SPEED_MAX+1, PAUSED > };
I wrote wrote: > https://bugs.openldap.org/show_bug.cgi?id=8958 >> In patch #5 >> + ldap_pvt_thread_pool_setspeed( &connection_pool, ctx, 0 ); >> >> Shouldn't the minimum speed be 1, not 0? > > That's just the API. 0 = "slowest". I didn't want to export details > of the tpool implementation, which might get replaced. Could use 0.0 > so it looks different, if floating point numbers are OK in libldap. That is, I have a vague feeling that merely mentioning a floating point number would require libm in some C implementations. Don't remember.
On 03.08.2021 14:42, openldap-its@openldap.org wrote: > https://bugs.openldap.org/show_bug.cgi?id=8958 > > --- Comment #26 from Howard Chu <hyc@openldap.org> --- > I don't think we should be changing anything else about how tpool > handles pauses. We should just be fixing this specific case of the > indexer being a slow task, by implementing checkpointing into the > indexer. I.e., when it detects a pause request it should save its > current progress and pause itself. If it gets resumed it can pick up > where it left off, or if a config change affects it it can abort or > or start over. A checkpointing mechanism is needed anyway, for the > case of a (clean) shutdown while the indexer is running. For fixing the observed problem: Improving the indexer sounds great in any case, go ahead:-) No idea how much work it is. tpool.c was code I knew how to change, so I did. Will it then be as reactive as ordinary tasks, also for large databases? Merely "much faster than now" might be very different from "fast enough to not be a problem". In general: I do think slapd should recognize that some tasks can be notably slower that others. Latency is in part a scheduling issue, and it can hit particularly hard at pauses. Currently tpool does not help with that. Setspeed can help. Except if a run-time config change removes the database/overlay, I'm not up to date about whether that can happen. But that's so for rescheduling the indexer as well. back-sock/back-shell can also be slow, so I expect they have the same issue. If they "fix" that with pool_idle() before reading results, I expect config changes could hose the operation badly. Same with setspeed() in the backend, it'd be too late. Maybe if the database declared itself "slow", then slapd could setspeed() before dispatching the operation to the database. Hallvard
(In reply to Hallvard Furuseth from comment #31) > On 03.08.2021 14:42, openldap-its@openldap.org wrote: > > https://bugs.openldap.org/show_bug.cgi?id=8958 > > > > --- Comment #26 from Howard Chu <hyc@openldap.org> --- > > I don't think we should be changing anything else about how tpool > > handles pauses. We should just be fixing this specific case of the > > indexer being a slow task, by implementing checkpointing into the > > indexer. I.e., when it detects a pause request it should save its > > current progress and pause itself. If it gets resumed it can pick up > > where it left off, or if a config change affects it it can abort or > > or start over. A checkpointing mechanism is needed anyway, for the > > case of a (clean) shutdown while the indexer is running. > > > For fixing the observed problem: > > Improving the indexer sounds great in any case, go ahead:-) > No idea how much work it is. tpool.c was code I knew how > to change, so I did. > > Will it then be as reactive as ordinary tasks, also for > large databases? Merely "much faster than now" might be > very different from "fast enough to not be a problem". I'm working it out now. As for reactiveness, that depends only on how much data we index in one chunk, not on the overall size of the DB. If you still believe there's a potential problem with more than one pause request a time, your patch might still be useful but it will need to be adapted for the multiple queues in 2.5. Nothing is going to be changed for 2.4.
On 04.08.2021 13:26, openldap-its@openldap.org wrote: > https://bugs.openldap.org/show_bug.cgi?id=8958 > > --- Comment #32 from Howard Chu <hyc@openldap.org> ---. > (...) > If you still believe there's a potential problem with more than > one pause request a time, your patch might still be useful but > it will need to be adapted for the multiple queues in 2.5. Nothing > is going to be changed for 2.4. Right. The patch is demo code for 2.4 because that was simplest. I wasn't going to do more work for a feature which might get rejected. Ondřej, I'm guessing the use of a "scratch" database belongs in a separate issue and will get lost if left lingering in this one, since it's a much bigger issue.
Indexer fix in https://git.openldap.org/openldap/openldap/-/merge_requests/372
On my laptop, running against current git master, your test #### bash bug.sh 5000 610d21b6.1d96b64b 0x7f39c4378780 @(#) $OpenLDAP: slapd 2.X (Aug 6 2021 12:48:05) $ hyc@viola:/home/hyc/OD/hobj/servers/slapd 610d21b6.1e6b60b1 0x7f39c4378780 slapd starting 610d21b7.1e5278fb 0x7f39bfd56640 conn=1000 fd=14 ACCEPT from PATH=ldapi (PATH=ldapi) 610d21b7.1e5a824a 0x7f39bf555640 conn=1000 op=0 BIND dn="" method=163 610d21b7.1e5df1da 0x7f39bf555640 conn=1000 op=0 BIND authcid="gidNumber=1000+uidNumber=1000,cn=peercred,cn=external,cn=auth" authzid="gidNumber=1000+uidNumber=1000,cn=peercred,cn=external,cn=auth" 610d21b7.1e5e3c0e 0x7f39bf555640 conn=1000 op=0 BIND dn="cn=admin" mech=EXTERNAL bind_ssf=0 ssf=71 610d21b7.1e5f7957 0x7f39bf555640 conn=1000 op=0 RESULT tag=97 err=0 qtime=0.000070 etime=0.000420 text= modifying entry "olcDatabase={1}mdb,cn=config" 610d21b7.1e69ad03 0x7f39bfd56640 conn=1000 op=1 MOD dn="olcDatabase={1}mdb,cn=config" 610d21b7.1e6a41b2 0x7f39bfd56640 conn=1000 op=1 MOD attr=olcDbIndex 610d21b7.1f03411d 0x7f39bfd56640 conn=1000 op=1 RESULT tag=103 err=0 qtime=0.000033 etime=0.010200 text= modifying entry "cn=config" 610d21b7.1f086a5d 0x7f39bfd56640 conn=1000 op=2 MOD dn="cn=config" 610d21b7.1f089460 0x7f39bfd56640 conn=1000 op=2 MOD attr=olcGentleHUP 610d21c3.080b1010 0x7f39bf555640 mdb_online_index: database ou=people,o=foo: txn_commit failed: MDB_MAP_FULL: Environment mapsize limit reached (-30792) 610d21c3.0812f25b 0x7f39bfd56640 conn=1000 op=2 RESULT tag=103 err=0 qtime=0.000032 etime=11.614863 text= 610d21c3.081378db 0x7f39bf555640 conn=1001 fd=15 ACCEPT from PATH=ldapi (PATH=ldapi) 610d21c3.0814c56b 0x7f39bfd56640 conn=1001 op=0 BIND dn="" method=128 610d21c3.08151f2d 0x7f39bfd56640 conn=1001 op=0 RESULT tag=97 err=0 qtime=0.000010 etime=0.000038 text= 610d21c3.08156590 0x7f39bf555640 conn=1000 op=3 UNBIND 610d21c3.08161fe5 0x7f39bf555640 conn=1000 fd=14 closed 610d21c3.08185c43 0x7f39bfd56640 conn=1001 op=1 EXT oid=1.3.6.1.4.1.4203.1.11.3 610d21c3.0818928e 0x7f39bfd56640 conn=1001 op=1 WHOAMI 610d21c3.0818e838 0x7f39bfd56640 conn=1001 op=1 RESULT oid= err=0 qtime=0.000021 etime=0.000060 text= anonymous 610d21c3.081b5c3e 0x7f39bf555640 conn=1001 op=2 UNBIND 610d21c3.081bbc00 0x7f39bf555640 conn=1001 fd=15 closed real 0m11.149s user 0m0.026s sys 0m0.017s 610d21c3.08211dff 0x7f39c0557640 daemon: shutdown requested and initiated. 610d21c3.08225abc 0x7f39c0557640 slapd shutdown: waiting for 0 operations/tasks to finish 610d21c3.084e3266 0x7f39c4378780 slapd stopped. #### With the patch in !372 the script finishes immediately: #### 610d21e2.3788b515 0x7f7b91a17640 conn=1001 op=2 UNBIND 610d21e2.37892e36 0x7f7b91a17640 conn=1001 fd=14 closed real 0m0.010s user 0m0.007s sys 0m0.004s 610d21e2.378df428 0x7f7b92a19640 daemon: shutdown requested and initiated. 610d21e2.378f3b89 0x7f7b92a19640 slapd shutdown: waiting for 1 operations/tasks to finish 610d21e2.37c2a4bd 0x7f7b9683a780 slapd stopped. ####
On 06.08.2021 13:38, openldap-its@openldap.org wrote: > --- Comment #34 from Howard Chu <hyc@openldap.org> --- > Indexer fix in https://git.openldap.org/openldap/openldap/-/merge_requests/372 I don't think you should change an existing function (pausecheck) to do something completely different. That silently breaks existing code. Renaming it makes sense since the name is confusiong, but use another name for the new functionality. pause_status(), maybe.
(In reply to Hallvard Furuseth from comment #36) > On 06.08.2021 13:38, openldap-its@openldap.org wrote: > > --- Comment #34 from Howard Chu <hyc@openldap.org> --- > > Indexer fix in https://git.openldap.org/openldap/openldap/-/merge_requests/372 > > I don't think you should change an existing function (pausecheck) to do > something completely different. That silently breaks existing code. > Renaming it makes sense since the name is confusiong, but use another > name for the new functionality. pause_status(), maybe. Fair enough. Changed to pausequery().
(In reply to Hallvard Furuseth from comment #36) > On 06.08.2021 13:38, openldap-its@openldap.org wrote: > > --- Comment #34 from Howard Chu <hyc@openldap.org> --- > > Indexer fix in https://git.openldap.org/openldap/openldap/-/merge_requests/372 > > I don't think you should change an existing function (pausecheck) to do > something completely different. That silently breaks existing code. > Renaming it makes sense since the name is confusiong, but use another > name for the new functionality. pause_status(), maybe. Generally I'd expect this type of commentary to be in the MR for historical purposes.
2.5 compat with 2.4 dbs: Commits: • d877251b by Howard Chu at 2021-08-06T22:50:23+01:00 (From ITS#8958) allow 2.5 slapcat to read 2.4 DB
RE25 commit: 2.5 slapcat compat with 2.4 dbs: Commits: • d00efea7 by Howard Chu at 2021-08-06T21:54:44+00:00 (From ITS#8958) allow 2.5 slapcat to read 2.4 DB
Commits: • 5ad6ab35 by Howard Chu at 2021-08-12T18:59:06+00:00 ITS#8958 rename ldap_pvt_thread_pool_pausecheck() • f6a61ab7 by Howard Chu at 2021-08-12T18:59:06+00:00 ITS#8958 back-mdb: checkpoint online indexer
*** Issue 5840 has been marked as a duplicate of this issue. ***
ldap_pvt_thread_pool_pausequery does not exist in RE25, which is why this is a 2.6 or later only fix. (for future reference and why the fix was reverted from RE25)