Skip to content

WorkerThreadPool: Revamp interaction with ScriptServer#96959

Merged
clayjohn merged 4 commits into
godotengine:masterfrom
RandomShaper:revamp_languages_exit
Sep 16, 2024
Merged

WorkerThreadPool: Revamp interaction with ScriptServer#96959
clayjohn merged 4 commits into
godotengine:masterfrom
RandomShaper:revamp_languages_exit

Conversation

@RandomShaper

@RandomShaper RandomShaper commented Sep 13, 2024

Copy link
Copy Markdown
Member

First, this reverts 2d1dd41 from #96760. The approach there is not enough, given that with separate thread rendering, there's a cyclic dependency chain on shutdown:

  • WorkerThreadPool must finish before ScriptServer.
  • RenderingServer must finish before WorkerThreadPool.
  • ScriptServer must finish before RenderingServer.

A way of breaking the cycle, which this PR implements, is splitting WorkerThreadPool regular lifetime into two three🆕 "runlevels":

  • One that works as usual.
  • One that waits until all the threads are idle, with no more tasks queued, plus adding more tasks blocked (not rejected, just they have to wait). 🆕
  • One that also works normally, but where languages have already been terminated.

I've put the pure refactor into its own commit, which would help a lot for bisect and also for easier reviewing.

It should still fix #95809 (CC @CedNaru).

Fixes #96931.
Fixes #96978. 🆕
Fixes #97073. 🆕🆕


Cherry-picking note: We don't have MutexLock::temp_unlock()/temp_relock() in 4.3. Therefore, the most recent commit in this PR needs to have the following patch applied:

diff --git a/core/object/worker_thread_pool.cpp b/core/object/worker_thread_pool.cpp
index 37a605cb10..2f0e3d4c2a 100644
+++ b/core/object/worker_thread_pool.cpp
@@ -233,11 +233,11 @@ void WorkerThreadPool::_post_tasks(Task **p_tasks, uint32_t p_count, bool p_high
        // in custom builds.
        bool process_on_calling_thread = threads.size() == 0;
        if (process_on_calling_thread) {
-               p_lock.temp_unlock();
+               task_mutex.unlock();
                for (uint32_t i = 0; i < p_count; i++) {
                        _process_task(p_tasks[i]);
                }
-               p_lock.temp_relock();
+               task_mutex.lock();
                return;
        }

@@ -576,9 +576,9 @@ bool WorkerThreadPool::_handle_runlevel(ThreadData *p_thread_data, MutexLock<Bin
                } break;
                case RUNLEVEL_EXIT_LANGUAGES: {
                        if (!p_thread_data->exited_languages) {
-                               p_lock.temp_unlock();
+                               task_mutex.unlock();
                                ScriptServer::thread_exit();
-                               p_lock.temp_relock();
+                               task_mutex.lock();
                                p_thread_data->exited_languages = true;
                                runlevel_data.exit_languages.num_exited_threads++;
                                control_cond_var.notify_all();

@RandomShaper RandomShaper added bug topic:core cherrypick:4.3 Considered for cherry-picking into a future 4.3.x release labels Sep 13, 2024
@RandomShaper RandomShaper added this to the 4.4 milestone Sep 13, 2024
@RandomShaper RandomShaper requested a review from a team as a code owner September 13, 2024 12:57
@CedNaru

CedNaru commented Sep 13, 2024

Copy link
Copy Markdown
Contributor

Ho so it's not a full revert of the previous PR. We still have the bookkeeping done by the ScriptServer.

My bad, I thought I tested with the multithreaded renderer, but it seems I messed up (Doesn't change that I still tested the branch against 4.3-stable and confirmed the difference in behaviour).

Otherwise, it seems to be an improvement again. Now the WorkerThreadPool calls thread_exit() directly instead of letting the Thread do it itself, just like with thread_enter().

So if I understand, the order of execution is:

  • Godot starts

  • WorkThreadPool is initialized before all servers (including ScriptServer and so no thread_enter() possible)

  • Servers are started, and their loop added as a task to the pool if multithreaded.

  • ScriptServer and its languages are initialized.

  • Whenever a new task is started by the pool or that a long-running one yields, thread_enter() is called if not done yet.

  • Godot is closing and cleanup happens.

  • WorkThreadPool calls exit_languages_threads(), which change the "mode" of the pool. From now, the logic is inverted and any task finishing or yielding will call thread_exit().

  • ScriptServer terminates.

  • The rest of the servers terminate too.

  • WorkThreadPool is deleted last.

Now my doubt: What happens if tasks are still queued when we switch to RUNLEVEL_EXIT_LANGUAGES mode. Won't it mean that those tasks will run after thread_exit() was called. The ScriptServer (and its languages) is still up at this point, it's possible the remaining tasks execute code from another language during that short window.

@clayjohn

Copy link
Copy Markdown
Member

Testing locally I can confirm that this fixes #96931

@RandomShaper

Copy link
Copy Markdown
Member Author

@CedNaru, very good assessment. To address the issue about outstanding scripting tasks, I've added a pre-langs-exit runlevel that blocks the addition of new tasks and waits until the queues are empty.

I'm not quite happy with the added complexity even before the last push. For all the reviewers: the new push performs some additional refactoring so the new logic is more self-contained and can be more easily replaced or further unified in the future. I have considered many approaches and this is the only one that would work.

@RandomShaper RandomShaper force-pushed the revamp_languages_exit branch 6 times, most recently from 16d431b to f96ab59 Compare September 16, 2024 12:02
@CedNaru

CedNaru commented Sep 16, 2024

Copy link
Copy Markdown
Contributor

Indeed, that quite a bit of complexity added to handle the ScriptServer within the pool. If I may ask, wouldn't that be possible to invert the logic and make that the ScriptServer, the one with multiple "run levels" instead of the WorkerThreadPool ?

The core of the issue is that threads are started before the ScriptServer, and terminated after the ScriptServer. There are probably a lot of reasons why the ScriptServer has to be initialized last and ended first (I can already think of a few myself), Wouldn't a two-stage (before and after servers) initialization and termination of the ScriptServer do the trick ?

Sadly, I don't see it being possible without breaking changes to Language (and would probably force languages to register at the core phase at least) but asking anyway.

- The main thread function and the collaborative wait functions have a much more similar structure than earlier, which yields (pun intended) better maintainability.
- Also, there are not assertions anymore about the reason for ending a wait being valid, because spurious awakes can happen and so the assert would fail without that indicating an issue.
@RandomShaper

Copy link
Copy Markdown
Member Author

Sounds intriguing, but I'm not sure it'd be worth the hassle at this point. I've simplified a bit the code so the end result is not that daunting and, provided it does the trick and maintainability is good, I'd deem this one worth mergeable. Let's see if CI succeeds now (there was a missing initializer causing logic errors).

@adamscott

adamscott commented Sep 16, 2024

Copy link
Copy Markdown
Member

Tested out a single-thread export with this PR and it fixes the issues that currently occur on master.

@RandomShaper You could add this PR as fixing #97073.

@clayjohn clayjohn left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested locally and confirmed that it is working for the MT servers

@clayjohn

Copy link
Copy Markdown
Member

I'm going to go ahead and merge this as we accidentally merged #95915 first which will introduce the deadlock to many more projects.

@clayjohn clayjohn merged commit 48403b5 into godotengine:master Sep 16, 2024
@clayjohn

Copy link
Copy Markdown
Member

Thank you for the quick turnaround!

@RandomShaper RandomShaper deleted the revamp_languages_exit branch September 17, 2024 07:07
@RandomShaper

Copy link
Copy Markdown
Member Author

Added a note about cherry-picking for 4.3 to the main description.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug cherrypick:4.3 Considered for cherry-picking into a future 4.3.x release topic:core

Projects

None yet

4 participants