
Conversation

@VSadov
Member

@VSadov VSadov commented Nov 21, 2025

When inserting work items into thread pool queues, we must always guarantee that, for every work item, some worker will at some point in the future notice its presence and execute it.

There was an attempt to relax this requirement in #100506.
Sadly, it leads to occasional deadlocks where items are present in the work queues and no workers are coming to pick them up.

The same change was made in all three thread pools: IO completion, Sockets, and the general-purpose ThreadPool. The fix is applied to all three.

We have seen reports of deadlocks when running on net9 or later releases.

The fix will need to be ported to net10 and net9, so this PR tries to restore just the part that changed the enqueuer/worker handshake algorithm.
More has been piled onto the thread pool since that change, so making the minimal fix without disturbing the rest is somewhat tricky.

Fixes: #121608 (definitely; I have verified with the repro)
Fixes: #119043 (likely; I do not have a repro to try, but the symptoms look like the same issue)
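
For reference, the shape of the restored enqueuer/worker handshake is roughly the following. This is a minimal sketch under simplified assumptions - the type and member names (HandshakeSketch, _requestWorker) are illustrative stand-ins, not the actual ThreadPoolWorkQueue code:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public enum QueueProcessingStage { NotScheduled, Scheduled }

public sealed class HandshakeSketch
{
    private readonly ConcurrentQueue<Action> _items = new();
    private readonly Action _requestWorker;   // asks the pool for one worker
    private int _stage = (int)QueueProcessingStage.NotScheduled;

    public HandshakeSketch(Action requestWorker) => _requestWorker = requestWorker;

    public void Enqueue(Action item)
    {
        _items.Enqueue(item);

        // Publish the item first, then flip the flag. Whoever moves it from
        // NotScheduled to Scheduled must request a worker, so every enqueued
        // item has some worker that will come for it.
        if (Interlocked.Exchange(ref _stage, (int)QueueProcessingStage.Scheduled)
            == (int)QueueProcessingStage.NotScheduled)
        {
            _requestWorker();
        }
    }

    public void Worker()
    {
        // Reset to NotScheduled *before* checking the queue (full fence), so an
        // enqueuer that adds an item after this point observes NotScheduled and
        // requests another worker instead of relying on this one.
        Interlocked.Exchange(ref _stage, (int)QueueProcessingStage.NotScheduled);

        if (!_items.TryDequeue(out Action? item))
            return; // queue looked empty; any later Enqueue re-schedules a worker

        // Work was found: flag Scheduled again and request one more worker
        // before executing, so remaining items are never left unattended.
        if (Interlocked.Exchange(ref _stage, (int)QueueProcessingStage.Scheduled)
            == (int)QueueProcessingStage.NotScheduled)
        {
            _requestWorker();
        }

        item();
    }
}
```

The key point is that the worker resets the flag before it looks at the queue, so an enqueue that races with a worker winding down still results in a worker being requested for the new item.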

@VSadov
Member Author

VSadov commented Nov 21, 2025

I think this is ready for review.

@VSadov VSadov marked this pull request as ready for review November 22, 2025 02:30
Copilot AI review requested due to automatic review settings November 22, 2025 02:30
@VSadov VSadov requested a review from stephentoub November 22, 2025 02:33
Contributor

Copilot AI left a comment


Pull request overview

This PR addresses critical reliability issues in three thread pool implementations by reverting problematic changes from #100506 that led to deadlocks in .NET 9. The fix restores a simpler and safer enqueuer/worker handshake protocol that guarantees work items will always have a worker thread available to process them.

Key Changes:

  • Simplified the QueueProcessingStage enum by removing the Determining state, leaving only NotScheduled and Scheduled
  • Changed the worker thread protocol to reset the processing stage to NotScheduled before checking for work items, preventing a race condition window (illustrated below)
  • Removed complex retry/dequeue logic and the _nextWorkItemToProcess optimization in favor of always requesting an additional worker when processing an item
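
The race window mentioned in the second bullet can be pictured with a simplified interleaving; this is a hedged illustration of the failure mode, not the exact pre-fix code:

```csharp
// Worker thread                           Enqueueing thread
// -------------                           -----------------
// TryDequeue() -> empty
//                                          queue.Enqueue(item)
//                                          stage is still Scheduled,
//                                          so no worker is requested
// stage = NotScheduled
// return (nothing to do)
//
// Result: an item sits in the queue, the stage says NotScheduled, and no
// worker was requested for it. It only runs when some later, unrelated
// enqueue happens to schedule a worker again.
```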

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

File Description
src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPoolWorkQueue.cs Simplified general-purpose ThreadPool enqueuer/worker handshake by removing the Determining state, _nextWorkItemToProcess field, and complex retry logic; streamlined Dispatch() to always request a worker after dequeuing an item
src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEngine.Unix.cs Applied same handshake simplification to Unix socket async engine; removed UpdateEventQueueProcessingStage() method; simplified Execute() to use consistent pattern of resetting state before checking queue

VSadov and others added 2 commits November 21, 2025 18:48
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@VSadov
Member Author

VSadov commented Nov 23, 2025

/benchmark plaintext,json,fortunes aspnet-citrine-lin runtime,libs

@pr-benchmarks

pr-benchmarks bot commented Nov 23, 2025

Benchmark started for plaintext, json, fortunes on aspnet-citrine-lin with runtime, libs. Logs: link

@pr-benchmarks

pr-benchmarks bot commented Nov 23, 2025

An error occurred, please check the logs

@mangod9 mangod9 requested a review from eduardo-vp November 24, 2025 16:19
@mangod9
Member

mangod9 commented Nov 24, 2025

Is this just a revert of the earlier change? It looks like that change had some perf improvements, so we might notice some regressions due to the revert.

@VSadov
Member Author

VSadov commented Nov 24, 2025

Is this just a revert of the earlier change?

Yes, it is a partial revert.

It looks like that change had some perf improvements, so we might notice some regressions due to the revert.

There were several changes in the original PR, and the result was some improvements and also a couple of regressions. This PR reverts only the part that is important for correctness; I'm not sure how much that part was contributing.
It is possible that we will see some regressions.

@VSadov
Member Author

VSadov commented Nov 24, 2025

There are ways to make the thread pool less eager about introducing workers, but that would need to happen in the part that actually controls the introduction of threads - around the LIFO semaphore and the logic that drives it.
This flag is just a handshake by which enqueuers can signal the presence of new items and that some thread needs to come and pick them up.

In the airport parking lot shuttle analogy: the parking lot decides how many shuttles it keeps in rotation, but this flag is just calling them to say that a traveler has arrived and needs to be picked up - somehow, eventually...
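
A hedged sketch of that separation, using SemaphoreSlim as an illustrative stand-in for the runtime's LIFO semaphore (the names here are not the actual runtime types):

```csharp
using System;
using System.Threading;

public sealed class WorkerPoolSketch
{
    private readonly SemaphoreSlim _semaphore = new(0);   // "call a shuttle"

    // Called by the enqueuer-side handshake: "a traveler has arrived".
    public void RequestWorker() => _semaphore.Release();

    // Each worker thread parks here. The pool separately decides how many such
    // threads ("shuttles in rotation") exist at all; any throttling of thread
    // injection belongs in that logic, not in the handshake flag.
    public void WorkerLoop(Action dispatchOnce)
    {
        while (true)
        {
            _semaphore.Wait();
            dispatchOnce();   // e.g. the worker-side handshake sketched earlier
        }
    }
}
```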

@eduardo-vp
Member

eduardo-vp commented Nov 24, 2025

The changes look good to me. Actually, the current logic to handle thread requests is quite complicated, so it was hard to be 100% sure that it worked correctly. However, since the current thread request handling was introduced to deal with some issues around using more CPU in certain scenarios (which we may want to address after this PR is merged), should we add a test similar to the one in #121608 so that it's more likely to detect if work items are not getting picked up?

@VSadov
Member Author

VSadov commented Nov 24, 2025

should we add a test similar to the one in #121608 so that it's more likely to detect if work items are not getting picked up?

The repro involves two processes: a client and a server. They may run for quite a while before hanging - it could be minutes. The time-to-hang appears to depend on the CPU manufacturer/model or the number of cores. It is a good real-world-like sample for running locally as a stress test, but it is hardly useful as a CI test.

I think we have some existing tests that look for thread pool items not being picked up, but since this issue requires rare races, it could be that the tests cannot detect it on the kind of hardware the lab runs them on (or catch it only very rarely).

Maybe we should think about some kind of "stress suite" for the thread pool - at least keep a collection of apps known to have had stress issues in the past, like this example.
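
As a rough idea of what such a check could look like, here is a hypothetical xunit-style test (not an existing test in the repo) asserting that every queued item eventually runs even when nothing else is enqueued afterwards:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Xunit;

public class WorkItemPickupTests
{
    [Fact]
    public async Task QueuedItemsAreEventuallyExecuted()
    {
        for (int iteration = 0; iteration < 10_000; iteration++)
        {
            var tcs = new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously);
            ThreadPool.UnsafeQueueUserWorkItem(static s => ((TaskCompletionSource)s!).SetResult(), tcs);

            // If the enqueuer/worker handshake loses this item, no other work
            // will come along to pick it up and the wait below times out.
            Task completed = await Task.WhenAny(tcs.Task, Task.Delay(TimeSpan.FromSeconds(30)));
            Assert.True(completed == tcs.Task, $"Work item from iteration {iteration} was never executed.");
        }
    }
}
```

A sequential loop like this would not reliably hit the rare race described above; it mostly documents the invariant, which is why a longer-running stress app is a better fit for reproducing the actual hang.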

@VSadov
Member Author

VSadov commented Nov 24, 2025

I was curious about what the perf effect here could be and tried using the /benchmark command, but somehow it did not work. Perhaps I used it incorrectly. I can't figure out much from the logs either.

It is not blocking, but it seems like the right tool for this kind of query.

@VSadov
Member Author

VSadov commented Nov 24, 2025

I'll try the benchmark one more time. Maybe it was some transient infra issue.
I am mostly just curious whether there is any impact.

@VSadov
Member Author

VSadov commented Nov 24, 2025

/benchmark plaintext,json,fortunes aspnet-citrine-lin runtime,libs

@pr-benchmarks

pr-benchmarks bot commented Nov 25, 2025

Benchmark started for plaintext, json, fortunes on aspnet-citrine-lin with runtime, libs. Logs: link

@dotnet dotnet deleted a comment from pr-benchmarks bot Nov 25, 2025
@dotnet dotnet deleted a comment from pr-benchmarks bot Nov 25, 2025
@dotnet dotnet deleted a comment from pr-benchmarks bot Nov 25, 2025
@dotnet dotnet deleted a comment from pr-benchmarks bot Nov 25, 2025
@VSadov
Member Author

VSadov commented Dec 2, 2025

I think I've addressed all the concerns/questions. Let me know if I missed something.

@VSadov
Member Author

VSadov commented Dec 2, 2025

Thanks!!

@VSadov VSadov merged commit f204e02 into dotnet:main Dec 2, 2025
143 checks passed
@VSadov VSadov deleted the tpFix11 branch December 2, 2025 22:56
@snakefoot
Contributor

Any plans for backporting to .NET 9 / .NET 10? Or are you waiting for performance reports like #122186?

@VSadov
Member Author

VSadov commented Dec 9, 2025

Any plans for backporting to .NET 9 / .NET 10? Or are you waiting for performance reports like #122186?

Yes, it will be ported to net10 and net9. I'm just letting the fix run for a few days to make sure it does not cause unintended effects.

agocke added a commit that referenced this pull request Dec 10, 2025
Fixes: #121608
Backport of #121887

The ThreadPool in rare cases allows a scenario where an enqueued work item
is not guaranteed to be executed unless more work items are enqueued. In
some scenarios execution of a particular work item may be necessary
before more work is enqueued, thus leading to a deadlock.
This is a subtle regression introduced by a change in the enqueuer/worker
handshake algorithm.

The same pattern is used in two other ThreadPool-like internal features in
addition to the ThreadPool.

## Customer Impact

- [ ] Customer reported
- [x] Found internally

## Regression

- [x] Yes
- [ ] No

## Testing

Standard test pass for deterministic regressions.
A targeted stress application that demonstrates the issue
(without the fix it hangs within 1-2 minutes; with the fix it runs for 20+ minutes
on the same system).

## Risk
Low. This is a revert to the preexisting algorithm for the
enqueuer/worker handshake (in all three places).

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Andy Gocke <angocke@microsoft.com>
VSadov added a commit that referenced this pull request Dec 12, 2025
…122362)

Fixes: #121608
Backport of #121887

VSadov added a commit that referenced this pull request Jan 6, 2026
This is a follow-up on recommendations from the Scalability Experiments done
some time ago.

The Scalability Experiments resulted in many suggestions. In this part
we look at the overheads of submitting and executing a work item on the
threadpool from the thread scheduling point of view. In particular,
this PR tries to minimize changes to the workqueue to keep the change scoped;
the workqueue-related recommendations will be addressed separately.
The threadpool parts are very interconnected though, and sometimes
removing one bottleneck causes another one to show up, so some
workqueue changes had to be made, just to avoid regressions.

There are also a few "low hanging fruit" fixes for per-work-item
overheads, like unnecessary fences or too frequent modifications of
shared state.
Hopefully this will negate some of the regressions from
#121887 (as was reported in
#122186).


In this change:
- Fewer operations per work item where possible, such as fewer/weaker fences,
reporting the heartbeat once per dispatch quantum instead of once per work item,
etc. (see the sketch after this list).

- Avoid spurious wakes of worker threads (except, unavoidably, when the
thread goal is changed - by HillClimbing and such).
Only one thread is requested at a time, and requesting another thread is
conditioned on evidence of work present in the queue (basically the
minimum required for correctness).
As a result, a thread that becomes active typically finds work;
in particular, this avoids a cascade of spurious wakes when the pool is
running out of work items.

- Stop tracking the spinner count in the LIFO semaphore.
We could keep track of spinners, but the informational value of knowing such an
extremely transient count is close to zero, so we should not.

- No Sleep in the LIFO semaphore.
Using spin-Sleep is questionable in a synchronization primitive that can
block and ask the OS to wake a thread deterministically.

- Shorten spinning in the LIFO semaphore to a more affordable value.
Since the LIFO semaphore can perform a blocking wait until the condition it
wants to see happens, once spinning gets into the range of wait/wake
latency it makes no sense to spin for much longer.
It is also not uncommon that work is introduced by non-pool threads,
so pool threads may have to block just to allow more work to
be scheduled.
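
As an illustration of the first bullet, here is a hedged sketch of batching per-work-item bookkeeping into one shared update per dispatch quantum; the names (s_completedWorkItemCount, DispatchQuantum) are hypothetical, not the runtime's actual members:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class BatchedCountingSketch
{
    private static long s_completedWorkItemCount;   // shared, read only by heuristics

    public static void DispatchQuantum(Queue<Action> localItems)
    {
        long completedInQuantum = 0;

        while (localItems.Count > 0)
        {
            localItems.Dequeue()();     // run one work item
            completedInQuantum++;       // plain local increment, no fence
        }

        // One shared-state update (and one full fence) for the whole batch.
        Interlocked.Add(ref s_completedWorkItemCount, completedInQuantum);
    }
}
```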

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@github-actions github-actions bot locked and limited conversation to collaborators Jan 8, 2026