Skip to content

Fix TraceEngine file contention deadlock in multithreaded mode#13446

Merged
JanProvaznik merged 7 commits intodotnet:mainfrom
JanProvaznik:fix/mt-trace-deadlock
Apr 1, 2026
Merged

Fix TraceEngine file contention deadlock in multithreaded mode#13446
JanProvaznik merged 7 commits intodotnet:mainfrom
JanProvaznik:fix/mt-trace-deadlock

Conversation

@JanProvaznik
Copy link
Copy Markdown
Member

@JanProvaznik JanProvaznik commented Mar 24, 2026

In multithreaded (-mt) mode, multiple in-proc BuildRequestEngine instances run in the same process. Each engine's TraceEngine() method wrote to the same EngineTrace_{PID}.txt file using lock(this), which only serialized within that single instance. When two engines traced concurrently, the second FileStream open (with FileShare.Read) threw IOException, which propagated through QueueAction's error handler and fatally crashed the ActionBlock work queue. All subsequent Post() calls were silently rejected, causing request completions to be dropped and the scheduler to deadlock.

Root Cause

In -mt mode (same process, shared PID):

  Engine A: lock(this_A) -> opens EngineTrace_{PID}.txt  OK
  Engine B: lock(this_B) -> opens EngineTrace_{PID}.txt  IOException
                                                         |
                                            QueueAction catch -> throw
                                                         |
                                            ActionBlock faulted -> deadlock

lock(this) only serializes within ONE engine instance. Cross-instance contention on the shared file was unprotected.

Fix

  • Replace lock(this) with a static lock (s_traceLock) so all engine instances in the same process serialize trace writes to a single file. This keeps one unified trace file (better debugging UX than per-node files).
  • Catch non-critical exceptions in TraceEngine() using catch (Exception e) when (!ExceptionHandling.IsCriticalException(e)), matching the defensive pattern already used by Scheduler.TraceScheduler (Scheduler.cs:2683). Trace file failures must never crash the build engine.

Testing

Added regression test MultiThreadedBuild_WithDebugSchedulerTracing_DoesNotDeadlock:

  • Enables MSBUILDDEBUGSCHEDULER=1 during a multithreaded build with 4 parallel child projects
  • Uses [Fact(Timeout = 30_000)] so a deadlock regression fails fast instead of hanging CI
  • Verified on both net10.0 and net472

Red-green validated locally: without the fix, the test aborts with IOException: The process cannot access the file 'EngineTrace_{PID}.txt' because it is being used by another process followed by InternalErrorException: Failure during engine shutdown. With the fix, the test passes in ~400ms.

JanProvaznik and others added 2 commits March 24, 2026 13:26
In multithreaded (-mt) mode, multiple in-proc node engines run in the
same process. Each engine's TraceEngine() method wrote to the same
EngineTrace_{PID}.txt file with FileShare.Read. When two engines traced
concurrently, the second got an IOException, which propagated through
QueueAction's error handler and fatally crashed the ActionBlock work
queue. All subsequent Post() calls were silently rejected, causing
request completions to be dropped and the scheduler to deadlock.

Fix:
- Use per-node trace files: EngineTrace_{PID}_{NodeId}.txt
- Catch IOException in TraceEngine so trace failures can never crash
  the build engine

Added regression test that enables MSBUILDDEBUGSCHEDULER=1 during a
multithreaded build with parallel child projects.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
In -mt mode, multiple in-proc BuildRequestEngine instances share one process.
Each engine's TraceEngine() used lock(this), which only serialized within that
instance. Concurrent FileStream opens to EngineTrace_{PID}.txt with
FileShare.Read caused IOException, which fatally crashed the ActionBlock work
queue and caused a scheduler deadlock.

Fix:
- Replace lock(this) with a static lock (s_traceLock) so all engines in the
  same process serialize trace writes to a single file.
- Catch non-critical exceptions in TraceEngine, matching the defensive pattern
  already used by Scheduler.TraceScheduler.
- Add regression test with Timeout=30s to fail fast on deadlock.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@JanProvaznik JanProvaznik marked this pull request as ready for review March 31, 2026 08:42
Copilot AI review requested due to automatic review settings March 31, 2026 08:42
@JanProvaznik JanProvaznik self-assigned this Mar 31, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a multithreaded (-mt) in-proc deadlock path caused by concurrent TraceEngine() writes to a shared per-process trace file, which could fault the ActionBlock work queue and stall the scheduler.

Changes:

  • Serialize BuildRequestEngine.TraceEngine() file writes across all in-proc engine instances via a static lock.
  • Make TraceEngine() resilient by swallowing non-critical exceptions during trace emission.
  • Add a regression unit test that runs an -mt build with debug scheduler tracing enabled to ensure the build completes.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
src/Build/BackEnd/Components/BuildRequestEngine/BuildRequestEngine.cs Uses a process-wide static lock for engine trace file writes and prevents trace IO failures from crashing the engine.
src/Build.UnitTests/BackEnd/BuildManager_Tests.cs Adds a regression test targeting the previous -mt trace-file contention/deadlock scenario.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
JanProvaznik and others added 3 commits March 31, 2026 12:33
FrameworkDebugUtils.SetDebugPath() is parameterless; it reads from the
MSBUILDDEBUGPATH environment variable which is already set on the
previous line. The incorrect SetDebugPath(debugPath) call caused CS1501
across all CI legs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When a non-critical exception occurs writing the engine trace file,
log the failure message and stack trace at Low importance through
_nodeLoggingContext (if available) so it appears in diagnostic-level
build output.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
xUnit only supports Timeout on async tests. Wrap the test body in
Task.Run and return async Task. Use fully-qualified Task type to avoid
ambiguity with Microsoft.Build.Utilities.Task.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…gine.cs

Co-authored-by: AR-May <67507805+AR-May@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants