fix: resolve parallel dotnet publish race in Aspire deploys#8195
Conversation
📋 Prioritization NoteThanks for the contribution! The linked issue isn't in the current milestone yet. |
There was a problem hiding this comment.
Pull request overview
Fixes #8177 by isolating each Aspire service's dotnet publish intermediate output via --artifacts-path <per-service-temp-dir>, so concurrent publishes can no longer race on the shared obj/ of common <ProjectReference> projects. The previous "first-wins" build-gate that produced graph-level deploy→deploy edges is replaced by a runtime mutex carried in context, used only as a fallback when the per-service temp dir cannot be created. Tests are updated to reflect that Aspire deploys now run fully parallel at the graph level.
Changes:
- Add
--artifacts-pathplumbing todotnet publishvia context (ContextWithArtifactsPath/ArtifactsPathFromContext+appendArtifactsPath) and a newbuild_gate.gocarrying a fallback mutex. - Rewire
dotnetContainerAppTarget.Deployto create a per-service temp artifacts dir (cleaned up viadefer os.RemoveAll) and to lock the build-gate mutex only aroundprepareContainerImageif temp-dir creation fails. - Replace the build-gate edges in
addServiceStepsToGraphwith a per-key*sync.Mutexinjected into each gated deploy step's context, and update tests to assert no inter-deploy graph edges.
Show a summary per file
| File | Description |
|---|---|
| cli/azd/pkg/tools/dotnet/dotnet.go | Adds appendArtifactsPath plus context get/set helpers, and threads them into BuildContainerLocal and PublishContainer. |
| cli/azd/pkg/project/service_target_dotnet_containerapp.go | Creates per-service temp artifacts dir, injects via ctx for image prep; falls back to acquiring the build-gate mutex only around prepareContainerImage on mkdir failure. |
| cli/azd/pkg/project/build_gate.go | New file: context get/set helpers for the fallback build-gate mutex. |
| cli/azd/internal/cmd/service_graph.go | Switches gate semantics from graph-edge "first-wins" to a per-key runtime mutex injected into each gated deploy step's ctx. |
| cli/azd/internal/cmd/service_graph_test.go | Adds a topology test asserting Aspire deploy steps share no graph-level deploy→deploy edges. |
| cli/azd/internal/cmd/deploy_graph_test.go | Renames/rewrites Aspire ordering tests to assert no inter-deploy edges; removes the multi-key generic gate test and drops buildGateKey from buildDeployGraph. |
| cli/azd/internal/cmd/aspire_gate.go | Updates doc comment to describe runtime-mutex semantics rather than first-wins serialization. |
| cli/azd/internal/cmd/deploy.go | Updates deployServicesGraph doc to describe runtime mutex over Bridge step, not graph serialization. |
Copilot's findings
Comments suppressed due to low confidence (1)
cli/azd/pkg/project/service_target_dotnet_containerapp.go:180
BuildGateFromContext(ctx) != nilis being used here as a proxy for "this is an Aspire-gated service that needs per-service artifacts isolation." That coupling is not obvious from either side: thedotnetpackage has no notion of build-gate-ness, and theprojectpackage'sBuildGateFromContextdoc doesn't say its presence is the trigger for choosing the artifacts-path strategy. If a future caller injects a build gate for a different reason, or removes the gate but still wants race-safe parallel publishes (e.g., a non-Aspire .NET multi-service deploy), this branch will silently misbehave: no temp dir is created, no--artifacts-pathis passed, and parallel publishes will race again. Consider either (a) always creating the per-service artifacts dir fordotnetContainerAppTarget.Deployregardless of the gate, or (b) gating on something more semantically explicit (e.g., a dedicated flag in ctx or on the service config) and keepingBuildGateFromContextstrictly as the fallback-mutex signal.
imageCtx := ctx
var buildGateMu *sync.Mutex // non-nil only in the mutex fallback path
if BuildGateFromContext(ctx) != nil {
// Create a per-service temp directory for isolated intermediate outputs.
artifactsDir, mkdirErr := os.MkdirTemp("", "azd-artifacts-"+serviceConfig.Name+"-")
if mkdirErr == nil {
defer os.RemoveAll(artifactsDir)
imageCtx = dotnet.ContextWithArtifactsPath(ctx, artifactsDir)
} else {
// Fallback: use the mutex to serialize image preparation only.
// We lock/unlock explicitly (not defer) so the Azure deployment
// portion that follows runs in parallel.
log.Printf("warning: failed to create artifacts temp dir for %s: %v; falling back to serial build",
serviceConfig.Name, mkdirErr)
buildGateMu = BuildGateFromContext(ctx)
}
}
- Files reviewed: 8/8 changed files
- Comments generated: 4
d127406 to
56bb061
Compare
56bb061 to
662bef9
Compare
There was a problem hiding this comment.
Copilot's findings
Comments suppressed due to low confidence (3)
cli/azd/pkg/project/service_target_dotnet_containerapp.go:173
- The comment claims the temp-dir prefix is kept short to avoid MAX_PATH issues on Windows, but
safeNameis the full service name with no length truncation. Aspire service names can be long (the issue itself referencesmyapp-apps-migrationconsole, etc.). Combined withos.MkdirTemp's random suffix and MSBuild's deepobj/<config>/<tfm>/<rid>/publish/...nesting underneath, paths can still approach Windows's 260-char MAX_PATH limit and re-introduce the very failure mode this PR is fixing. Consider truncatingsafeName(e.g. to ~16 chars or a short hash) or updating the comment to acknowledge that no truncation actually happens.
// Create a per-service temp directory for isolated intermediate outputs.
// Sanitize the service name to safe filesystem characters and keep the
// prefix short to avoid MAX_PATH issues on Windows when MSBuild nests
// obj/<config>/<tfm>/... underneath.
safeName := sanitizeTempDirName(serviceConfig.Name)
artifactsDir, mkdirErr := os.MkdirTemp("", "azd-"+safeName+"-")
cli/azd/pkg/project/service_target_dotnet_containerapp.go:189
- The temp artifacts directory is created unconditionally whenever a build gate is present, but
prepareContainerImagehas three branches (dockerfile.v0, container.v0 pre-built image reference, and project.v0). Only the project.v0 branch actually invokesdotnet publish; dockerfile- and pre-built-image services in the same Aspire gate group will still get a temp dir created and deleted with no benefit. Consider gating the temp-dir creation onserviceConfig.Language != ServiceLanguageDocker && serviceConfig.DotNetContainerApp.ContainerImage == ""(or moving the artifacts-path setup closer toprepareDotNetProjectImage) to avoid the wasted filesystem work and cleanup on every parallel deploy.
if BuildGateFromContext(ctx) != nil {
// Create a per-service temp directory for isolated intermediate outputs.
// Sanitize the service name to safe filesystem characters and keep the
// prefix short to avoid MAX_PATH issues on Windows when MSBuild nests
// obj/<config>/<tfm>/... underneath.
safeName := sanitizeTempDirName(serviceConfig.Name)
artifactsDir, mkdirErr := os.MkdirTemp("", "azd-"+safeName+"-")
if mkdirErr == nil {
defer func() {
if rmErr := os.RemoveAll(artifactsDir); rmErr != nil {
log.Printf("warning: failed to remove artifacts temp dir %s: %v", artifactsDir, rmErr)
}
}()
imageCtx = dotnet.ContextWithArtifactsPath(ctx, artifactsDir)
} else {
// Fallback: use the mutex to serialize image preparation only.
// We lock/unlock explicitly (not defer) so the Azure deployment
// portion that follows runs in parallel.
log.Printf("warning: failed to create artifacts temp dir for %s: %v; falling back to serial build",
serviceConfig.Name, mkdirErr)
buildGateMu = BuildGateFromContext(ctx)
}
}
cli/azd/pkg/project/service_target_dotnet_containerapp.go:189
- When
os.MkdirTempfails for one service but succeeds for sibling services in the same gate group, only the failing service acquires the shared mutex; the others run theirdotnet publishin parallel using their isolated--artifacts-pathdirectories. That is fine in isolation, but ifos.MkdirTempfails for two or more siblings simultaneously (e.g., disk-full affects all of them), those siblings serialize against each other while the successful siblings still publish to their isolated dirs in parallel — which is correct. However, the fallback comment "use the mutex to serialize image preparation only" overlooks that the mutex is shared across the whole gate group, so a single failing service will block on the mutex even though no other service is actually holding it (the successful siblings never lock). This is harmless but worth clarifying in the comment to avoid future confusion about the fallback's semantics.
} else {
// Fallback: use the mutex to serialize image preparation only.
// We lock/unlock explicitly (not defer) so the Azure deployment
// portion that follows runs in parallel.
log.Printf("warning: failed to create artifacts temp dir for %s: %v; falling back to serial build",
serviceConfig.Name, mkdirErr)
buildGateMu = BuildGateFromContext(ctx)
}
}
- Files reviewed: 10/10 changed files
- Comments generated: 2
When multiple Aspire services sharing <ProjectReference> dependencies are deployed in parallel, their concurrent dotnet publish invocations race on shared obj/ directories, producing MSB4018/IOException errors. Fix: pass --artifacts-path <per-service-temp-dir> to dotnet publish, which redirects ALL intermediate build outputs (obj/, bin/) for the project and its transitive references to an isolated directory. This eliminates the file race entirely while preserving full deploy parallelism. A mutex-based fallback (serializes only the dotnet publish phase) activates if the temp directory cannot be created. Non-Aspire projects are unaffected: the artifacts path and build gate are only injected when the service has a DotNetContainerApp configuration. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
662bef9 to
2ccd3f9
Compare
Azure Dev CLI Install InstructionsInstall scriptsMacOS/Linux
bash: pwsh: WindowsPowerShell install MSI install Standalone Binary
MSI
Documentationlearn.microsoft.com documentationtitle: Azure Developer CLI reference
|
wbreza
left a comment
There was a problem hiding this comment.
Code Review — PR #8195
fix: resolve parallel dotnet publish race in Aspire deploys
✅ What Works Well
- Clean two-tier design: Primary approach (
--artifacts-pathisolation) enables full parallelism; fallback (mutex serialization) handles edge cases gracefully - Good fallback logging:
log.Printfwarns when falling back to serial build — no silent failures - Explicit lock/unlock: Mutex released before Azure deployment begins, preserving parallelism where it matters
- Solid happy-path test coverage:
Test_Cli_ArtifactsPathContexttestsPublishContainerandBuildContainerLocalwith and without context - Clean context API:
ContextWithArtifactsPath/ArtifactsPathFromContextfollow standard Go context patterns
🔴 P1 — Significant
1. Windows MAX_PATH: Comment claims truncation but doesn't implement it
📍 service_target_dotnet_containerapp.go:172-173
The comment says "keep the prefix short to avoid MAX_PATH issues on Windows" but safeName is the full service name with no length truncation. With a long Aspire service name (e.g., myapp-apps-migrationconsole-worker), the temp path could exceed Windows' 260-char MAX_PATH, re-introducing the very failure mode the PR fixes.
Suggestion: Truncate safeName to ~20 chars or use a short hash.
🟡 P2 — Improvements
2. No test for mutex fallback path
📍 build_gate_test.go / service_target_dotnet_containerapp.go
No test simulates os.MkdirTemp failure to verify the mutex fallback works. This is a critical safety net with zero test coverage.
3. Tests verify topology but don't prove parallel execution
📍 service_graph_test.go / deploy_graph_test.go
Tests assert no deploy→deploy graph edges but never verify services actually execute concurrently. A test with sync primitives would prove the intended behavior.
4. sanitizeTempDirName missing edge cases
📍 build_gate_test.go:27-42
Missing: empty string, all-unsafe-characters, very long names (250+ chars with no truncation).
5. BuildGateFromContext coupling
📍 service_target_dotnet_containerapp.go:167
BuildGateFromContext(ctx) != nil is used as a proxy for "needs artifacts isolation," conflating the mutex mechanism with the detection of Aspire-gated services. A future non-Aspire .NET multi-service deploy needing isolation wouldn't trigger this path. Consider a more semantic signal.
6. Hidden cross-package context contract
📍 service_graph.go → build_gate.go → dotnet.go
Three packages coupled via implicit context keys. Consider adding a doc comment in build_gate.go documenting who sets these values, when, and what happens when absent.
🔵 P3 — Nits
- Naming:
BuildGateFromContextdoesn't convey "fallback" — considerFallbackBuildMutexFromContext - Collision potential:
sanitizeTempDirNamemaps different names to the same output (harmless due to random suffix but could confuse logs) - API ambiguity:
ArtifactsPathFromContextreturns""for both "not set" and "empty" — a(string, bool)return would be more idiomatic
📊 Summary
| Priority | Count | Key Concern |
|---|---|---|
| P1 | 1 | Windows MAX_PATH — truncation claimed but not implemented |
| P2 | 5 | Missing fallback tests, no parallelism proof, edge cases, coupling |
| P3 | 3 | Naming clarity, collision potential, API ambiguity |
Overall: Well-designed PR with a clean two-tier approach. The P1 (MAX_PATH) is a real risk on Windows. P2s are quality improvements suitable for a follow-up.
…8195) When multiple Aspire services sharing <ProjectReference> dependencies are deployed in parallel, their concurrent dotnet publish invocations race on shared obj/ directories, producing MSB4018/IOException errors. Fix: pass --artifacts-path <per-service-temp-dir> to dotnet publish, which redirects ALL intermediate build outputs (obj/, bin/) for the project and its transitive references to an isolated directory. This eliminates the file race entirely while preserving full deploy parallelism. A mutex-based fallback (serializes only the dotnet publish phase) activates if the temp directory cannot be created. Non-Aspire projects are unaffected: the artifacts path and build gate are only injected when the service has a DotNetContainerApp configuration. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: therealjohn <1501196+therealjohn@users.noreply.github.com>
Fixes #8177
Problem
When multiple Aspire services sharing
<ProjectReference>dependencies are deployed in parallel, their concurrentdotnet publishinvocations race on sharedobj/directories, producing MSB4018/IOException errors.Approach
Pass
--artifacts-path <per-service-temp-dir>todotnet publish, redirecting ALL intermediate build outputs (obj/, bin/) to an isolated directory per service. This eliminates the file race entirely while preserving full deploy parallelism.Why --artifacts-path?
Fallback: If temp dir creation fails (disk full, permissions), a mutex serializes only the
dotnet publishphase while the Azure deployment portion remains parallel.Changes
dotnet.go—appendArtifactsPath()+ context helpers (ContextWithArtifactsPath/ArtifactsPathFromContext)service_target_dotnet_containerapp.go— Deploy() creates per-service temp dir, injects via context; explicit lock/enable for mutex fallback (not defer, to avoid holding across Azure deployment)build_gate.go(new) — Context helpers for build gate mutexservice_graph.go— Injects mutex via context instead of graph edges; deploy steps remain parallel at graph levelNon-Aspire impact
None. The artifacts path and build gate are only injected when the service has a
DotNetContainerAppconfiguration. Non-Aspire services get vanilla context with no behavioral change.Future optimization
A
dotnet build --artifacts-pathpass once upfront, followed bydotnet publish --no-build, would eliminate redundant shared-library compilation. Left for a follow-up since the current approach is correct and the redundant compilation is fast relative to the Azure deployment step.