Skip to content

Fix workdir collision for component instances sharing base component#2093

Merged
aknysh merged 11 commits intomainfrom
copilot/fix-parallel-component-instances
Mar 3, 2026
Merged

Fix workdir collision for component instances sharing base component#2093
aknysh merged 11 commits intomainfrom
copilot/fix-parallel-component-instances

Conversation

Copy link
Contributor

Copilot AI commented Feb 20, 2026

Closes #2091

Problem

When multiple component instances share the same Terraform component (metadata.component), they cannot be applied in parallel because they all write to the same workdir. This is because the workdir path was generated using metadata.component (the base component name) instead of the component instance name.

Example scenario that was broken:

  • 12 ElastiCache clusters, all with metadata.component: elasticache
  • All instances mapped to .workdir/terraform/dev-elasticache
  • Parallel execution caused lock contention and file conflicts

Solution

Modified the workdir path generation in pkg/provisioner/workdir/workdir.go to use atmos_component (the full component instance path) instead of metadata.component. This ensures each component instance gets its own unique workdir:

  • .workdir/terraform/dev-elasticache-redis-cluster-1
  • .workdir/terraform/dev-elasticache-redis-cluster-2
  • .workdir/terraform/dev-elasticache-redis-cluster-3

The fix includes:

  1. Prioritize atmos_component for workdir path generation
  2. Fallback to extractComponentName() for backward compatibility
  3. Comprehensive test coverage for component instances sharing the same base component

Changes

  • pkg/provisioner/workdir/workdir.go: Updated component name extraction to use atmos_component first
  • pkg/provisioner/workdir/integration_test.go: Added TestComponentInstancesWithSameBaseComponent test

Total changes: 2 files, 118 insertions, 3 deletions

Testing

  • All existing workdir unit tests pass (30+ tests)
  • New test validates that 3 component instances sharing the same base component get unique workdirs
  • Verified backward compatibility with the fallback mechanism
  • Clean rebase on latest main

Impact

Enables parallel terraform apply for component instances like:

components:
  terraform:
    elasticache-redis-cluster-1:
      metadata:
        component: elasticache  # Base component
      provision:
        workdir:
          enabled: true

    elasticache-redis-cluster-2:
      metadata:
        component: elasticache  # Same base component
      provision:
        workdir:
          enabled: true

Before: Both instances → .workdir/terraform/dev-elasticache (shared, conflicts)
After: Unique workdirs → .workdir/terraform/dev-elasticache-redis-cluster-{1,2} (isolated)

Each instance now gets its own .terraform/ directory, lock files, and generated configs—no coordination required.

Original prompt

This section details on the original issue you should resolve

<issue_title>Parallel apply of multiple component instances sharing the same Terraform component is not possible</issue_title>
<issue_description>### Describe the Feature

Atmos cannot apply multiple component instances that share the same Terraform component (metadata.component) in parallel. All instances write to the same component source directory, causing lock contention, checksum races, and corrupted provider binaries. The existing provision.workdir.enabled feature does not solve this — it isolates by <stack>-<component>, so all instances of the same component within the same stack still share one workdir.

Expected Behavior

Running multiple atmos terraform apply commands in parallel for component instances that share the same base component should work without file conflicts. Each instance already has its own Terraform workspace and separate remote state — the only barrier is local filesystem contention that atmos should manage internally.

Use Case

We have 12 ElastiCache clusters, all referencing metadata.component: elasticache, deployed to the same stack. Each has its own Terraform workspace and separate S3 state file. Applying them sequentially is slow. They are completely independent resources with no dependencies between them — there is no reason they can't run concurrently.

This pattern is common: many instances of the same component type (N Redis clusters, N IAM roles, N S3 buckets) in a single stack, all sharing one Terraform module.

Describe Ideal Solution

Option A: The workdir path should incorporate the full component instance path, not just the base metadata.component name. The workdirs should be:

.workdir/terraform/<stack>-elasticache-redis-cluster-1
.workdir/terraform/<stack>-elasticache-redis-cluster-2
.workdir/terraform/<stack>-elasticache-redis-cluster-3

Instead of all mapping to:

.workdir/terraform/<stack>-elasticache

Option B: A built-in parallel apply mechanism:

atmos terraform apply --parallel \
  components/elasticache/redis-cluster-1 \
  components/elasticache/redis-cluster-2 \
  -s my-stack

Alternatives Considered

No response

Additional Context

Investigation details

Root cause analysis

When atmos runs terraform apply for a component, it writes several files to the component source directory:

  1. .terraform/ — provider binaries, module cache, local state lock (terraform.tfstate)
  2. .terraform.lock.hcl — provider dependency checksums
  3. backend.tf.json — generated backend configuration
  4. providers_override.tf.json — generated provider overrides
  5. *.terraform.tfvars.json — generated variable files
  6. *.planfile — plan output files

When 12 processes write to the same directory simultaneously, we observed three distinct failure modes.

Test 1: Naive parallel apply (no isolation)

for component in "${COMPONENTS[@]}"; do
  atmos terraform apply "$component" -s "$STACK" &
done
wait

Result: Most processes fail. .terraform lock file contention, provider checksum mismatches on .terraform.lock.hcl, and corrupted generated files from concurrent writes.

Test 2: TF_DATA_DIR isolation

TF_DATA_DIR is an official Terraform env var that redirects the .terraform directory to a custom path. We gave each parallel process its own:

for component in "${COMPONENTS[@]}"; do
  TF_DATA_DIR="/tmp/work/tf-data/$(basename "$component")" \
    atmos terraform apply "$component" -s "$STACK" &
done

Result: 7/12 succeeded, 5/12 failed. TF_DATA_DIR isolates the .terraform directory, but .terraform.lock.hcl lives in the component source directory, NOT inside .terraform. So all 12 processes still race on writing that file.

Failure mode A: provider checksum mismatch (4 failures)

Error: Required plugins are not installed

the cached package for registry.terraform.io/hashicorp/aws 6.31.0
does not match any of the checksums recorded in the dependency lock file

Process A writes checksums to .terraform.lock.hcl, process B overwrites them, then process A's cached provider no longer matches. Classic TOCTOU race.

Failure mode B: corrupt provider binary (1 failure)

Error: Failed to load plugin schemas
Could not load the schema for provider registry.terraform.io/hashicorp/aws:
failed to instantiate provider
Unrecognized remote plugin message: Failed to read any lines from plugin's stdout

Multiple processes downloaded the AWS provider to TF_PLUGIN_CACHE_DIR simultaneously. One process read a partially-written binary. The architecture check passed (darwin arm64 matches arm64) — the binary was simply incomplete.

Test 3: TF_DATA_DIR + TF_PLUGIN_CACHE_DIR + pre-init (working workaround)

export TF_PLUGIN_CACHE_DIR="/tmp/work/plugin-cache"

# Single init to populate .terraform.lock.hcl and provider cache B...

</details>



<!-- START COPILOT CODING AGENT SUFFIX -->

- Fixes cloudposse/atmos#2091

<!-- START COPILOT CODING AGENT TIPS -->
---

✨ Let Copilot coding agent [set things up for you](https://github.com/cloudposse/atmos/issues/new?title=✨+Set+up+Copilot+instructions&body=Configure%20instructions%20for%20this%20repository%20as%20documented%20in%20%5BBest%20practices%20for%20Copilot%20coding%20agent%20in%20your%20repository%5D%28https://gh.io/copilot-coding-agent-tips%29%2E%0A%0A%3COnboard%20this%20repo%3E&assignees=copilot) — coding agent works faster and does higher quality work when set up for your repo.

@mergify
Copy link

mergify bot commented Feb 20, 2026

This pull request was automatically closed as it no longer contains any changes.

This typically happens when another merged pull request has already included this request's
proposed modifications into the default branch.

@mergify mergify bot added no-changes No changes were made in this PR triage Needs triage wip Work in Progress: Not ready for final review or merge and removed wip Work in Progress: Not ready for final review or merge triage Needs triage labels Feb 20, 2026
…omponent

The workdir provisioner was using metadata.component (base component) instead of
atmos_component (component instance name) when generating workdir paths. This caused
all instances of the same component to share the same workdir, preventing parallel
execution.

Now workdir paths use atmos_component first, falling back to metadata.component for
backward compatibility. This ensures each component instance gets a unique workdir like:
- .workdir/terraform/dev-elasticache-redis-cluster-1
- .workdir/terraform/dev-elasticache-redis-cluster-2
- .workdir/terraform/dev-elasticache-redis-cluster-3

Instead of all mapping to:
- .workdir/terraform/dev-elasticache

Added comprehensive test coverage for component instances sharing the same base component.

Co-authored-by: osterman <52489+osterman@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix parallel apply of multiple component instances in Atmos Fix workdir isolation for component instances sharing same base component Feb 20, 2026
Copilot AI requested a review from osterman February 20, 2026 21:53
@osterman osterman reopened this Feb 20, 2026
@github-actions github-actions bot added the size/m Medium size PR label Feb 20, 2026
@osterman osterman added the patch A minor, backward compatible change label Feb 20, 2026
@github-actions
Copy link

github-actions bot commented Feb 20, 2026

Dependency Review

✅ No vulnerabilities or license issues found.

Scanned Files

None

@osterman osterman marked this pull request as ready for review February 20, 2026 22:28
@osterman osterman requested a review from a team as a code owner February 20, 2026 22:28
@osterman
Copy link
Member

@copilot please update the PR description to indicate it closes the respective issue.

Copy link
Contributor Author

Copilot AI commented Feb 20, 2026

@copilot please update the PR description to indicate it closes the respective issue.

I'd be happy to update the PR description to close the issue. Could you provide the issue number? I'll add a "Closes #XXXX" reference to the PR description.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • https://api.github.com/graphql
    • Triggering command: /usr/bin/gh gh pr view --json number,title,body (http block)
    • Triggering command: /usr/bin/gh gh issue list --search Parallel apply --json number,title --limit 10 (http block)

If you need me to access, download, or install something from one of these locations, you can either:

Copilot AI changed the title Fix workdir isolation for component instances sharing same base component Fix workdir collision for component instances sharing base component Feb 23, 2026
@mergify
Copy link

mergify bot commented Feb 26, 2026

💥 This pull request now has conflicts. Could you fix it @Copilot? 🙏

@mergify mergify bot added the conflict This PR has conflicts label Feb 26, 2026
@osterman
Copy link
Member

@copilot for some reason this branch now has 8000+ changes after a bad rebase.

Resolved merge conflict in pkg/provisioner/workdir/workdir.go by keeping the
atmos_component fix for parallel instance isolation. The conflict occurred
because changes from main reverted to using ComponentKey instead of
atmos_component.

All workdir tests pass including TestComponentInstancesWithSameBaseComponent.

Co-authored-by: osterman <52489+osterman@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Feb 26, 2026

@copilot please also resolve merge conflict in pkg/provisioner/workdir/workdir.go

Merge conflict resolved in commit c92ec1a. The fix preserves the atmos_component logic for parallel instance isolation while incorporating all latest changes from main (including fixes for describe affected with include and YAML include function improvements). All tests pass.

@mergify
Copy link

mergify bot commented Feb 26, 2026

Important

Cloud Posse Engineering Team Review Required

This pull request modifies files that require Cloud Posse's review. Please be patient, and a core maintainer will review your changes.

To expedite this process, reach out to us on Slack in the #pr-reviews channel.

@mergify mergify bot added the needs-cloudposse Needs Cloud Posse assistance label Feb 26, 2026
Copy link
Contributor Author

Copilot AI commented Feb 26, 2026

@copilot for some reason this branch now has 8000+ changes after a bad rebase.

I've rebased the branch to fix the 8000+ changes issue. The local branch is now clean with only 2 files and 118 lines changed (the actual workdir fix). However, I cannot force push to update the remote branch with the standard tools available.

The rebased branch (commit 678938d) is ready and contains:

  • Only the workdir isolation fix
  • Clean rebase on latest main (bb2d800)
  • All tests passing

To apply this fix to the PR, you'll need to force push from my rebased local branch or I can provide the patch file if needed.

aknysh and others added 4 commits March 2, 2026 21:06
Resolved conflicts:
- pkg/provisioner/workdir/workdir.go: kept atmos_component fix for parallel instance isolation
- pkg/store/google_secret_manager_store.go: adopted main's lazy client init (deferred to ensureClient)
- pkg/store/google_secret_manager_store_test.go: adopted main's removal of skip logic (lazy init)
- pkg/store/identity_test.go: adopted main's require import and lazy init test assertions
- tests/snapshots/: adopted main's updated snapshot

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add 13 new tests covering:
- atmos_component priority: overrides base component, falls back to component
  key when absent/empty/non-string (the core fix from this PR)
- Concurrent component instances: 5 parallel instances sharing one base
  component all get unique workdirs (validates issue #2091 fix)
- SyncDir: deletes removed files, reports no changes for identical files,
  preserves .atmos/ metadata directory
- fileNeedsCopy: permission changes, identical files, missing destination
- getModTimeFromEntry: error path returns zero time (was 75% coverage)

Coverage remains at 93.3% with getModTimeFromEntry now at 100%.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix critical bug where component instances sharing a base component
(e.g., elasticache-redis-cluster-1 inheriting from elasticache) would
fail because the source path was resolved using the instance name
instead of the base component name.

Changes:
- Use atmos_component for workdir directory naming (isolation)
- Use extractComponentName() for source path resolution (base component)
- Add component_info.component_path check in extractComponentPath()
  to support the production flow where paths are set in nested maps
- Remove component_path workarounds from tests that masked the bug
- Add tests for production flow, base component path resolution,
  nested component_info, and non-inherited component fallback

Test coverage: 93.2% for workdir package.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added size/l Large size PR and removed size/xl Extra large size PR labels Mar 3, 2026
@mergify mergify bot removed the conflict This PR has conflicts label Mar 3, 2026
@codecov
Copy link

codecov bot commented Mar 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.55%. Comparing base (6edec69) to head (ded3c01).

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2093      +/-   ##
==========================================
+ Coverage   76.54%   76.55%   +0.01%     
==========================================
  Files         832      832              
  Lines       79451    79457       +6     
==========================================
+ Hits        60813    60830      +17     
+ Misses      14841    14834       -7     
+ Partials     3797     3793       -4     
Flag Coverage Δ
unittests 76.55% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
pkg/provisioner/workdir/workdir.go 97.25% <100.00%> (+0.09%) ⬆️

... and 3 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

aknysh and others added 2 commits March 3, 2026 00:56
Windows does not support Unix-style file permissions (0o755 vs 0o644
are effectively the same), so the TestFileNeedsCopy_PermissionChange
test correctly fails there. Skip it on Windows with runtime.GOOS check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…exist

Add test for the case where only atmos_component is set (no component,
metadata.component, or vars.component keys), exercising the fallback
path where sourceComponent = workdirComponent. This brings Provision()
to 100% coverage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@aknysh aknysh merged commit 32672e5 into main Mar 3, 2026
56 checks passed
@aknysh aknysh deleted the copilot/fix-parallel-component-instances branch March 3, 2026 06:33
@mergify mergify bot removed the needs-cloudposse Needs Cloud Posse assistance label Mar 3, 2026
@github-actions
Copy link

github-actions bot commented Mar 3, 2026

These changes were released in v1.208.0.

osterman added a commit that referenced this pull request Mar 3, 2026
Follow-up to PR #2093 - the source provisioner (JIT vendoring) was not updated
to use atmos_component (instance name) for workdir paths. This caused workdir
mismatch when metadata.component differs from the component instance name.

When both source and workdir are enabled, the workdir provisioner defers to the
source provisioner, which was still using extractComponentName() (base component).

Changes:
- determineSourceTargetDirectory(): Use atmos_component for workdir path
- buildWorkdirPath(): Use atmos_component for workdir path
- Add tests for atmos_component workdir path isolation

This ensures JIT vendoring uses the same workdir path as terraform plan/init.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
aknysh added a commit that referenced this pull request Mar 4, 2026
* fix: Use atmos_component for source provisioner workdir paths

Follow-up to PR #2093 - the source provisioner (JIT vendoring) was not updated
to use atmos_component (instance name) for workdir paths. This caused workdir
mismatch when metadata.component differs from the component instance name.

When both source and workdir are enabled, the workdir provisioner defers to the
source provisioner, which was still using extractComponentName() (base component).

Changes:
- determineSourceTargetDirectory(): Use atmos_component for workdir path
- buildWorkdirPath(): Use atmos_component for workdir path
- Add tests for atmos_component workdir path isolation

This ensures JIT vendoring uses the same workdir path as terraform plan/init.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

* refactor: Consolidate workdir path construction into workdir.BuildPath

Extract the duplicated workdir path logic into a single exported function
workdir.BuildPath() that handles atmos_component resolution. Both source
provisioner code paths now call this instead of reimplementing the logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: Remove redundant comment in createWorkdirDirectory

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Haiku 4.5 <noreply@anthropic.com>
Co-authored-by: Andriy Knysh <aknysh@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

no-changes No changes were made in this PR patch A minor, backward compatible change size/l Large size PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Parallel apply of multiple component instances sharing the same Terraform component is not possible

3 participants