Fix workdir collision for component instances sharing base component by Copilot · Pull Request #2093 · cloudposse/atmos

Copilot · 2026-02-20T21:42:32Z

Closes #2091

Problem

When multiple component instances share the same Terraform component (metadata.component), they cannot be applied in parallel because they all write to the same workdir. This is because the workdir path was generated using metadata.component (the base component name) instead of the component instance name.

Example scenario that was broken:

12 ElastiCache clusters, all with metadata.component: elasticache
All instances mapped to .workdir/terraform/dev-elasticache
Parallel execution caused lock contention and file conflicts

Solution

Modified the workdir path generation in pkg/provisioner/workdir/workdir.go to use atmos_component (the full component instance path) instead of metadata.component. This ensures each component instance gets its own unique workdir:

.workdir/terraform/dev-elasticache-redis-cluster-1
.workdir/terraform/dev-elasticache-redis-cluster-2
.workdir/terraform/dev-elasticache-redis-cluster-3

The fix includes:

Prioritize atmos_component for workdir path generation
Fall back to extractComponentName() for backward compatibility
Comprehensive test coverage for component instances sharing the same base component
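The priority-then-fallback rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in pkg/provisioner/workdir/workdir.go: the function names `workdirComponent` and `workdirPath` and the flat `map[string]any` config are assumptions made for the example. Only the `atmos_component` key and the `.workdir/terraform/<stack>-<component>` layout come from the PR description.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// workdirComponent picks the name used for the workdir directory.
// It prefers the instance name stored under "atmos_component" and falls back
// to the base component name when that key is absent, empty, or not a string.
// (Hypothetical helper; the real extraction logic may differ.)
func workdirComponent(cfg map[string]any, baseComponent string) string {
	if name, ok := cfg["atmos_component"].(string); ok && name != "" {
		return name
	}
	return baseComponent
}

// workdirPath builds .workdir/terraform/<stack>-<component>, the layout
// shown in the PR description.
func workdirPath(stack string, cfg map[string]any, baseComponent string) string {
	return filepath.Join(".workdir", "terraform", stack+"-"+workdirComponent(cfg, baseComponent))
}

func main() {
	// Two instances that share the base component "elasticache".
	for _, name := range []string{"elasticache-redis-cluster-1", "elasticache-redis-cluster-2"} {
		cfg := map[string]any{"atmos_component": name}
		fmt.Println(workdirPath("dev", cfg, "elasticache"))
	}
	// No atmos_component set: the base component name is used instead.
	fmt.Println(workdirPath("dev", map[string]any{}, "elasticache"))
}
```

Under these assumptions, the two instances resolve to different directories, while a configuration without `atmos_component` keeps the previous `<stack>-<base component>` path.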

Changes

pkg/provisioner/workdir/workdir.go: Updated component name extraction to use atmos_component first
pkg/provisioner/workdir/integration_test.go: Added TestComponentInstancesWithSameBaseComponent test

Total changes: 2 files, 118 insertions, 3 deletions

Testing

All existing workdir unit tests pass (30+ tests)
New test validates that 3 component instances sharing the same base component get unique workdirs
Verified backward compatibility with the fallback mechanism
Clean rebase on latest main

Impact

Enables parallel terraform apply for component instances like:

components:
  terraform:
    elasticache-redis-cluster-1:
      metadata:
        component: elasticache  # Base component
      provision:
        workdir:
          enabled: true

    elasticache-redis-cluster-2:
      metadata:
        component: elasticache  # Same base component
      provision:
        workdir:
          enabled: true

Before: Both instances → .workdir/terraform/dev-elasticache (shared, conflicts)
After: Unique workdirs → .workdir/terraform/dev-elasticache-redis-cluster-{1,2} (isolated)

Each instance now gets its own .terraform/ directory, lock files, and generated configs—no coordination required.

Original prompt

This section details the original issue you should resolve

<issue_title>Parallel apply of multiple component instances sharing the same Terraform component is not possible</issue_title>
<issue_description>### Describe the Feature

Atmos cannot apply multiple component instances that share the same Terraform component (metadata.component) in parallel. All instances write to the same component source directory, causing lock contention, checksum races, and corrupted provider binaries. The existing provision.workdir.enabled feature does not solve this — it isolates by <stack>-<component>, so all instances of the same component within the same stack still share one workdir.

Expected Behavior

Running multiple atmos terraform apply commands in parallel for component instances that share the same base component should work without file conflicts. Each instance already has its own Terraform workspace and separate remote state — the only barrier is local filesystem contention that atmos should manage internally.

Use Case

We have 12 ElastiCache clusters, all referencing metadata.component: elasticache, deployed to the same stack. Each has its own Terraform workspace and separate S3 state file. Applying them sequentially is slow. They are completely independent resources with no dependencies between them — there is no reason they can't run concurrently.

This pattern is common: many instances of the same component type (N Redis clusters, N IAM roles, N S3 buckets) in a single stack, all sharing one Terraform module.

Describe Ideal Solution

Option A: The workdir path should incorporate the full component instance path, not just the base metadata.component name. The workdirs should be:

.workdir/terraform/<stack>-elasticache-redis-cluster-1
.workdir/terraform/<stack>-elasticache-redis-cluster-2
.workdir/terraform/<stack>-elasticache-redis-cluster-3

Instead of all mapping to:

.workdir/terraform/<stack>-elasticache

Option B: A built-in parallel apply mechanism:

atmos terraform apply --parallel \
  components/elasticache/redis-cluster-1 \
  components/elasticache/redis-cluster-2 \
  -s my-stack

Alternatives Considered

No response

Additional Context

Investigation details

Root cause analysis

When atmos runs terraform apply for a component, it writes several files to the component source directory:

.terraform/ — provider binaries, module cache, local state lock (terraform.tfstate)
.terraform.lock.hcl — provider dependency checksums
backend.tf.json — generated backend configuration
providers_override.tf.json — generated provider overrides
*.terraform.tfvars.json — generated variable files
*.planfile — plan output files

When 12 processes write to the same directory simultaneously, we observed three distinct failure modes.

Test 1: Naive parallel apply (no isolation)

for component in "${COMPONENTS[@]}"; do
  atmos terraform apply "$component" -s "$STACK" &
done
wait

Result: Most processes fail. .terraform lock file contention, provider checksum mismatches on .terraform.lock.hcl, and corrupted generated files from concurrent writes.

Test 2: `TF_DATA_DIR` isolation

TF_DATA_DIR is an official Terraform env var that redirects the .terraform directory to a custom path. We gave each parallel process its own:

for component in "${COMPONENTS[@]}"; do
  TF_DATA_DIR="/tmp/work/tf-data/$(basename "$component")" \
    atmos terraform apply "$component" -s "$STACK" &
done
wait

Result: 7/12 succeeded, 5/12 failed. TF_DATA_DIR isolates the .terraform directory, but .terraform.lock.hcl lives in the component source directory, NOT inside .terraform. So all 12 processes still race on writing that file.

Failure mode A: provider checksum mismatch (4 failures)

Error: Required plugins are not installed

the cached package for registry.terraform.io/hashicorp/aws 6.31.0
does not match any of the checksums recorded in the dependency lock file

Process A writes checksums to .terraform.lock.hcl, process B overwrites them, then process A's cached provider no longer matches. Classic TOCTOU race.
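The interleaving described above can be reproduced deterministically with a small simulation. This is only a sketch of the race as the issue describes it: the `lockFile` type, the `verify` function, and the checksum strings are made up for illustration and do not reflect Terraform's actual lock file format or verification code.

```go
package main

import "fmt"

// lockFile stands in for a shared .terraform.lock.hcl: provider -> recorded checksum.
// (Hypothetical model; the real file holds several hashes per provider.)
type lockFile map[string]string

// verify mimics the check that a cached package matches the lock file entry.
func verify(lock lockFile, provider, cachedChecksum string) error {
	if lock[provider] != cachedChecksum {
		return fmt.Errorf("cached package for %s does not match the lock file", provider)
	}
	return nil
}

func main() {
	const aws = "registry.terraform.io/hashicorp/aws"

	// Shared directory: both processes write to the same lock file.
	shared := lockFile{}
	shared[aws] = "h1:checksum-from-A" // process A records its checksum
	shared[aws] = "h1:checksum-from-B" // process B overwrites the entry
	// Process A verifies its cached package after B's write and fails.
	fmt.Println("shared lock file:", verify(shared, aws, "h1:checksum-from-A"))

	// One lock file per instance: B's write cannot affect A's check.
	lockA, lockB := lockFile{}, lockFile{}
	lockA[aws] = "h1:checksum-from-A"
	lockB[aws] = "h1:checksum-from-B"
	fmt.Println("isolated lock file:", verify(lockA, aws, "h1:checksum-from-A"))
}
```

The second half shows why a per-instance directory removes the race: the gap between writing and checking still exists, but no other process writes to the same file during it.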

Failure mode B: corrupt provider binary (1 failure)

Error: Failed to load plugin schemas
Could not load the schema for provider registry.terraform.io/hashicorp/aws:
failed to instantiate provider
Unrecognized remote plugin message: Failed to read any lines from plugin's stdout

Multiple processes downloaded the AWS provider to TF_PLUGIN_CACHE_DIR simultaneously. One process read a partially-written binary. The architecture check passed (darwin arm64 matches arm64) — the binary was simply incomplete.

Test 3: `TF_DATA_DIR` + `TF_PLUGIN_CACHE_DIR` + pre-init (working workaround)

export TF_PLUGIN_CACHE_DIR="/tmp/work/plugin-cache"

# Single init to populate .terraform.lock.hcl and provider cache B...

</details>



<!-- START COPILOT CODING AGENT SUFFIX -->

- Fixes cloudposse/atmos#2091


mergify · 2026-02-20T21:43:11Z

This pull request was automatically closed as it no longer contains any changes.

This typically happens when another merged pull request has already included this request's
proposed modifications into the default branch.

…omponent

The workdir provisioner was using metadata.component (base component) instead of atmos_component (component instance name) when generating workdir paths. This caused all instances of the same component to share the same workdir, preventing parallel execution.

Now workdir paths use atmos_component first, falling back to metadata.component for backward compatibility. This ensures each component instance gets a unique workdir like:

- .workdir/terraform/dev-elasticache-redis-cluster-1
- .workdir/terraform/dev-elasticache-redis-cluster-2
- .workdir/terraform/dev-elasticache-redis-cluster-3

Instead of all mapping to:

- .workdir/terraform/dev-elasticache

Added comprehensive test coverage for component instances sharing the same base component.

Co-authored-by: osterman <52489+osterman@users.noreply.github.com>

github-actions · 2026-02-20T22:25:44Z

Dependency Review

✅ No vulnerabilities or license issues found.

Scanned Files

None

osterman · 2026-02-20T22:32:35Z

@copilot please update the PR description to indicate it closes the respective issue.

Copilot · 2026-02-20T22:33:44Z

> @copilot please update the PR description to indicate it closes the respective issue.

I'd be happy to update the PR description to close the issue. Could you provide the issue number? I'll add a "Closes #XXXX" reference to the PR description.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

https://api.github.com/graphql
- Triggering command: /usr/bin/gh gh pr view --json number,title,body (http block)
- Triggering command: /usr/bin/gh gh issue list --search Parallel apply --json number,title --limit 10 (http block)

If you need me to access, download, or install something from one of these locations, you can either:

Configure Actions setup steps to set up my environment, which run before the firewall is enabled
Add the appropriate URLs or hosts to the custom allowlist in this repository's Copilot coding agent settings (admins only)

mergify · 2026-02-26T17:43:51Z

💥 This pull request now has conflicts. Could you fix it @Copilot? 🙏

osterman · 2026-02-26T21:30:10Z

@copilot for some reason this branch now has 8000+ changes after a bad rebase.

Resolved merge conflict in pkg/provisioner/workdir/workdir.go by keeping the atmos_component fix for parallel instance isolation. The conflict occurred because changes from main reverted to using ComponentKey instead of atmos_component. All workdir tests pass including TestComponentInstancesWithSameBaseComponent. Co-authored-by: osterman <52489+osterman@users.noreply.github.com>

Copilot · 2026-02-26T21:33:19Z

> @copilot please also resolve merge conflict in pkg/provisioner/workdir/workdir.go

Merge conflict resolved in commit c92ec1a. The fix preserves the atmos_component logic for parallel instance isolation while incorporating all latest changes from main (including fixes for describe affected with include and YAML include function improvements). All tests pass.

mergify · 2026-02-26T21:33:49Z

Important

Cloud Posse Engineering Team Review Required

This pull request modifies files that require Cloud Posse's review. Please be patient, and a core maintainer will review your changes.

To expedite this process, reach out to us on Slack in the #pr-reviews channel.

Copilot · 2026-02-26T21:41:12Z

> @copilot for some reason this branch now has 8000+ changes after a bad rebase.

I've rebased the branch to fix the 8000+ changes issue. The local branch is now clean with only 2 files and 118 lines changed (the actual workdir fix). However, I cannot force push to update the remote branch with the standard tools available.

The rebased branch (commit 678938d) is ready and contains:

Only the workdir isolation fix
Clean rebase on latest main (bb2d800)
All tests passing

To apply this fix to the PR, you'll need to force push from my rebased local branch or I can provide the patch file if needed.

Resolved conflicts:

- pkg/provisioner/workdir/workdir.go: kept atmos_component fix for parallel instance isolation
- pkg/store/google_secret_manager_store.go: adopted main's lazy client init (deferred to ensureClient)
- pkg/store/google_secret_manager_store_test.go: adopted main's removal of skip logic (lazy init)
- pkg/store/identity_test.go: adopted main's require import and lazy init test assertions
- tests/snapshots/: adopted main's updated snapshot

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add 13 new tests covering:

- atmos_component priority: overrides base component, falls back to component key when absent/empty/non-string (the core fix from this PR)
- Concurrent component instances: 5 parallel instances sharing one base component all get unique workdirs (validates issue #2091 fix)
- SyncDir: deletes removed files, reports no changes for identical files, preserves .atmos/ metadata directory
- fileNeedsCopy: permission changes, identical files, missing destination
- getModTimeFromEntry: error path returns zero time (was 75% coverage)

Coverage remains at 93.3% with getModTimeFromEntry now at 100%.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
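The concurrent-instances check mentioned in this commit can be approximated with the sketch below. It is not the test added in the PR: `provisionAll` and `runDemo` are hypothetical helpers, and the only details taken from the thread are the `<stack>-<instance>` directory naming and the idea of five parallel instances sharing one base component.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// provisionAll creates one workdir per instance concurrently and returns the
// set of distinct directories that were created successfully.
func provisionAll(root, stack string, instances []string) (map[string]bool, error) {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		firstErr error
	)
	dirs := make(map[string]bool)
	for _, inst := range instances {
		wg.Add(1)
		go func(inst string) {
			defer wg.Done()
			dir := filepath.Join(root, ".workdir", "terraform", stack+"-"+inst)
			err := os.MkdirAll(dir, 0o755)
			if err == nil {
				// Each instance writes its own generated file; nothing is shared.
				err = os.WriteFile(filepath.Join(dir, "backend.tf.json"), []byte("{}\n"), 0o644)
			}
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				if firstErr == nil {
					firstErr = err
				}
				return
			}
			dirs[dir] = true
		}(inst)
	}
	wg.Wait()
	return dirs, firstErr
}

// runDemo provisions five instances of one base component in a temporary
// directory and reports how many instances and unique workdirs resulted.
func runDemo() (instances, unique int, err error) {
	root, err := os.MkdirTemp("", "workdir-demo-")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(root)

	names := make([]string, 5)
	for i := range names {
		names[i] = fmt.Sprintf("elasticache-redis-cluster-%d", i+1)
	}
	dirs, err := provisionAll(root, "dev", names)
	return len(names), len(dirs), err
}

func main() {
	n, u, err := runDemo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%d instances, %d unique workdirs\n", n, u)
}
```

The mutex only guards the bookkeeping map. The filesystem writes need no coordination because every goroutine works in its own directory, which is the property the fix is meant to provide.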

Fix critical bug where component instances sharing a base component (e.g., elasticache-redis-cluster-1 inheriting from elasticache) would fail because the source path was resolved using the instance name instead of the base component name.

Changes:

- Use atmos_component for workdir directory naming (isolation)
- Use extractComponentName() for source path resolution (base component)
- Add component_info.component_path check in extractComponentPath() to support the production flow where paths are set in nested maps
- Remove component_path workarounds from tests that masked the bug
- Add tests for production flow, base component path resolution, nested component_info, and non-inherited component fallback

Test coverage: 93.2% for workdir package.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
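The split between the two names can be sketched as follows: the instance name decides where the workdir lives, while the base component decides which Terraform source is copied into it. This is an illustrative sketch only. `resolveNames`, `baseComponent`, and the lookup order of the `component`, `metadata.component`, and `vars.component` keys are assumptions, and the real extractComponentName() may behave differently.

```go
package main

import "fmt"

// names holds the two identifiers the commit message distinguishes.
type names struct {
	Workdir string // instance name, used for the workdir directory
	Source  string // base component, used to locate the Terraform source
}

// str returns m[key] if it is a string, otherwise "".
func str(m map[string]any, key string) string {
	s, _ := m[key].(string)
	return s
}

// baseComponent looks for the base component name.
// Assumption: "component" is checked first, then metadata.component, then vars.component.
func baseComponent(cfg map[string]any) string {
	if s := str(cfg, "component"); s != "" {
		return s
	}
	for _, section := range []string{"metadata", "vars"} {
		if m, ok := cfg[section].(map[string]any); ok {
			if s := str(m, "component"); s != "" {
				return s
			}
		}
	}
	return ""
}

// resolveNames prefers atmos_component for the workdir name and the base
// component for the source; each falls back to the other when missing.
func resolveNames(cfg map[string]any) names {
	base := baseComponent(cfg)
	workdir := str(cfg, "atmos_component")
	if workdir == "" {
		workdir = base
	}
	source := base
	if source == "" {
		source = workdir
	}
	return names{Workdir: workdir, Source: source}
}

func main() {
	cfg := map[string]any{
		"atmos_component": "elasticache-redis-cluster-1",
		"metadata":        map[string]any{"component": "elasticache"},
	}
	n := resolveNames(cfg)
	fmt.Printf("workdir name: %s, source component: %s\n", n.Workdir, n.Source)
}
```

Using the instance name for both purposes is the bug this commit describes: the workdir would be unique, but the source lookup would point at a component directory that does not exist.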

codecov · 2026-03-03T04:03:00Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.55%. Comparing base (6edec69) to head (ded3c01).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2093      +/-   ##
==========================================
+ Coverage   76.54%   76.55%   +0.01%     
==========================================
  Files         832      832              
  Lines       79451    79457       +6     
==========================================
+ Hits        60813    60830      +17     
+ Misses      14841    14834       -7     
+ Partials     3797     3793       -4

| Flag | Coverage Δ | |
|---|---|---|
| unittests | `76.55% <100.00%> (+0.01%)` | ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Files with missing lines | Coverage Δ | |
|---|---|---|
| pkg/provisioner/workdir/workdir.go | `97.25% <100.00%> (+0.09%)` | ⬆️ |

... and 3 files with indirect coverage changes

Windows does not support Unix-style file permissions (0o755 vs 0o644 are effectively the same), so the TestFileNeedsCopy_PermissionChange test correctly fails there. Skip it on Windows with runtime.GOOS check. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…exist Add test for the case where only atmos_component is set (no component, metadata.component, or vars.component keys), exercising the fallback path where sourceComponent = workdirComponent. This brings Provision() to 100% coverage. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-03-03T22:35:12Z

These changes were released in v1.208.0.

Follow-up to PR #2093 - the source provisioner (JIT vendoring) was not updated to use atmos_component (instance name) for workdir paths. This caused workdir mismatch when metadata.component differs from the component instance name.

When both source and workdir are enabled, the workdir provisioner defers to the source provisioner, which was still using extractComponentName() (base component).

Changes:

- determineSourceTargetDirectory(): Use atmos_component for workdir path
- buildWorkdirPath(): Use atmos_component for workdir path
- Add tests for atmos_component workdir path isolation

This ensures JIT vendoring uses the same workdir path as terraform plan/init.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

* fix: Use atmos_component for source provisioner workdir paths

  Follow-up to PR #2093 - the source provisioner (JIT vendoring) was not updated to use atmos_component (instance name) for workdir paths. This caused workdir mismatch when metadata.component differs from the component instance name.

  When both source and workdir are enabled, the workdir provisioner defers to the source provisioner, which was still using extractComponentName() (base component).

  Changes:

  - determineSourceTargetDirectory(): Use atmos_component for workdir path
  - buildWorkdirPath(): Use atmos_component for workdir path
  - Add tests for atmos_component workdir path isolation

  This ensures JIT vendoring uses the same workdir path as terraform plan/init.

  Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

* refactor: Consolidate workdir path construction into workdir.BuildPath

  Extract the duplicated workdir path logic into a single exported function workdir.BuildPath() that handles atmos_component resolution. Both source provisioner code paths now call this instead of reimplementing the logic.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: Remove redundant comment in createWorkdirDirectory

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Haiku 4.5 <noreply@anthropic.com>
Co-authored-by: Andriy Knysh <aknysh@users.noreply.github.com>

Initial plan

b6073d1

Copilot AI assigned Copilot and osterman Feb 20, 2026

Copilot started work on behalf of osterman February 20, 2026 21:42 View session

mergify bot closed this Feb 20, 2026

mergify bot added the no-changes (No changes were made in this PR), triage (Needs triage), and wip (Work in Progress: Not ready for final review or merge) labels and removed the wip and triage labels Feb 20, 2026

Copilot AI changed the title ~~[WIP] Fix parallel apply of multiple component instances in Atmos~~ Fix workdir isolation for component instances sharing same base component Feb 20, 2026

Copilot AI requested a review from osterman February 20, 2026 21:53

Copilot finished work on behalf of osterman February 20, 2026 21:53

osterman reopened this Feb 20, 2026

github-actions bot added the size/m Medium size PR label Feb 20, 2026

osterman added the patch A minor, backward compatible change label Feb 20, 2026

osterman marked this pull request as ready for review February 20, 2026 22:28

osterman requested a review from a team as a code owner February 20, 2026 22:28

Copilot started work on behalf of osterman February 20, 2026 22:32 View session

Copilot finished work on behalf of osterman February 20, 2026 22:34

Copilot started work on behalf of osterman February 23, 2026 17:41 View session

Copilot AI changed the title ~~Fix workdir isolation for component instances sharing same base component~~ Fix workdir collision for component instances sharing base component Feb 23, 2026

Copilot finished work on behalf of osterman February 23, 2026 17:43

mergify bot added the conflict This PR has conflicts label Feb 26, 2026

Copilot started work on behalf of osterman February 26, 2026 21:29 View session

mergify bot added the needs-cloudposse Needs Cloud Posse assistance label Feb 26, 2026

Copilot finished work on behalf of osterman February 26, 2026 21:34

Copilot started work on behalf of osterman February 26, 2026 21:34 View session

aknysh and others added 4 commits March 2, 2026 21:06

fix tests, increase test coverage

beea46d

github-actions bot added size/l Large size PR and removed size/xl Extra large size PR labels Mar 3, 2026

mergify bot removed the conflict This PR has conflicts label Mar 3, 2026

[autofix.ci] apply automated fixes

fb99c44

aknysh and others added 2 commits March 3, 2026 00:56

aknysh approved these changes Mar 3, 2026

aknysh merged commit 32672e5 into main Mar 3, 2026
56 checks passed

aknysh deleted the copilot/fix-parallel-component-instances branch March 3, 2026 06:33

mergify bot removed the needs-cloudposse Needs Cloud Posse assistance label Mar 3, 2026

zack-is-cool mentioned this pull request Mar 3, 2026

JIT vendoring (source pull) writes to a different workdir than terraform plan/init when metadata.component differs from the component instance name #2134

Closed

osterman mentioned this pull request Mar 3, 2026

fix: Use atmos_component for source provisioner workdir paths #2137

Merged

3 participants