Fix workdir collision for component instances sharing base component#2093
Fix workdir collision for component instances sharing base component#2093
Conversation
|
This pull request was automatically closed as it no longer contains any changes. This typically happens when another merged pull request has already included this request's |
…omponent The workdir provisioner was using metadata.component (base component) instead of atmos_component (component instance name) when generating workdir paths. This caused all instances of the same component to share the same workdir, preventing parallel execution. Now workdir paths use atmos_component first, falling back to metadata.component for backward compatibility. This ensures each component instance gets a unique workdir like: - .workdir/terraform/dev-elasticache-redis-cluster-1 - .workdir/terraform/dev-elasticache-redis-cluster-2 - .workdir/terraform/dev-elasticache-redis-cluster-3 Instead of all mapping to: - .workdir/terraform/dev-elasticache Added comprehensive test coverage for component instances sharing the same base component. Co-authored-by: osterman <52489+osterman@users.noreply.github.com>
Dependency Review✅ No vulnerabilities or license issues found.Scanned FilesNone |
|
@copilot please update the PR description to indicate it closes the respective issue. |
I'd be happy to update the PR description to close the issue. Could you provide the issue number? I'll add a "Closes #XXXX" reference to the PR description. Warning Firewall rules blocked me from connecting to one or more addresses (expand for details)I tried to connect to the following addresses, but was blocked by firewall rules:
If you need me to access, download, or install something from one of these locations, you can either:
|
|
💥 This pull request now has conflicts. Could you fix it @Copilot? 🙏 |
|
@copilot for some reason this branch now has 8000+ changes after a bad rebase. |
Resolved merge conflict in pkg/provisioner/workdir/workdir.go by keeping the atmos_component fix for parallel instance isolation. The conflict occurred because changes from main reverted to using ComponentKey instead of atmos_component. All workdir tests pass including TestComponentInstancesWithSameBaseComponent. Co-authored-by: osterman <52489+osterman@users.noreply.github.com>
Merge conflict resolved in commit c92ec1a. The fix preserves the atmos_component logic for parallel instance isolation while incorporating all latest changes from main (including fixes for describe affected with include and YAML include function improvements). All tests pass. |
|
Important Cloud Posse Engineering Team Review RequiredThis pull request modifies files that require Cloud Posse's review. Please be patient, and a core maintainer will review your changes. To expedite this process, reach out to us on Slack in the |
I've rebased the branch to fix the 8000+ changes issue. The local branch is now clean with only 2 files and 118 lines changed (the actual workdir fix). However, I cannot force push to update the remote branch with the standard tools available. The rebased branch (commit 678938d) is ready and contains:
To apply this fix to the PR, you'll need to force push from my rebased local branch or I can provide the patch file if needed. |
Resolved conflicts: - pkg/provisioner/workdir/workdir.go: kept atmos_component fix for parallel instance isolation - pkg/store/google_secret_manager_store.go: adopted main's lazy client init (deferred to ensureClient) - pkg/store/google_secret_manager_store_test.go: adopted main's removal of skip logic (lazy init) - pkg/store/identity_test.go: adopted main's require import and lazy init test assertions - tests/snapshots/: adopted main's updated snapshot Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add 13 new tests covering: - atmos_component priority: overrides base component, falls back to component key when absent/empty/non-string (the core fix from this PR) - Concurrent component instances: 5 parallel instances sharing one base component all get unique workdirs (validates issue #2091 fix) - SyncDir: deletes removed files, reports no changes for identical files, preserves .atmos/ metadata directory - fileNeedsCopy: permission changes, identical files, missing destination - getModTimeFromEntry: error path returns zero time (was 75% coverage) Coverage remains at 93.3% with getModTimeFromEntry now at 100%. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix critical bug where component instances sharing a base component (e.g., elasticache-redis-cluster-1 inheriting from elasticache) would fail because the source path was resolved using the instance name instead of the base component name. Changes: - Use atmos_component for workdir directory naming (isolation) - Use extractComponentName() for source path resolution (base component) - Add component_info.component_path check in extractComponentPath() to support the production flow where paths are set in nested maps - Remove component_path workarounds from tests that masked the bug - Add tests for production flow, base component path resolution, nested component_info, and non-inherited component fallback Test coverage: 93.2% for workdir package. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #2093 +/- ##
==========================================
+ Coverage 76.54% 76.55% +0.01%
==========================================
Files 832 832
Lines 79451 79457 +6
==========================================
+ Hits 60813 60830 +17
+ Misses 14841 14834 -7
+ Partials 3797 3793 -4
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
Windows does not support Unix-style file permissions (0o755 vs 0o644 are effectively the same), so the TestFileNeedsCopy_PermissionChange test correctly fails there. Skip it on Windows with runtime.GOOS check. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…exist Add test for the case where only atmos_component is set (no component, metadata.component, or vars.component keys), exercising the fallback path where sourceComponent = workdirComponent. This brings Provision() to 100% coverage. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
These changes were released in v1.208.0. |
Follow-up to PR #2093 - the source provisioner (JIT vendoring) was not updated to use atmos_component (instance name) for workdir paths. This caused workdir mismatch when metadata.component differs from the component instance name. When both source and workdir are enabled, the workdir provisioner defers to the source provisioner, which was still using extractComponentName() (base component). Changes: - determineSourceTargetDirectory(): Use atmos_component for workdir path - buildWorkdirPath(): Use atmos_component for workdir path - Add tests for atmos_component workdir path isolation This ensures JIT vendoring uses the same workdir path as terraform plan/init. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
* fix: Use atmos_component for source provisioner workdir paths Follow-up to PR #2093 - the source provisioner (JIT vendoring) was not updated to use atmos_component (instance name) for workdir paths. This caused workdir mismatch when metadata.component differs from the component instance name. When both source and workdir are enabled, the workdir provisioner defers to the source provisioner, which was still using extractComponentName() (base component). Changes: - determineSourceTargetDirectory(): Use atmos_component for workdir path - buildWorkdirPath(): Use atmos_component for workdir path - Add tests for atmos_component workdir path isolation This ensures JIT vendoring uses the same workdir path as terraform plan/init. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com> * refactor: Consolidate workdir path construction into workdir.BuildPath Extract the duplicated workdir path logic into a single exported function workdir.BuildPath() that handles atmos_component resolution. Both source provisioner code paths now call this instead of reimplementing the logic. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * chore: Remove redundant comment in createWorkdirDirectory Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Haiku 4.5 <noreply@anthropic.com> Co-authored-by: Andriy Knysh <aknysh@users.noreply.github.com>
Closes #2091
Problem
When multiple component instances share the same Terraform component (
metadata.component), they cannot be applied in parallel because they all write to the same workdir. This is because the workdir path was generated usingmetadata.component(the base component name) instead of the component instance name.Example scenario that was broken:
metadata.component: elasticache.workdir/terraform/dev-elasticacheSolution
Modified the workdir path generation in
pkg/provisioner/workdir/workdir.goto useatmos_component(the full component instance path) instead ofmetadata.component. This ensures each component instance gets its own unique workdir:.workdir/terraform/dev-elasticache-redis-cluster-1.workdir/terraform/dev-elasticache-redis-cluster-2.workdir/terraform/dev-elasticache-redis-cluster-3The fix includes:
atmos_componentfor workdir path generationextractComponentName()for backward compatibilityChanges
atmos_componentfirstTestComponentInstancesWithSameBaseComponenttestTotal changes: 2 files, 118 insertions, 3 deletions
Testing
Impact
Enables parallel
terraform applyfor component instances like:Before: Both instances →
.workdir/terraform/dev-elasticache(shared, conflicts)After: Unique workdirs →
.workdir/terraform/dev-elasticache-redis-cluster-{1,2}(isolated)Each instance now gets its own
.terraform/directory, lock files, and generated configs—no coordination required.Original prompt
This section details on the original issue you should resolve
<issue_title>Parallel apply of multiple component instances sharing the same Terraform component is not possible</issue_title>
<issue_description>### Describe the Feature
Atmos cannot apply multiple component instances that share the same Terraform component (
metadata.component) in parallel. All instances write to the same component source directory, causing lock contention, checksum races, and corrupted provider binaries. The existingprovision.workdir.enabledfeature does not solve this — it isolates by<stack>-<component>, so all instances of the same component within the same stack still share one workdir.Expected Behavior
Running multiple
atmos terraform applycommands in parallel for component instances that share the same base component should work without file conflicts. Each instance already has its own Terraform workspace and separate remote state — the only barrier is local filesystem contention that atmos should manage internally.Use Case
We have 12 ElastiCache clusters, all referencing
metadata.component: elasticache, deployed to the same stack. Each has its own Terraform workspace and separate S3 state file. Applying them sequentially is slow. They are completely independent resources with no dependencies between them — there is no reason they can't run concurrently.This pattern is common: many instances of the same component type (N Redis clusters, N IAM roles, N S3 buckets) in a single stack, all sharing one Terraform module.
Describe Ideal Solution
Option A: The workdir path should incorporate the full component instance path, not just the base
metadata.componentname. The workdirs should be:Instead of all mapping to:
Option B: A built-in parallel apply mechanism:
Alternatives Considered
No response
Additional Context
Investigation details
Root cause analysis
When atmos runs
terraform applyfor a component, it writes several files to the component source directory:.terraform/— provider binaries, module cache, local state lock (terraform.tfstate).terraform.lock.hcl— provider dependency checksumsbackend.tf.json— generated backend configurationproviders_override.tf.json— generated provider overrides*.terraform.tfvars.json— generated variable files*.planfile— plan output filesWhen 12 processes write to the same directory simultaneously, we observed three distinct failure modes.
Test 1: Naive parallel apply (no isolation)
Result: Most processes fail.
.terraformlock file contention, provider checksum mismatches on.terraform.lock.hcl, and corrupted generated files from concurrent writes.Test 2:
TF_DATA_DIRisolationTF_DATA_DIRis an official Terraform env var that redirects the.terraformdirectory to a custom path. We gave each parallel process its own:Result: 7/12 succeeded, 5/12 failed.
TF_DATA_DIRisolates the.terraformdirectory, but.terraform.lock.hcllives in the component source directory, NOT inside.terraform. So all 12 processes still race on writing that file.Failure mode A: provider checksum mismatch (4 failures)
Process A writes checksums to
.terraform.lock.hcl, process B overwrites them, then process A's cached provider no longer matches. Classic TOCTOU race.Failure mode B: corrupt provider binary (1 failure)
Multiple processes downloaded the AWS provider to
TF_PLUGIN_CACHE_DIRsimultaneously. One process read a partially-written binary. The architecture check passed (darwin arm64 matches arm64) — the binary was simply incomplete.Test 3:
TF_DATA_DIR+TF_PLUGIN_CACHE_DIR+ pre-init (working workaround)