Description
Describe the Feature
Atmos cannot apply multiple component instances that share the same Terraform component (metadata.component) in parallel. All instances write to the same component source directory, causing lock contention, checksum races, and corrupted provider binaries. The existing provision.workdir.enabled feature does not solve this — it isolates by <stack>-<component>, so all instances of the same component within the same stack still share one workdir.
Expected Behavior
Running multiple atmos terraform apply commands in parallel for component instances that share the same base component should work without file conflicts. Each instance already has its own Terraform workspace and separate remote state — the only barrier is local filesystem contention that atmos should manage internally.
Use Case
We have 12 ElastiCache clusters, all referencing metadata.component: elasticache, deployed to the same stack. Each has its own Terraform workspace and separate S3 state file. Applying them sequentially is slow. They are completely independent resources with no dependencies between them — there is no reason they can't run concurrently.
This pattern is common: many instances of the same component type (N Redis clusters, N IAM roles, N S3 buckets) in a single stack, all sharing one Terraform module.
Describe Ideal Solution
Option A: The workdir path should incorporate the full component instance path, not just the base metadata.component name. The workdirs should be:
.workdir/terraform/<stack>-elasticache-redis-cluster-1
.workdir/terraform/<stack>-elasticache-redis-cluster-2
.workdir/terraform/<stack>-elasticache-redis-cluster-3
Instead of all mapping to:
.workdir/terraform/<stack>-elasticache
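To make the difference concrete, here is a minimal sketch of the two derivations as plain string construction. The variable names (`stack`, `base_component`, `instance`) are illustrative stand-ins, not real atmos internals:

```shell
# Illustrative only: contrasting today's workdir derivation with Option A.
stack="my-stack-dev"
base_component="elasticache"   # metadata.component, shared by all instances
instance="redis-cluster-1"     # the component instance path

# Current: the instance name is dropped, so every instance collides here.
current_workdir=".workdir/terraform/${stack}-${base_component}"
# Option A: the instance path is included, so each instance is isolated.
proposed_workdir=".workdir/terraform/${stack}-${base_component}-${instance}"

echo "$current_workdir"
echo "$proposed_workdir"
```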
Option B: A built-in parallel apply mechanism:
atmos terraform apply --parallel \
  components/elasticache/redis-cluster-1 \
  components/elasticache/redis-cluster-2 \
  -s my-stack
Alternatives Considered
No response
Additional Context
Investigation details
Root cause analysis
When atmos runs terraform apply for a component, it writes several files to the component source directory:
.terraform/ — provider binaries, module cache, local state lock (terraform.tfstate)
.terraform.lock.hcl — provider dependency checksums
backend.tf.json — generated backend configuration
providers_override.tf.json — generated provider overrides
*.terraform.tfvars.json — generated variable files
*.planfile — plan output files
When 12 processes write to the same directory simultaneously, we observed three distinct failure modes.
Test 1: Naive parallel apply (no isolation)
for component in "${COMPONENTS[@]}"; do
  atmos terraform apply "$component" -s "$STACK" &
done
wait
Result: Most processes fail. .terraform lock file contention, provider checksum mismatches on .terraform.lock.hcl, and corrupted generated files from concurrent writes.
Test 2: TF_DATA_DIR isolation
TF_DATA_DIR is an official Terraform env var that redirects the .terraform directory to a custom path. We gave each parallel process its own:
for component in "${COMPONENTS[@]}"; do
  TF_DATA_DIR="/tmp/work/tf-data/$(basename "$component")" \
    atmos terraform apply "$component" -s "$STACK" &
done
Result: 7/12 succeeded, 5/12 failed. TF_DATA_DIR isolates the .terraform directory, but .terraform.lock.hcl lives in the component source directory, NOT inside .terraform. So all 12 processes still race on writing that file.
Failure mode A: provider checksum mismatch (4 failures)
Error: Required plugins are not installed
the cached package for registry.terraform.io/hashicorp/aws 6.31.0
does not match any of the checksums recorded in the dependency lock file
Process A writes checksums to .terraform.lock.hcl, process B overwrites them, then process A's cached provider no longer matches. Classic TOCTOU race.
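The race can be reduced to a deterministic miniature. The file below is a stand-in temp file, not a real .terraform.lock.hcl; it just shows the last-writer-wins sequence described above:

```shell
# Deterministic miniature of the last-writer-wins race on the lock file.
lockfile="$(mktemp)"
echo "checksum-A" > "$lockfile"   # process A records its provider checksum
echo "checksum-B" > "$lockfile"   # process B overwrites the whole file
observed="$(cat "$lockfile")"     # process A re-reads before verifying its cache
if [ "$observed" != "checksum-A" ]; then
  echo "A's cached provider no longer matches the lock file"
fi
rm -f "$lockfile"
```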
Failure mode B: corrupt provider binary (1 failure)
Error: Failed to load plugin schemas
Could not load the schema for provider registry.terraform.io/hashicorp/aws:
failed to instantiate provider
Unrecognized remote plugin message: Failed to read any lines from plugin's stdout
Multiple processes downloaded the AWS provider to TF_PLUGIN_CACHE_DIR simultaneously. One process read a partially-written binary. The architecture check passed (darwin arm64 matches arm64) — the binary was simply incomplete.
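A sketch of the "populate once, read many" pattern that avoids this failure mode (and that Test 3 below relies on). Here `mkdir` serves as a crude atomic mutex; this is an illustration of the principle, not something atmos or terraform does themselves:

```shell
# Populate the cache exactly once, then let all readers share it.
cache="$(mktemp -d)/plugin-cache"
mkdir -p "$cache"
if mkdir "$cache/.populating" 2>/dev/null; then
  # Only the first process wins the atomic mkdir and writes the binary in full.
  printf 'complete-provider-binary' > "$cache/provider.bin"
fi
# Later readers see a complete file instead of a partial download.
cat "$cache/provider.bin"
echo
```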
Test 3: TF_DATA_DIR + TF_PLUGIN_CACHE_DIR + pre-init (working workaround)
export TF_PLUGIN_CACHE_DIR="/tmp/work/plugin-cache"
# Single init to populate .terraform.lock.hcl and provider cache BEFORE parallel runs
TF_DATA_DIR="/tmp/work/tf-data/first" \
  atmos terraform init "${COMPONENTS[0]}" -s "$STACK"
# Now parallel applies — lock file and cache are already warm
for component in "${COMPONENTS[@]}"; do
  TF_DATA_DIR="/tmp/work/tf-data/$(basename "$component")" \
    atmos terraform apply "$component" -s "$STACK" &
done
Result: 12/12 succeeded. The pre-init populates .terraform.lock.hcl and the plugin cache before any parallel process runs. Subsequent inits read the lock file and symlink from the cache — no concurrent writes.
This works but is a hack. It requires the caller to understand Terraform internals (TF_DATA_DIR, TF_PLUGIN_CACHE_DIR) and manage temp directories, cleanup, and process lifecycle outside of atmos.
Test 4: provision.workdir.enabled: true (atmos native feature — DOES NOT WORK for this case)
After discovering the Component Workdir Isolation feature, we enabled it on all 12 components:
components/elasticache/redis-cluster-1:
  metadata:
    component: elasticache
  provision:
    workdir:
      enabled: true
Then ran a simple parallel apply with no TF_DATA_DIR workarounds:
for component in "${COMPONENTS[@]}"; do
  atmos terraform apply "$component" -s "$STACK" &
done
Result: 1/12 succeeded, 11/12 failed. All 12 components resolved to the exact same workdir:
.workdir/terraform/<stack>-elasticache
The workdir path is derived from <stack>-<component>, where <component> is the metadata.component value (elasticache). Since all 12 instances share the same stack and the same base component, they all map to one directory.
The 11 failures all hit the same local state lock:
Error: Error locking state: Error acquiring the state lock
Error message: resource temporarily unavailable
Lock Info:
  ID:        2c072bde-8527-00a3-49bb-7940faa90d7f
  Path:      .terraform/terraform.tfstate
  Operation: backend from plan
  Version:   1.14.3
The workdir feature solves a different problem: same component across different stacks (e.g. dev-vpc vs prod-vpc → different workdirs). It does not solve multiple instances of the same component within the same stack, because the workdir path doesn't incorporate the component instance path — only the base component name.
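The Test 4 collision is visible from path construction alone. A small sketch mirroring the observed behavior (paths illustrative):

```shell
# With the current <stack>-<component> derivation, every instance of the same
# base component in one stack resolves to a single directory.
stack="my-stack-dev"
base="elasticache"   # metadata.component for all instances
distinct=$(for instance in redis-cluster-1 redis-cluster-2 redis-cluster-3; do
  echo ".workdir/terraform/${stack}-${base}"
done | sort -u | wc -l | tr -d ' ')
echo "3 instances -> ${distinct} distinct workdir(s)"
```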
Current workaround
Our working solution is a bash script that combines TF_DATA_DIR + TF_PLUGIN_CACHE_DIR + a serial pre-init step. It works but shouldn't be necessary — atmos should handle this natively.
I'll share our parallel-apply shell script here, since others may find it useful:
#!/bin/bash
#
# Apply multiple atmos components in parallel.
#
# Usage:
#   ./scripts/multiple-applies.sh <stack> <component> [component ...]
#
# Example:
#   ./scripts/multiple-applies.sh my-stack-dev \
#     infrastructure/dev/us-west-2/elasticache/redis-cluster-1 \
#     infrastructure/dev/us-west-2/elasticache/redis-cluster-2
set -e

STACK="$1"; shift 2>/dev/null || true
COMPONENTS=("$@")

if [ -z "$STACK" ] || [ ${#COMPONENTS[@]} -eq 0 ]; then
  echo "Usage: $0 <stack> <component> [component ...]"
  exit 1
fi

WORK_DIR="/tmp/multiple-applies-$$"
LOG_DIR="${WORK_DIR}/logs"
mkdir -p "$LOG_DIR"

export TF_PLUGIN_CACHE_DIR="${WORK_DIR}/plugin-cache"
mkdir -p "$TF_PLUGIN_CACHE_DIR"

PIDS=()
NAMES=()

cleanup() {
  echo ""
  echo "Interrupted — killing background jobs..."
  for pid in "${PIDS[@]}"; do kill "$pid" 2>/dev/null; done
  wait
  exit 1
}
trap cleanup INT TERM

echo "Applying ${#COMPONENTS[@]} components to ${STACK} in parallel..."

# Single init to warm up plugin cache and .terraform.lock.hcl before parallel runs
echo "Initializing providers..."
TF_DATA_DIR="${WORK_DIR}/tf-data/$(basename "${COMPONENTS[0]}")" \
  atmos terraform init "${COMPONENTS[0]}" -s "$STACK"

for component in "${COMPONENTS[@]}"; do
  name=$(basename "$component")
  echo "Starting: ${name}"
  TF_DATA_DIR="${WORK_DIR}/tf-data/${name}" \
    atmos terraform apply "$component" -s "$STACK" \
    > "$LOG_DIR/${name}.log" 2>&1 &
  PIDS+=($!)
  NAMES+=("$name")
done

TOTAL=${#COMPONENTS[@]}
echo ""
echo "All ${TOTAL} applies launched. To follow a specific component:"
echo "  tail -f ${LOG_DIR}/<name>.log"
echo ""

FAILED=0
for i in "${!PIDS[@]}"; do
  if wait "${PIDS[$i]}"; then
    echo "[SUCCESS] ${NAMES[$i]} ($((i + 1))/${TOTAL})"
  else
    echo "[FAILED] ${NAMES[$i]} ($((i + 1))/${TOTAL}) — see ${LOG_DIR}/${NAMES[$i]}.log"
    FAILED=$((FAILED + 1))
  fi
done

echo ""
echo "=========================================="
if [ "$FAILED" -eq 0 ]; then
  echo "All ${TOTAL} components applied successfully!"
else
  echo "${FAILED}/${TOTAL} components failed."
fi
echo "Logs: ${LOG_DIR}"
echo "Cleanup: rm -rf ${WORK_DIR}"
echo "=========================================="

exit "$FAILED"