Parallel apply of multiple component instances sharing the same Terraform component is not possible #2091

@ThiagoFelippi

Description

Describe the Feature

Atmos cannot apply multiple component instances that share the same Terraform component (metadata.component) in parallel. All instances write to the same component source directory, causing lock contention, checksum races, and corrupted provider binaries. The existing provision.workdir.enabled feature does not solve this — it isolates by <stack>-<component>, so all instances of the same component within the same stack still share one workdir.

Expected Behavior

Running multiple atmos terraform apply commands in parallel for component instances that share the same base component should work without file conflicts. Each instance already has its own Terraform workspace and separate remote state — the only barrier is local filesystem contention that atmos should manage internally.

Use Case

We have 12 ElastiCache clusters, all referencing metadata.component: elasticache, deployed to the same stack. Each has its own Terraform workspace and separate S3 state file. Applying them sequentially is slow. They are completely independent resources with no dependencies between them — there is no reason they can't run concurrently.

This pattern is common: many instances of the same component type (N Redis clusters, N IAM roles, N S3 buckets) in a single stack, all sharing one Terraform module.

Describe Ideal Solution

Option A: The workdir path should incorporate the full component instance path, not just the base metadata.component name. The workdirs should be:

.workdir/terraform/<stack>-elasticache-redis-cluster-1
.workdir/terraform/<stack>-elasticache-redis-cluster-2
.workdir/terraform/<stack>-elasticache-redis-cluster-3

Instead of all mapping to:

.workdir/terraform/<stack>-elasticache

Option B: A built-in parallel apply mechanism:

atmos terraform apply --parallel \
  components/elasticache/redis-cluster-1 \
  components/elasticache/redis-cluster-2 \
  -s my-stack
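
As a rough sketch of what Option B would orchestrate, the fan-out can be approximated today with `xargs -P`. The `atmos` invocation is replaced by `echo` here so the concurrency pattern is visible without touching infrastructure; the component names are illustrative:

```shell
# Fan out N independent applies, 3 at a time. Each {} is one component
# instance; swap the echo for the real atmos call in practice.
printf '%s\n' redis-cluster-1 redis-cluster-2 redis-cluster-3 |
  xargs -P 3 -I{} sh -c \
    'echo "would run: atmos terraform apply components/elasticache/{} -s my-stack"'
```

This only schedules the processes, of course; without the workdir or TF_DATA_DIR isolation described below, the parallel runs still collide on shared files.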

Alternatives Considered

No response

Additional Context

Investigation details

Root cause analysis

When atmos runs terraform apply for a component, it writes several files to the component source directory:

  1. .terraform/ — provider binaries, module cache, local state lock (terraform.tfstate)
  2. .terraform.lock.hcl — provider dependency checksums
  3. backend.tf.json — generated backend configuration
  4. providers_override.tf.json — generated provider overrides
  5. *.terraform.tfvars.json — generated variable files
  6. *.planfile — plan output files

When 12 processes write to the same directory simultaneously, we observed three distinct failure modes.

Test 1: Naive parallel apply (no isolation)

for component in "${COMPONENTS[@]}"; do
  atmos terraform apply "$component" -s "$STACK" &
done
wait

Result: Most processes fail. .terraform lock file contention, provider checksum mismatches on .terraform.lock.hcl, and corrupted generated files from concurrent writes.

Test 2: TF_DATA_DIR isolation

TF_DATA_DIR is an official Terraform env var that redirects the .terraform directory to a custom path. We gave each parallel process its own:

for component in "${COMPONENTS[@]}"; do
  TF_DATA_DIR="/tmp/work/tf-data/$(basename "$component")" \
    atmos terraform apply "$component" -s "$STACK" &
done

Result: 7/12 succeeded, 5/12 failed. TF_DATA_DIR isolates the .terraform directory, but .terraform.lock.hcl lives in the component source directory, NOT inside .terraform. So all 12 processes still race on writing that file.
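
The layout problem can be seen without running Terraform at all. A throwaway mock (directory names illustrative) of what TF_DATA_DIR does and does not isolate:

```shell
# TF_DATA_DIR relocates only the .terraform/ directory. The dependency lock
# file lives beside the component sources, so per-process TF_DATA_DIRs leave
# it shared between all parallel runs.
root=$(mktemp -d)
mkdir -p "$root/components/elasticache"        # shared source dir
mkdir -p "$root/tf-data/a" "$root/tf-data/b"   # per-process TF_DATA_DIR targets
touch "$root/components/elasticache/.terraform.lock.hcl"
ls -A "$root/components/elasticache"           # the lock file stays here
```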

Failure mode A: provider checksum mismatch (4 failures)

Error: Required plugins are not installed

the cached package for registry.terraform.io/hashicorp/aws 6.31.0
does not match any of the checksums recorded in the dependency lock file

Process A writes checksums to .terraform.lock.hcl, process B overwrites them, then process A's cached provider no longer matches. Classic TOCTOU race.
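
The outcome of that race can be replayed deterministically with plain shell (no Terraform involved; the checksum strings are made up):

```shell
# Process A records its checksums, process B clobbers the whole file,
# and A's later verification against its cached provider fails.
lock=$(mktemp)
echo 'aws 6.31.0 sha256:aaaa' > "$lock"   # A writes its checksums
echo 'aws 6.31.0 sha256:bbbb' > "$lock"   # B overwrites the file
grep -q 'sha256:aaaa' "$lock" || echo "A: cached package does not match lock file"
```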

Failure mode B: corrupt provider binary (1 failure)

Error: Failed to load plugin schemas
Could not load the schema for provider registry.terraform.io/hashicorp/aws:
failed to instantiate provider
Unrecognized remote plugin message: Failed to read any lines from plugin's stdout

Multiple processes downloaded the AWS provider to TF_PLUGIN_CACHE_DIR simultaneously. One process read a partially-written binary. The architecture check passed (darwin arm64 matches arm64) — the binary was simply incomplete.

Test 3: TF_DATA_DIR + TF_PLUGIN_CACHE_DIR + pre-init (working workaround)

export TF_PLUGIN_CACHE_DIR="/tmp/work/plugin-cache"

# Single init to populate .terraform.lock.hcl and provider cache BEFORE parallel runs
TF_DATA_DIR="/tmp/work/tf-data/first" \
  atmos terraform init "${COMPONENTS[0]}" -s "$STACK"

# Now parallel applies — lock file and cache are already warm
for component in "${COMPONENTS[@]}"; do
  TF_DATA_DIR="/tmp/work/tf-data/$(basename "$component")" \
    atmos terraform apply "$component" -s "$STACK" &
done

Result: 12/12 succeeded. The pre-init populates .terraform.lock.hcl and the plugin cache before any parallel process runs. Subsequent inits read the lock file and symlink from the cache — no concurrent writes.

This works but is a hack. It requires the caller to understand Terraform internals (TF_DATA_DIR, TF_PLUGIN_CACHE_DIR) and manage temp directories, cleanup, and process lifecycle outside of atmos.

Test 4: provision.workdir.enabled: true (atmos native feature — DOES NOT WORK for this case)

After discovering the Component Workdir Isolation feature, we enabled it on all 12 components:

components/elasticache/redis-cluster-1:
  metadata:
    component: elasticache
  provision:
    workdir:
      enabled: true

Then ran a simple parallel apply with no TF_DATA_DIR workarounds:

for component in "${COMPONENTS[@]}"; do
  atmos terraform apply "$component" -s "$STACK" &
done

Result: 1/12 succeeded, 11/12 failed. All 12 components resolved to the exact same workdir:

.workdir/terraform/<stack>-elasticache

The workdir path is derived from <stack>-<component>, where <component> is the metadata.component value (elasticache). Since all 12 instances share the same stack and the same base component, they all map to one directory.

The 11 failures all hit the same local state lock:

Error: Error locking state: Error acquiring the state lock
Error message: resource temporarily unavailable
Lock Info:
  ID:        2c072bde-8527-00a3-49bb-7940faa90d7f
  Path:      .terraform/terraform.tfstate
  Operation: backend from plan
  Version:   1.14.3

The workdir feature solves a different problem: same component across different stacks (e.g. dev-vpc vs prod-vpc → different workdirs). It does not solve multiple instances of the same component within the same stack, because the workdir path doesn't incorporate the component instance path — only the base component name.
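
The keying difference boils down to one variable in the path, shown here with illustrative names (stack and instance names are placeholders):

```shell
stack="my-stack"
base="elasticache"           # metadata.component, shared by all 12 instances
instance="redis-cluster-1"   # component instance path, unique per instance
echo ".workdir/terraform/${stack}-${base}"               # today: same dir for all 12
echo ".workdir/terraform/${stack}-${base}-${instance}"   # Option A: one dir per instance
```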

Current workaround

Our working solution is a bash script that combines TF_DATA_DIR + TF_PLUGIN_CACHE_DIR + a serial pre-init step. It works but shouldn't be necessary — atmos should handle this natively.

Here is the script, in case it's useful to anyone else:

#!/bin/bash
#
# Apply multiple atmos components in parallel.
#
# Usage:
#   ./scripts/multiple-applies.sh <stack> <component> [component ...]
#
# Example:
#   ./scripts/multiple-applies.sh my-stack-dev \
#     infrastructure/dev/us-west-2/elasticache/redis-cluster-1 \
#     infrastructure/dev/us-west-2/elasticache/redis-cluster-2

set -e

STACK="$1"; shift 2>/dev/null || true
COMPONENTS=("$@")

if [ -z "$STACK" ] || [ ${#COMPONENTS[@]} -eq 0 ]; then
  echo "Usage: $0 <stack> <component> [component ...]"
  exit 1
fi

WORK_DIR="/tmp/multiple-applies-$$"
LOG_DIR="${WORK_DIR}/logs"
mkdir -p "$LOG_DIR"

export TF_PLUGIN_CACHE_DIR="${WORK_DIR}/plugin-cache"
mkdir -p "$TF_PLUGIN_CACHE_DIR"

PIDS=()
NAMES=()

cleanup() {
  echo ""
  echo "Interrupted — killing background jobs..."
  for pid in "${PIDS[@]}"; do kill "$pid" 2>/dev/null; done
  wait
  exit 1
}
trap cleanup INT TERM

echo "Applying ${#COMPONENTS[@]} components to ${STACK} in parallel..."

# Single init to warm up plugin cache and .terraform.lock.hcl before parallel runs
echo "Initializing providers..."
TF_DATA_DIR="${WORK_DIR}/tf-data/$(basename "${COMPONENTS[0]}")" \
  atmos terraform init "${COMPONENTS[0]}" -s "$STACK"

for component in "${COMPONENTS[@]}"; do
  name=$(basename "$component")
  echo "Starting: ${name}"
  TF_DATA_DIR="${WORK_DIR}/tf-data/${name}" \
    atmos terraform apply "$component" -s "$STACK" \
    > "$LOG_DIR/${name}.log" 2>&1 &
  PIDS+=($!)
  NAMES+=("$name")
done

TOTAL=${#COMPONENTS[@]}
echo ""
echo "All ${TOTAL} applies launched. To follow a specific component:"
echo "  tail -f ${LOG_DIR}/<name>.log"
echo ""

FAILED=0
for i in "${!PIDS[@]}"; do
  if wait "${PIDS[$i]}"; then
    echo "[SUCCESS] ${NAMES[$i]} ($((i + 1))/${TOTAL})"
  else
    echo "[FAILED]  ${NAMES[$i]} ($((i + 1))/${TOTAL}) — see ${LOG_DIR}/${NAMES[$i]}.log"
    FAILED=$((FAILED + 1))
  fi
done

echo ""
echo "=========================================="
if [ "$FAILED" -eq 0 ]; then
  echo "All ${TOTAL} components applied successfully!"
else
  echo "${FAILED}/${TOTAL} components failed."
fi
echo "Logs: ${LOG_DIR}"
echo "Cleanup: rm -rf ${WORK_DIR}"
echo "=========================================="

exit "$FAILED"
