ci(e2e): add multi-AZ fallback for EC2 instance creation #491

Merged
DorianZheng merged 1 commit into main from ci/e2e-multi-az-fallback on May 5, 2026

@DorianZheng

Member

Summary

  • Loop through comma-separated subnet IDs (across AZs) when launching a new EC2 instance, so an InsufficientInstanceCapacity error in one AZ automatically retries the next
  • Env var renamed from EC2_SUBNET_ID → EC2_SUBNET_IDS; it reads vars.AWS_SUBNET_IDS, falling back to vars.AWS_SUBNET_ID for backward compatibility
  • Each failed AZ emits a ::warning annotation; all-AZ failure emits ::error and exits
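
The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the workflow's actual code: `launch_with_az_fallback`, the `launch` callable, and the `InsufficientCapacity` exception are hypothetical names standing in for whatever the job uses to call EC2. Only the `::warning::` and `::error::` prefixes are real GitHub Actions workflow commands.

```python
import sys
from typing import Callable, Iterable, Optional, TextIO


class InsufficientCapacity(Exception):
    """Hypothetical stand-in for an InsufficientInstanceCapacity launch failure."""


def launch_with_az_fallback(
    subnet_ids: Iterable[str],
    launch: Callable[[str], str],
    out: TextIO = sys.stdout,
) -> Optional[str]:
    """Try each subnet in order; return the first instance ID, or None if all fail."""
    tried = []
    for subnet_id in subnet_ids:
        try:
            # `launch` is assumed to return an instance ID on success.
            return launch(subnet_id)
        except InsufficientCapacity as exc:
            tried.append(subnet_id)
            # Workflow command: rendered as a warning annotation on the run.
            print(f"::warning::No capacity in {subnet_id}: {exc}", file=out)
    # Every subnet failed: emit an error annotation and let the caller exit.
    print(f"::error::Launch failed in all subnets: {','.join(tried)}", file=out)
    return None
```

In this sketch the caller is expected to exit non-zero when `None` comes back, which mirrors the "emits ::error and exits" behavior in the summary. Errors other than capacity shortages are deliberately not caught, so they still fail the job immediately.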

Setup required

Set the AWS_SUBNET_IDS repository variable to a comma-separated list of subnet IDs across different AZs in the same VPC, e.g.:

subnet-0c83ca11429698695,subnet-xxxxxxxxx,subnet-yyyyyyyyy
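
As a rough illustration of how a value in this format could be resolved into a list, including the fallback to the legacy single-subnet variable, consider the sketch below. The function name `resolve_subnet_ids` and its parameters are hypothetical, and trimming whitespace and skipping empty entries are assumptions rather than documented behavior of the workflow.

```python
from typing import List


def resolve_subnet_ids(subnet_ids: str = "", legacy_subnet_id: str = "") -> List[str]:
    """Prefer the comma-separated list; fall back to the single legacy ID."""
    # An empty AWS_SUBNET_IDS falls through to the legacy AWS_SUBNET_ID value.
    raw = subnet_ids or legacy_subnet_id
    # Assumption: tolerate stray spaces and empty entries such as "a,,b".
    return [part.strip() for part in raw.split(",") if part.strip()]
```

With only the legacy value set, the result is a one-element list, so a loop over it behaves like the old single-subnet launch.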

Test plan

  • Verify single-subnet backward compat: only AWS_SUBNET_ID is set (no AWS_SUBNET_IDS) → works as before
  • Set AWS_SUBNET_IDS with multiple subnets → first available AZ is used
  • Simulate capacity failure in first AZ → falls through to second subnet

When AWS lacks capacity in a single AZ, the E2E runner job fails with
InsufficientInstanceCapacity. Loop through comma-separated subnet IDs
(one per AZ) until launch succeeds.

Reads vars.AWS_SUBNET_IDS (comma-separated), falls back to the
existing vars.AWS_SUBNET_ID for backward compatibility.
@DorianZheng DorianZheng merged commit 7b5de14 into main May 5, 2026
9 checks passed
@DorianZheng DorianZheng deleted the ci/e2e-multi-az-fallback branch May 5, 2026 15:34
DorianZheng added a commit that referenced this pull request May 10, 2026
The workflow's multi-AZ fallback (#491) iterates over `AWS_SUBNET_IDS`,
but the setup script only saved one subnet (`AWS_SUBNET_ID`), so the
fallback degenerated to a single-AZ attempt. A run today hit
InsufficientInstanceCapacity for c8i.4xlarge in us-east-1f and gave up
even though five other AZs were available.

Enumerate every public subnet in the VPC, comma-join, and write
`AWS_SUBNET_IDS`. Delete the legacy `AWS_SUBNET_ID` variable so the
workflow's `vars.AWS_SUBNET_IDS || vars.AWS_SUBNET_ID` fallback can't
silently revert to a narrower pool.
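
The enumerate-and-join step could look something like the sketch below. The setup script itself is not shown here, so this is only an assumption-laden illustration: `join_public_subnet_ids` is a hypothetical helper, the input is assumed to have the shape of the `Subnets` list returned by EC2's DescribeSubnets, and "public" is approximated by the `MapPublicIpOnLaunch` flag, which may not be the criterion the real script uses.

```python
from typing import Dict, List


def join_public_subnet_ids(subnets: List[Dict]) -> str:
    """Return the IDs of public subnets as one comma-separated string."""
    # Assumption: a subnet counts as public if it auto-assigns public IPs.
    public = [s for s in subnets if s.get("MapPublicIpOnLaunch")]
    # Sort by AZ, then ID, so the stored value is stable between runs.
    public.sort(key=lambda s: (s.get("AvailabilityZone", ""), s["SubnetId"]))
    return ",".join(s["SubnetId"] for s in public)


# Made-up sample data in the DescribeSubnets response shape.
sample = [
    {"SubnetId": "subnet-bbb", "AvailabilityZone": "us-east-1b", "MapPublicIpOnLaunch": True},
    {"SubnetId": "subnet-ppp", "AvailabilityZone": "us-east-1a", "MapPublicIpOnLaunch": False},
    {"SubnetId": "subnet-aaa", "AvailabilityZone": "us-east-1a", "MapPublicIpOnLaunch": True},
]
joined = join_public_subnet_ids(sample)
```

Writing the joined string to `AWS_SUBNET_IDS` and deleting the legacy `AWS_SUBNET_ID` variable are left out of the sketch.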