ci(e2e): move make setup from user-data to a job step #501

Merged: DorianZheng merged 1 commit into main from ci/e2e-setup-in-job on May 10, 2026
@DorianZheng

Summary

Move the heavy build-dependency install (`make setup`) out of EC2 user-data and into the `e2e-tests` job as a normal step. User-data shrinks to "install + start the actions-runner" (~30s) and no longer races the 180s `Wait for runner to come online` poll.

Root cause from run 25629760929: user-data hit the 180s wait timeout mid-`cargo install prek`, and the always-on `Stop E2E Runner` job then sent SIGTERM via `aws ec2 stop-instances`. Because cloud-init's run-once semaphore had already been consumed, user-data did not run again on the next boot, leaving the instance permanently broken (no runner systemd unit was ever installed). Evidence recovered from `/var/log/cloud-init-output.log` on i-07e41b4444e0103dc showed the compilation killed at `Compiling reqwest v0.12.28`.
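
The race above comes down to a bounded poll whose deadline is shorter than the work it waits on. A minimal POSIX shell sketch of such a poll follows; `wait_until` and the `runner_is_online` check mentioned in the comment are hypothetical names, and this is not the workflow's actual step.

```shell
# Hypothetical sketch of a bounded poll; not the workflow's real code.
# wait_until <timeout_seconds> <interval_seconds> <command...>
# Returns 0 as soon as the command succeeds, 1 once the deadline passes.
wait_until() {
  timeout_s=$1
  interval_s=$2
  shift 2
  deadline=$(( $(date +%s) + timeout_s ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    if "$@"; then
      return 0
    fi
    sleep "$interval_s"
  done
  return 1
}

# Assumed usage, with a hypothetical runner_is_online check:
#   wait_until 180 5 runner_is_online || echo "runner did not come online"
```

With a 180s budget and a first-boot install that runs for minutes, a poll of this shape gives up before the runner can register, which matches the failure in the run above.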

Changes

  • `.github/workflows/e2e-test.yml`:
    • User-data: drop `git clone`, `git pull`, `make setup`, and `source ~/.cargo/env`. Keep the KVM module load and the actions-runner download/configure/start. Apt installs trimmed to `curl jq tar`.
    • `e2e-tests`: add a new step, `Install build dependencies (make setup)`, that runs `scripts/setup/setup-ubuntu.sh` before `Run integration tests`.
    • `e2e-tests`: bump `timeout-minutes` 35 → 50 to cover first-run setup on top of test runtime.

Test plan

  • PR labelled `e2e-test` so the workflow runs.
  • First run (fresh instance):
    • `Wait for runner to come online` succeeds well under 180s.
    • `Install build dependencies` step runs `make setup` with output streamed live; does NOT install prek (`CI=true` skip path).
    • `Run integration tests` runs.
    • `Stop E2E Runner` stops the instance.
  • Second run (instance reused via the `STATE=stopped` fast path):
    • Runner re-registers in ~30s after `aws ec2 start-instances`.
    • `Install build dependencies` runs in seconds (idempotent fast-paths).
    • Tests run; instance stops.
  • At rest: exactly one EC2 instance tagged `Name=boxlite-e2e` exists, in stopped state.
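
The at-rest check in the last item could be scripted roughly as follows. This is a sketch, not part of the PR: `at_rest_ok` is a hypothetical helper, and ignoring `terminated` instances (which can stay visible to the API for a while after termination) is an assumption about what "exists" should mean here.

```shell
# Hypothetical helper: reads instance state names (whitespace-separated)
# on stdin and succeeds only when exactly one non-terminated instance
# is present and its state is "stopped".
at_rest_ok() {
  # Split on spaces/tabs, then drop blank lines and terminated instances
  # (assumption: terminated instances do not count as existing).
  states=$(tr -s ' \t' '\n\n' | grep -v -e '^$' -e '^terminated$' || true)
  [ "$states" = "stopped" ]
}

# Assumed usage against the tag named in the test plan:
#   aws ec2 describe-instances \
#     --filters "Name=tag:Name,Values=boxlite-e2e" \
#     --query 'Reservations[].Instances[].State.Name' \
#     --output text | at_rest_ok
```

Keeping the state check separate from the `aws` call makes the rule easy to exercise locally with canned input.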

Cleanup performed

  • Terminated i-07e41b4444e0103dc (the broken instance from the failing run); next workflow run will create a fresh instance with the new user-data.

User-data was running the full `make setup` (apt + rustup +
`cargo install cargo-nextest` + ~250 crates of `cargo install prek`)
on first boot, racing the 180s runner-online poll. The poll timed
out mid-cargo-install (~10 min in), the always-on `Stop E2E Runner`
job called `aws ec2 stop-instances`, SIGTERM killed the compile, and
cloud-init's run-once semaphore left the instance permanently
broken. Direct evidence from the recovered cloud-init-output.log:

  Compiling reqwest v0.12.28
  Received signal 15 resulting in exit. Cause: subprocess.py _try_wait

Move setup out of user-data into the e2e-tests job:

- User-data shrinks to: apt install curl/jq/tar, download the
  actions-runner tarball, configure, install + start the systemd
  unit. Total: ~30s. The 180s wait is now comfortable.
- New "Install build dependencies (make setup)" step runs
  scripts/setup/setup-ubuntu.sh as a normal job step. Output streams
  to the Actions log; failures are visible without SSH/console
  hunting; CI=true skips the prek install per setup-common.sh:488,
  trimming ~10 min from the first-run cost.
- Bump e2e-tests timeout-minutes 35 -> 50 to cover first-run setup
  (~5-10 min) on top of the existing test budget. Subsequent runs
  reuse the persistent EBS volume and finish setup in seconds via
  setup-ubuntu.sh's idempotent fast-paths.
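
The idempotent fast-paths and the `CI=true` skip described above can be pictured with a small sketch. The function names are hypothetical and the real logic in setup-ubuntu.sh and setup-common.sh is not reproduced here; this only illustrates the guard pattern under the assumption that each tool is detected by its command name.

```shell
# Hypothetical guard: succeeds when the named command is NOT on PATH,
# i.e. when an install is still needed.
needs_install() {
  ! command -v "$1" >/dev/null 2>&1
}

# Mirrors the described skip path: never install prek when CI=true.
should_install_prek() {
  [ "${CI:-}" != "true" ] && needs_install prek
}

# Sketch of how guarded steps would make reruns cheap (defined only,
# not invoked here).
install_build_deps_sketch() {
  if needs_install cargo-nextest; then
    cargo install cargo-nextest
  fi
  if should_install_prek; then
    cargo install prek
  fi
}
```

On a reused instance with a persistent volume, every guard finds its tool already present, so the step finishes without compiling anything.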

The previously-broken instance i-07e41b4444e0103dc has already been
terminated; the next workflow run will create a fresh one with the
new (small) user-data and proceed normally.
@DorianZheng DorianZheng merged commit 2daae30 into main May 10, 2026
10 checks passed
@DorianZheng DorianZheng deleted the ci/e2e-setup-in-job branch May 10, 2026 14:41