ci(e2e): move make setup from user-data to a job step #501
Merged
User-data was running the full `make setup` (apt + rustup + `cargo install cargo-nextest` + ~250 crates of `cargo install prek`) on first boot, racing the 180s runner-online poll. The poll timed out mid-cargo-install (~10 min in), the always-on `Stop E2E Runner` job called `aws ec2 stop-instances`, SIGTERM killed the compile, and cloud-init's run-once semaphore left the instance permanently broken. Direct evidence in the recovered cloud-init-output.log:

    Compiling reqwest v0.12.28
    Received signal 15 resulting in exit. Cause: subprocess.py _try_wait

Move setup out of user-data into the e2e-tests job:

- User-data shrinks to: apt install curl/jq/tar, download the actions-runner tarball, configure, install + start the systemd unit. Total: ~30s. The 180s wait is now comfortable.
- New "Install build dependencies (make setup)" step runs scripts/setup/setup-ubuntu.sh as a normal job step. Output streams to the Actions log; failures are visible without SSH/console hunting; CI=true skips the prek install per setup-common.sh:488, trimming ~10 min from the first-run cost.
- Bump e2e-tests timeout-minutes 35 -> 50 to cover first-run setup (~5-10 min) on top of the existing test budget. Subsequent runs reuse the persistent EBS volume and finish setup in seconds via setup-ubuntu.sh's idempotent fast-paths.

The previously-broken instance i-07e41b4444e0103dc has already been terminated; the next workflow run will create a fresh one with the new (small) user-data and proceed normally.
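The race described above comes down to a fixed-deadline poll: whatever must happen before the runner registers has to fit inside the window, or the poll gives up. As a rough illustration only, here is a minimal sketch of such a loop. The `wait_for` helper and the `runner_is_online` probe named in the comment are hypothetical; the workflow's actual polling step is not shown in this PR text.

```shell
#!/bin/sh
# Hypothetical helper (not taken from the workflow): poll a command until it
# succeeds or a deadline passes.
# Usage: wait_for <timeout_seconds> <interval_seconds> <command...>
wait_for() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline
  deadline=$(( $(date +%s) + timeout ))
  # Run the probe; on failure, give up if the deadline has passed,
  # otherwise sleep and try again.
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
  return 0
}

# Example (assumed probe name): wait_for 180 5 runner_is_online
```

With roughly 30s of user-data work, a 180s deadline leaves ample slack; with a multi-minute compile ahead of runner registration, the same loop returns 1 long before the runner can come online.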
Summary

Move the heavy build-dependency install (`make setup`) out of EC2 user-data and into the `e2e-tests` job as a normal step. User-data shrinks to "install + start the actions-runner" (~30s) and no longer races the 180s `Wait for runner to come online` poll.

Root cause from run 25629760929: the 180s wait timed out while user-data was mid-`cargo install prek`, the always-on `Stop E2E Runner` job sent SIGTERM via `aws ec2 stop-instances`, cloud-init's run-once semaphore was consumed, and the instance was left permanently broken (no runner systemd unit ever installed). Recovered evidence in `/var/log/cloud-init-output.log` from `i-07e41b4444e0103dc` showed compilation killed at `Compiling reqwest v0.12.28`.

Changes

`.github/workflows/e2e-test.yml`:
- User-data: drop `git clone`, `git pull`, `make setup`, `source ~/.cargo/env`. Keep KVM module load + actions-runner download/configure/start. Apt installs trimmed to `curl jq tar`.
- New step `Install build dependencies (make setup)` running `scripts/setup/setup-ubuntu.sh` before `Run integration tests`.
- Bump `timeout-minutes` 35 → 50 to cover first-run setup on top of test runtime.

Test plan

- `e2e-test` so the workflow runs.
- `Wait for runner to come online` succeeds well under 180s.
- `Install build dependencies` step runs `make setup` with output streamed live; does NOT install `prek` (CI=true skip path).
- `Run integration tests` runs.
- `Stop E2E Runner` stops the instance.
- Subsequent run (`STATE=stopped` fast path): `aws ec2 start-instances`.
- `Install build dependencies` runs in seconds (idempotent fast-paths).
- Instance `Name=boxlite-e2e` exists, in `stopped` state.

Cleanup performed

- Terminated `i-07e41b4444e0103dc` (the broken instance from the failing run); next workflow run will create a fresh instance with the new user-data.
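The "idempotent fast-paths" and the `CI=true` skip mentioned above can be pictured roughly as follows. This is a sketch under assumptions, not the contents of `setup-ubuntu.sh` or `setup-common.sh`: the function names `ensure_tool` and `install_dev_tools` are hypothetical, and the real scripts may detect installed tools differently.

```shell
#!/bin/sh
# Hypothetical sketch: run an installer only when the tool is missing, so a
# second run on a persistent volume finishes almost immediately.
# Usage: ensure_tool <binary_name> <install command...>
ensure_tool() {
  local name="$1"
  shift
  if command -v "$name" >/dev/null 2>&1; then
    echo "skip: $name already installed"
    return 0
  fi
  echo "install: $name"
  "$@"
}

# Hypothetical sketch: developer-only tools are skipped entirely when CI=true.
install_dev_tools() {
  if [ "${CI:-}" = "true" ]; then
    echo "skip: dev tools (CI=true)"
    return 0
  fi
  ensure_tool prek cargo install prek
}
```

On a fresh instance every installer runs once; on later runs each `ensure_tool` call reduces to a `command -v` lookup, which matches the "runs in seconds" expectation in the test plan.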