ci(e2e): make runner instance persistence non-fatal #499
Merged
Run 25603658679 surfaced a self-destructing CI loop: the workflow created a fresh EC2 runner, then terminated it because the GitHub App token couldn't write the EC2_E2E_INSTANCE_ID Actions Variable. Tag-based discovery already handles rediscovery on the next run, so the save is an optimization, not a correctness invariant.

- Workflow: log a warning and continue when `gh variable set` fails, instead of terminating the freshly-launched instance.
- Setup script: request `actions_variables: write` in the App manifest so newly-created Apps can persist the variable from day one (correct slug per the API permissions docs; `variables` is rejected).
- Setup script: pre-check that the App slug isn't already taken before triggering manifest creation, with an interactive flow that opens the delete URL and re-probes after the user confirms. This replaces the silent 120s OAuth-callback timeout that obscured the real failure.
- Setup script: bump callback wait from 2 min to 5 min, print the expected next click, and surface common-cause guidance on timeout.

The existing GitHub App on boxlite-ai/boxlite still needs Variables: write granted by its owner via the App settings UI; until then the workflow gracefully falls back to tag-based discovery.
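The non-fatal save described in the workflow bullet could look roughly like the sketch below. It is illustrative only, not the workflow's actual step: the function name `persist_instance_id` and the wording of the warning are assumptions, and it presumes `gh` is already authenticated with the App token.

```shell
#!/usr/bin/env bash
# Sketch only: best-effort persistence of the runner instance ID.
# persist_instance_id is a hypothetical helper, not taken from the workflow.

persist_instance_id() {
  local instance_id="$1" repo="$2"
  # gh exits non-zero when the token lacks permission (e.g. a 403).
  if gh variable set EC2_E2E_INSTANCE_ID --body "$instance_id" -R "$repo"; then
    echo "Saved EC2_E2E_INSTANCE_ID=$instance_id"
  else
    # GitHub Actions workflow command: annotate the run instead of failing it.
    echo "::warning::Could not save EC2_E2E_INSTANCE_ID; the next run will rediscover the instance by tag Name=boxlite-e2e"
  fi
  # Always succeed: a failed save must not fail the job or terminate the instance.
  return 0
}
```

A step would call it as `persist_instance_id "$INSTANCE_ID" boxlite-ai/boxlite`, where `INSTANCE_ID` is assumed to be set by the earlier launch step. Returning 0 unconditionally is what keeps the save an optimization rather than a correctness requirement.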
Summary
- Workflow: persisting `EC2_E2E_INSTANCE_ID` is now best-effort. Tag fallback (`Name=boxlite-e2e`) handles rediscovery, so a 403 from `gh variable set` no longer terminates a healthy runner. Fixes the self-destruct loop in run 25603658679.
- Setup script: requests `actions_variables: write` in the App manifest so newly-created Apps can write Actions Variables, pre-checks that the App slug is free before triggering manifest creation (avoids a silent 120s timeout when the slug is taken), and bumps the OAuth callback wait from 2 min to 5 min with clearer instructions.

Test plan
- […] `e2e-test` so the workflow runs.
- `Start E2E Runner` job reaches `Wait for runner to come online` instead of dying at `gh variable set`.
- Instance is `running` (not `terminated`) after `start-runner`: `aws ec2 describe-instances --filters Name=tag:Name,Values=boxlite-e2e --query 'Reservations[*].Instances[*].[InstanceId,State.Name]'`.
- Subsequent runs hit the `STATE=running` / `STATE=stopped` fast path and skip the create-new branch.
- After granting `Variables: write` on the existing App, confirm `EC2_E2E_INSTANCE_ID` appears in `gh variable list -R boxlite-ai/boxlite` after a successful run (until then, expect a `::warning::` instead).
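For reference while testing, the tag-based rediscovery and the branch it selects might be sketched as follows. This is an assumption-laden illustration, not the workflow's code: `discover_runner` and `choose_action` are hypothetical names, the `instance-state-name` filter is added here for the sketch, and the `reuse` / `start` / `create` labels stand in for whatever the real steps do.

```shell
#!/usr/bin/env bash
# Sketch only: tag-based rediscovery and the branch it would select.
# Helper names and action labels are hypothetical.

# Print "<instance-id><TAB><state>" for the first running or stopped instance
# tagged Name=boxlite-e2e, or nothing if no such instance exists.
discover_runner() {
  aws ec2 describe-instances \
    --filters "Name=tag:Name,Values=boxlite-e2e" \
              "Name=instance-state-name,Values=running,stopped" \
    --query 'Reservations[].Instances[].[InstanceId,State.Name]' \
    --output text | head -n 1
}

# Map a discovered state to the branch the workflow would take.
choose_action() {
  case "$1" in
    running) echo "reuse" ;;   # fast path: runner instance is already up
    stopped) echo "start" ;;   # fast path: start the existing instance
    *)       echo "create" ;;  # nothing usable found: create-new branch
  esac
}
```

A caller would read the two fields from `discover_runner` and pass the state to `choose_action`; an empty result falls through to `create`, which matches the expectation that only a missing instance triggers a fresh launch.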