
ci(e2e): make runner instance persistence non-fatal #499

Merged
DorianZheng merged 1 commit into main from ci/e2e-runner-non-fatal-persist
May 10, 2026


@DorianZheng

Member

Summary

  • Workflow: persistence of EC2_E2E_INSTANCE_ID is now best-effort. Tag fallback (Name=boxlite-e2e) handles rediscovery, so a 403 from gh variable set no longer terminates a healthy runner. Fixes the self-destruct loop in run 25603658679.
  • Setup script: requests actions_variables: write in the App manifest so newly created Apps can write Actions Variables; pre-checks that the App slug is free before triggering manifest creation, which avoids a silent 120s timeout when the slug is taken; and raises the OAuth callback wait from 2 min to 5 min, with clearer instructions.

Test plan

  • PR labelled e2e-test so the workflow runs.
  • Start E2E Runner job reaches Wait for runner to come online instead of dying at gh variable set.
  • EC2 instance is left in running (not terminated) after start-runner: aws ec2 describe-instances --filters Name=tag:Name,Values=boxlite-e2e --query 'Reservations[*].Instances[*].[InstanceId,State.Name]'.
  • Re-run the workflow without changes — subsequent runs should hit the STATE=running/STATE=stopped fast path and skip the create-new branch.
  • Once the App owner grants Variables: write on the existing App, confirm EC2_E2E_INSTANCE_ID appears in gh variable list -R boxlite-ai/boxlite after a successful run (until then, expect a ::warning:: instead).
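
The rediscovery path that the test plan exercises can be sketched roughly as follows. This is an illustration only, not the workflow's actual step: the function name `discover_instance_id`, the `AWS_BIN` override, and the `instance-state-name` filter are assumptions added here, while the `Name=boxlite-e2e` tag comes from this PR.

```shell
# Hypothetical helper: prefer the saved Actions Variable value, and fall
# back to tag-based discovery when it is empty.
# AWS_BIN is an assumed override so the function can be exercised
# without real AWS credentials.
discover_instance_id() {
  saved_id="${1:-}"

  # Fast path: the variable was persisted by an earlier run.
  if [ -n "$saved_id" ]; then
    printf '%s\n' "$saved_id"
    return 0
  fi

  # Fallback: look the instance up by its Name tag. The state filter
  # (an assumption) skips instances that are already terminated.
  "${AWS_BIN:-aws}" ec2 describe-instances \
    --filters "Name=tag:Name,Values=boxlite-e2e" \
              "Name=instance-state-name,Values=running,stopped" \
    --query 'Reservations[].Instances[].InstanceId' \
    --output text
}
```

With `--output text`, several matching instances come back on one line, so a real workflow would probably want to pick one explicitly or fail when more than one matches.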

Run 25603658679 surfaced a self-destructing CI loop: the workflow
created a fresh EC2 runner, then terminated it because the GitHub App
token couldn't write the EC2_E2E_INSTANCE_ID Actions Variable. Tag-
based discovery already handles rediscovery on the next run, so the
save is an optimization, not a correctness invariant.

- Workflow: log a warning and continue when `gh variable set` fails,
  instead of terminating the freshly-launched instance.
- Setup script: request `actions_variables: write` in the App manifest
  so newly-created Apps can persist the variable from day one (correct
  slug per the API permissions docs; `variables` is rejected).
- Setup script: pre-check that the App slug isn't already taken before
  triggering manifest creation, with an interactive flow that opens the
  delete URL and re-probes after the user confirms — replaces the
  silent 120s OAuth-callback timeout that obscured the real failure.
- Setup script: bump callback wait from 2 min to 5 min, print the
  expected next click, and surface common-cause guidance on timeout.
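
One way the slug pre-check could be approximated is a probe against GitHub's public "Get an app" REST endpoint. This is a hedged sketch and not necessarily what the setup script does: the function name `app_slug_is_free` and the `CURL_BIN` override are hypothetical, and the "404 means free" rule is only a heuristic, since a private App may also answer 404 to an unauthenticated request.

```shell
# Hypothetical helper: succeeds when the REST API reports no App under
# the given slug. CURL_BIN is an assumed override used for testing.
app_slug_is_free() {
  slug="$1"

  # -s: quiet, -o /dev/null: discard the body, -w: print only the status.
  status=$("${CURL_BIN:-curl}" -s -o /dev/null -w '%{http_code}' \
    "https://api.github.com/apps/${slug}")

  # Treat 404 as "slug appears to be free"; anything else as "taken or unknown".
  [ "$status" = "404" ]
}
```

A caller might run `app_slug_is_free "$slug" || echo "App slug looks taken; delete the existing App first"` before starting the manifest flow, so the failure shows up immediately rather than as a callback timeout.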

The existing GitHub App on boxlite-ai/boxlite still needs Variables:
write granted by its owner via the App settings UI; until then the
workflow gracefully falls back to tag-based discovery.
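
The best-effort save described above might look roughly like this in the workflow's shell step. It is a minimal sketch under stated assumptions: the function name `persist_instance_id` and the `GH_BIN` override are hypothetical, and the warning text is illustrative rather than the workflow's exact message.

```shell
# Hypothetical helper: try to save the instance ID as an Actions
# Variable, but never fail the job if the write is rejected.
# GH_BIN is an assumed override used for testing.
persist_instance_id() {
  instance_id="$1"

  if ! "${GH_BIN:-gh}" variable set EC2_E2E_INSTANCE_ID \
      --body "$instance_id" --repo boxlite-ai/boxlite; then
    # Workflow command: surfaces as a warning annotation on the run.
    echo "::warning::Could not persist EC2_E2E_INSTANCE_ID; relying on tag-based discovery (Name=boxlite-e2e)"
  fi

  # The save is an optimization, so always report success.
  return 0
}
```

Returning 0 unconditionally is the design point: a rejected write (for example the 403 seen in run 25603658679) downgrades to a warning, and the freshly launched instance stays up.
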
@DorianZheng DorianZheng merged commit bdf888f into main May 10, 2026
10 checks passed
@DorianZheng DorianZheng deleted the ci/e2e-runner-non-fatal-persist branch May 10, 2026 04:21
@DorianZheng DorianZheng restored the ci/e2e-runner-non-fatal-persist branch May 10, 2026 04:23