ci(e2e): rescue logs, disk precheck, attempt-namespaced artifacts #508

Merged: DorianZheng merged 1 commit into main from ci/e2e-rescue-logs-disk-precheck on May 12, 2026

Conversation

@DorianZheng

Member

Summary

  • Defense-in-depth after run 25726812554 lost all debug logs: actions-runner Worker process crashed mid-step with IOException: No space left on device : '/opt/actions-runner/_diag/Worker_*.log', and if: failure() on the upload step never fired because the Worker is what evaluates step conditions.
  • Tees integration-test output to /var/log/boxlite-ci/<run_id>-<run_attempt>/ (persists across Worker death), adds a Stage+Upload pair to rescue prior-run logs on the next run, and switches the upload guard to failure() || cancelled() so host reboot / concurrency cancel also ship logs.
  • Disk-space precheck validates df output is numeric before the -lt 20 compare (the original [ "" -lt 20 ] silently passed the safety check). All log dirs and artifact names are namespaced by <run_id>-<run_attempt> (reruns reuse run_id, so attempt-naming prevents clobber).
  • Pre-flight cleanup prunes only /opt/actions-runner/_diag/Worker_*.log and Runner_*.log older than 1 day. Build/image caches (~/.boxlite/, ~/.cargo/, target/) are never touched — they are why this runner is persistent.
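The numeric validation in the disk-space precheck can be sketched as follows. This is a minimal illustration, not the workflow's actual step: the function name `check_free_gb`, its return codes, and its messages are hypothetical, and only the 20 GB threshold and the validate-before-`-lt` ordering come from the summary above.

```shell
# Hypothetical helper (not taken from the workflow).
# Usage: check_free_gb <raw_value> <min_gb>
# Returns 0 when raw_value is an integer >= min_gb,
#         1 when it is an integer below min_gb,
#         2 when it is empty or non-numeric (df parse failure).
check_free_gb() {
  raw="$1"
  min_gb="$2"

  # Reject empty or non-digit input BEFORE the numeric compare.
  # Without this, [ "" -lt 20 ] errors out and an `if` treats it as "not low".
  case "$raw" in
    ''|*[!0-9]*)
      echo "precheck: could not parse free space from df: '$raw'" >&2
      return 2
      ;;
  esac

  if [ "$raw" -lt "$min_gb" ]; then
    echo "precheck: only $raw GB free, need $min_gb GB" >&2
    return 1
  fi

  echo "precheck: $raw GB free"
}
```

One way to feed it, assuming GNU coreutils `df` on the runner, is `check_free_gb "$(df -BG --output=avail / | tail -n 1 | tr -dc '0-9')" 20`. Keeping the parse failure on a distinct return code makes it harder to confuse with a genuinely full disk.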

Co-requisite ops change (already applied, not in this PR): AmazonSSMManagedInstanceCore attached to the boxlite-e2e-runner IAM role for SSM-based runner debugging.

Test plan

  • After merge, push-to-main event triggers e2e-test (this workflow's on: push includes .github/workflows/e2e-test.yml in the path filter).
  • Confirm Disk-space precheck step runs and reports ~419 GB free (post 50→500 GB EBS resize).
  • Confirm Stage prior-run logs for rescue runs and produces an e2e-test-logs-rescued-<rid>-<att> artifact (empty for a clean run; populated if a prior run died mid-step).
  • Force a failure (or rerun 25726812554) and confirm e2e-test-logs-<rid>-<att> artifact contains /var/log/boxlite-ci/<rid>-<att>/integration.log and does NOT duplicate the rescue dir contents.
  • PR path-filter caveat: this workflow's pull_request paths don't include .github/workflows/e2e-test.yml, so this PR will NOT trigger e2e-test on pull_request events. Full validation happens on the next push to main.

Run 25726812554 crashed the actions-runner Worker with ENOSPC while
writing /opt/actions-runner/_diag — and because the Worker died
mid-step, `if: failure()` on the Upload step never fired, losing the
debug logs entirely.

Defense-in-depth so worker death never destroys the debug surface again:

- Stage prior-run logs for rescue: moves /var/log/boxlite-ci/<id>-<att>/ from
  any dead-mid-step prior run into /tmp/boxlite-rescue/ for upload.
- Upload rescued prior-run logs: ships them as a separate artifact.
- Pre-flight runner cleanup: prunes /opt/actions-runner/_diag only
  (build/image caches stay — that's why this runner is persistent).
- Disk-space precheck: fails loud below 20GB free. Captures raw `df`
  output and validates `^[0-9]+$` before the `-lt` compare so a parse
  failure can't silently pass the safety check or misattribute itself
  as "Only  GB free".
- Run integration tests: tees output to /var/log/boxlite-ci/<id>-<att>/
  so logs survive Worker death.
- Upload-on-failure guard widened from `failure()` to
  `failure() || cancelled()` so concurrency cancellation / host reboot
  also ship logs. Path glob excludes /tmp/boxlite-rescue/ to avoid
  duplicating the rescue artifact's content.
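The staging step from the list above might look roughly like the sketch below. It assumes that any directory left under the log root, other than the current attempt's, belongs to a run that died before uploading; the commit message does not say how the real step detects such runs. The function name `stage_prior_logs` and its arguments are hypothetical.

```shell
# Hypothetical helper (not taken from the workflow).
# Usage: stage_prior_logs <log_root> <rescue_dir> <current_run_dir_name>
# Moves every leftover per-run directory except the current one into rescue_dir.
stage_prior_logs() {
  log_root="$1"
  rescue_dir="$2"
  current="$3"

  mkdir -p "$rescue_dir"

  for dir in "$log_root"/*/; do
    dir="${dir%/}"                 # drop the trailing slash from the glob
    if [ ! -d "$dir" ]; then
      continue                     # glob matched nothing
    fi
    name="$(basename "$dir")"
    if [ "$name" = "$current" ]; then
      continue                     # never move the running attempt's logs
    fi
    mv "$dir" "$rescue_dir/$name"
  done
}
```

With the paths named in this PR, a call would be `stage_prior_logs /var/log/boxlite-ci /tmp/boxlite-rescue "<run_id>-<run_attempt>"`. A clean history leaves the rescue directory empty, which matches the "empty for a clean run" expectation in the test plan.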

All log dirs and artifact names are namespaced by
`<run_id>-<run_attempt>` because GitHub Actions reruns reuse run_id
and would otherwise clobber the prior attempt's logs.
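A minimal sketch of attempt-namespaced logging with `tee` is shown below. `GITHUB_RUN_ID` and `GITHUB_RUN_ATTEMPT` are the default environment variables GitHub Actions sets for each job; the function name `run_logged` and the `exit_code` file are assumptions made for illustration and are not described in this PR.

```shell
# Hypothetical helper (not taken from the workflow).
# Usage: run_logged <log_root> <command> [args...]
# Tees the command's output into <log_root>/<run_id>-<run_attempt>/integration.log
# and returns the command's own exit status.
run_logged() {
  log_root="$1"
  shift

  # Reruns keep the same run id, so the attempt number keeps directories distinct.
  log_dir="$log_root/${GITHUB_RUN_ID:?}-${GITHUB_RUN_ATTEMPT:?}"
  mkdir -p "$log_dir"

  # A pipeline reports tee's status, so the command's status is saved to a file
  # (an assumption of this sketch; `set -o pipefail` is an alternative in bash).
  {
    rc=0
    "$@" 2>&1 || rc=$?
    echo "$rc" > "$log_dir/exit_code"
  } | tee "$log_dir/integration.log"

  return "$(cat "$log_dir/exit_code")"
}
```

Because the log is written to disk as the command runs, it remains on the host even if the process that would have uploaded it never gets the chance.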

Refs: run 25726812554 on main.
@DorianZheng added the e2e-test label (Triggers E2E integration tests on self-hosted runner) on May 12, 2026
@DorianZheng merged commit be75a4a into main on May 12, 2026
10 checks passed
@DorianZheng deleted the ci/e2e-rescue-logs-disk-precheck branch on May 12, 2026 11:44