ci(e2e): rescue logs, disk precheck, attempt-namespaced artifacts #508
Merged
Run 25726812554 crashed the actions-runner Worker with ENOSPC while writing /opt/actions-runner/_diag. Because the Worker died mid-step, `if: failure()` on the Upload step never fired, and the debug logs were lost entirely.

Defense-in-depth so worker death never destroys the debug surface again:

- Stage prior-run logs for rescue: moves /var/log/boxlite-ci/<id>/ from any dead-mid-step prior run into /tmp/boxlite-rescue/ for upload.
- Upload rescued prior-run logs: ships them as a separate artifact.
- Pre-flight runner cleanup: prunes /opt/actions-runner/_diag only (build/image caches stay; they are why this runner is persistent).
- Disk-space precheck: fails loud below 20GB free. Captures raw `df` output and validates `^[0-9]+$` before the `-lt` compare so a parse failure can't silently pass the safety check or misattribute itself as "Only GB free".
- Run integration tests: tees output to /var/log/boxlite-ci/<id>-<att>/ so logs survive Worker death.
- Upload-on-failure guard: widened from `failure()` to `failure() || cancelled()` so concurrency cancellation / host reboot also ship logs. The path glob excludes /tmp/boxlite-rescue/ to avoid duplicating the rescue artifact's content.

All log dirs and artifact names are namespaced by `<run_id>-<run_attempt>` because GitHub Actions reruns reuse run_id and would otherwise clobber the prior attempt's logs.

Refs: run 25726812554 on main.
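To make the disk-space precheck concrete, here is a minimal sketch of the validate-then-compare idea in POSIX shell. It is not the workflow's actual step: the function names `free_gb` and `check_free_gb`, the `df -Pk` invocation, and the messages are assumptions for illustration. The `case` pattern rejects the same inputs as the `^[0-9]+$` check. It matters because `[ "" -lt 20 ]` exits with an error status rather than "true", so an `if` built on it skips the failure branch.

```shell
# Sketch only: free_gb and check_free_gb are hypothetical names, not the workflow's.

# Free space on the filesystem holding $1, in whole GB (POSIX df output, 1K blocks).
free_gb() {
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1048576) }'
}

# Validate the raw value before comparing it.
# Returns 0 if there is enough space, 1 if below the threshold,
# and 2 if the value is empty or not a plain non-negative integer.
check_free_gb() {
  raw="$1"
  min="$2"
  case "$raw" in
    '' | *[!0-9]*)
      echo "precheck: could not parse free space from df: '$raw'" >&2
      return 2
      ;;
  esac
  if [ "$raw" -lt "$min" ]; then
    echo "precheck: only ${raw} GB free, need at least ${min} GB" >&2
    return 1
  fi
  echo "precheck: ${raw} GB free"
}

# Example use in a step (the 20 GB threshold is taken from the PR description):
#   check_free_gb "$(free_gb /)" 20 || exit 1
```

Keeping the parse failure (status 2) separate from the low-space failure (status 1) means a broken `df` parse is reported as such instead of as a misleading free-space figure.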
Summary
- Run 25726812554 crashed the actions-runner Worker with `IOException: No space left on device : '/opt/actions-runner/_diag/Worker_*.log'`, and `if: failure()` on the upload step never fired because the Worker is what evaluates step conditions.
- Tees integration-test output to `/var/log/boxlite-ci/<run_id>-<run_attempt>/` (persists across Worker death), adds a Stage+Upload pair to rescue prior-run logs on the next run, and switches the upload guard to `failure() || cancelled()` so host reboot / concurrency cancel also ship logs.
- Adds a disk-space precheck that validates that `df` output is numeric before the `-lt 20` compare (the original `[ "" -lt 20 ]` silently passed the safety check). All log dirs and artifact names are namespaced by `<run_id>-<run_attempt>` (reruns reuse `run_id`, so attempt-naming prevents clobber).
- Pre-flight cleanup prunes only `/opt/actions-runner/_diag/Worker_*.log` and `Runner_*.log` older than 1 day. Build/image caches (`~/.boxlite/`, `~/.cargo/`, `target/`) are never touched; they are why this runner is persistent.

Co-requisite ops change (already applied, not in this PR):

- `AmazonSSMManagedInstanceCore` attached to the `boxlite-e2e-runner` IAM role for SSM-based runner debugging.

Test plan

- e2e-test triggers on push to main (`on: push` includes `.github/workflows/e2e-test.yml` in the path filter).
- `Disk-space precheck` step runs and reports `~419 GB free` (post 50→500 GB EBS resize).
- `Stage prior-run logs for rescue` runs and produces an `e2e-test-logs-rescued-<rid>-<att>` artifact (empty for a clean run; populated if a prior run died mid-step).
- `e2e-test-logs-<rid>-<att>` artifact contains `/var/log/boxlite-ci/<rid>-<att>/integration.log` and does NOT duplicate the rescue dir contents.
- `pull_request` paths don't include `.github/workflows/e2e-test.yml`, so this PR will NOT trigger e2e-test under PR; full validation happens on the next push to main.
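As an illustration of the rescue staging and attempt namespacing described in this PR, the sketch below shows one way the two could fit together in POSIX shell. It is not the workflow's actual step. The function names and arguments are hypothetical, and it assumes that any directory left under the log root, other than the current attempt's, belongs to a prior run that died mid-step. `GITHUB_RUN_ID` and `GITHUB_RUN_ATTEMPT` are the standard GitHub Actions environment variables.

```shell
# Sketch only: function names and arguments are hypothetical.

# Build the per-attempt log directory path: <root>/<run_id>-<run_attempt>.
attempt_log_dir() {
  printf '%s/%s-%s\n' "$1" "$2" "$3"
}

# Move every leftover log directory except the current attempt's into the
# rescue directory, so a later step can upload it as a separate artifact.
# Assumption: anything still present under the log root is from a prior run.
stage_rescue_logs() {
  log_root="$1"
  rescue_dir="$2"
  current="$3" # "<run_id>-<run_attempt>" of the attempt now running
  mkdir -p "$rescue_dir"
  for dir in "$log_root"/*; do
    # Skip non-directories, including the unexpanded glob when nothing matches.
    if [ ! -d "$dir" ]; then continue; fi
    name=$(basename "$dir")
    # Leave the current attempt's logs in place.
    if [ "$name" = "$current" ]; then continue; fi
    mv "$dir" "$rescue_dir/$name"
  done
}

# Example use inside a workflow step:
#   current="${GITHUB_RUN_ID}-${GITHUB_RUN_ATTEMPT}"
#   stage_rescue_logs /var/log/boxlite-ci /tmp/boxlite-rescue "$current"
#   mkdir -p "$(attempt_log_dir /var/log/boxlite-ci "$GITHUB_RUN_ID" "$GITHUB_RUN_ATTEMPT")"
```

Because the directory name includes the attempt number, a rerun of the same `run_id` writes to a fresh directory, and the previous attempt's directory is treated as rescuable rather than overwritten.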