chore(infra): self-hosted runner disk-guard automation #1001
Merged
Addresses 2026-04-22 outage where all 16 intel-clean-room runners went offline because / on intel hit 100% (3.5T/3.6T). Runner diag logs couldn't be written, so GitHub marked runners offline.

Two layers of defence:
- pre-job hook: aggressive target/ prune when disk >= 85%
- nightly timer: prune target/ older than 7 days

Scripts are runner-host-agnostic; the install path and deployment recipe are in scripts/runner-infra/README.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
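The two layers described in the commit message above could look roughly like the following. This is a minimal sketch, not the script from this PR: the function names `disk_used_pct`, `over_threshold`, and `stale_targets` are hypothetical, the `_work/<repo>/target` layout is taken from the PR summary, and GNU `find` and `df` are assumed. The functions only report; deletion is left to the caller.

```shell
# Sketch only: function names are hypothetical, not taken from the PR's scripts.

# Print the usage percentage (integer, no % sign) of the filesystem holding $1.
disk_used_pct() {
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when usage percentage $1 is at or above threshold $2 (default 85).
over_threshold() {
    [ "$1" -ge "${2:-85}" ]
}

# Print target/ dirs under runner root $1 (assumed layout: <root>/_work/<repo>/target)
# whose mtime is more than $2 days old. Listing only; nothing is deleted here.
stale_targets() {
    find "$1/_work" -mindepth 2 -maxdepth 2 -type d -name target \
        -mmin "+$(( $2 * 24 * 60 ))" -print
}
```

A pre-job hook might call `over_threshold "$(disk_used_pct /)" 85` and prune only when it succeeds, while the nightly path would feed `stale_targets <runner-root> 7` into its deletion step.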
Previously `--pre-job` unconditionally rm -rf'd every target/ across the shared host whenever disk hit 85%, including target/ dirs on sibling runners that were mid-compile. This produced `rlib: No such file or directory` errors for hours whenever multiple runners ran simultaneously (observed across main CI and every cascade PR on 2026-04-22/23).

Root cause: unconditional prune has no notion of "owned-by-active-job".

Fix: before pruning each target/ dir, check for a Runner.Worker process whose cwd is inside that runner's tree. Skip if so.

- `runner_has_active_worker` greps /proc/<pid>/cwd of Runner.Worker PIDs and matches them against the target/'s owning runner dir
- `runner_dir_of` normalizes a target path to its /home/noah/data/actions-runner-N root via sed
- On a shared host with ≥2 runners active, only idle runners' target/ dirs are pruned; actives keep their own target/ for the duration of their job

Acceptable degenerate case: if ALL 16 runners happen to be active, prune frees 0 bytes. That's strictly better than clobbering in-progress builds.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
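The active-worker check described above might be sketched as follows. `runner_dir_of` and `runner_has_active_worker` are the names given in the commit message, but the bodies here are assumptions about how they could work, not the PR's code. `any_cwd_inside` is a hypothetical helper split out so the matching logic can be exercised against a fake proc tree. `pgrep` and `sed -E` are assumed to be available.

```shell
# Sketch only: bodies are illustrative guesses, not the PR's implementation.

# Map a target path to its owning runner root by cutting at /_work/.
runner_dir_of() {
    printf '%s\n' "$1" | sed -E 's#^(.*/actions-runner[^/]*)/_work/.*$#\1#'
}

# Hypothetical helper: succeed if any PID read from stdin has a cwd at or
# below runner dir $1. $2 is the proc root (default /proc).
any_cwd_inside() {
    local runner_dir="$1"
    local proc_root="${2:-/proc}"
    local pid cwd
    while read -r pid; do
        cwd=$(readlink "$proc_root/$pid/cwd" 2>/dev/null) || continue
        # The trailing slash keeps actions-runner-1 from matching actions-runner-10.
        case "$cwd/" in
            "$runner_dir"/*) return 0 ;;
        esac
    done
    return 1
}

# Succeed if a Runner.Worker process is running from inside runner dir $1.
runner_has_active_worker() {
    pgrep -f Runner.Worker | any_cwd_inside "$1"
}
```

A prune loop would then skip any target/ for which `runner_has_active_worker "$(runner_dir_of "$target")"` succeeds.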
This was referenced Apr 23, 2026
noahgift added a commit that referenced this pull request May 14, 2026
The disk-guard added in #1001 walked only /home/noah/data/actions-runner*/_work/*/target/: runner-workspace target dirs totalling ~75G across 8 runners. The actual runner-disk-fill source that took intel offline on 2026-04-23 was /mnt/nvme-raid0/targets/aprender-ci/*: per-PR bind-mount target dirs from ci.yml's task-#134 isolation, holding 1.9T including a 359G orphan `debug/` dir from pre-isolation era. Disk-guard never touched them.

Adds new BIND_MOUNT_ROOTS (default `/mnt/nvme-raid0/targets/aprender-ci`) and a prune_bind_mount_target_roots() helper:

- Always prunes `debug/` subdir (orphan, no current workflow bind-mounts it).
- Prunes PR# subdirs stale past a minute threshold (nightly: STALE_DAYS×24×60 min; pre-job: 60-min floor so fresh in-flight dirs survive full-disk recovery).
- Preserves `main` (push-to-main CI reuses it).

Space-separated BIND_MOUNT_ROOTS env var lets the same script cover sibling fleets (sovereign-ci-paiml-mcp-agent-toolkit etc.) without code changes.

Deployed to intel 2026-04-23T12:58Z; nightly dry-run confirmed no unexpected prune candidates under the new path.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
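The selection rules for bind-mount roots listed above could be expressed along these lines. This is a sketch under stated assumptions, not the body of `prune_bind_mount_target_roots()`: the names `stale_minutes` and `bind_mount_prune_candidates` are hypothetical, the pre-job threshold is assumed to be a fixed 60 minutes, and GNU `find` is assumed. Candidates are only printed, which mirrors a dry run.

```shell
# Sketch only: function names are hypothetical and nothing is deleted here.

# Print the stale threshold in minutes for mode $1 ("pre-job" or "nightly"),
# given STALE_DAYS as $2. Assumes pre-job uses a fixed 60 minutes.
stale_minutes() {
    case "$1" in
        pre-job) echo 60 ;;
        *)       echo $(( $2 * 24 * 60 )) ;;
    esac
}

# Print prune candidates directly under bind-mount root $1, using a stale
# threshold of $2 minutes.
bind_mount_prune_candidates() {
    local root="$1"
    local stale_min="$2"
    # debug/ is always a candidate, regardless of age.
    if [ -d "$root/debug" ]; then
        printf '%s\n' "$root/debug"
    fi
    # Any other top-level dir is a candidate once older than the threshold;
    # main is never listed.
    find "$root" -mindepth 1 -maxdepth 1 -type d \
        ! -name main ! -name debug -mmin "+$stale_min" -print
}
```

Because BIND_MOUNT_ROOTS is space-separated, a caller could loop with `for root in $BIND_MOUNT_ROOTS` and pass each root to `bind_mount_prune_candidates "$root" "$(stale_minutes nightly "$STALE_DAYS")"`.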
Summary
Prevents the `/` = 100% full failure class that took all 16 `intel-clean-room-*` runners offline on 2026-04-22 (diagnosed via `gh api /orgs/paiml/actions/runners` showing 16/16 offline; root cause: `/` at 3.5T/3.6T → runner `_diag` logs unwritable → GitHub marks runners offline).

Two-layer defence installed on intel as of this commit:

- Pre-job hook (`runner-pre-job.sh`, wired via each runner's existing `ACTIONS_RUNNER_HOOK_JOB_STARTED`): when `/` ≥ 85%, aggressively prune `_work/*/target/` before the job starts. Also retains the prior root-owned-file chown logic.
- Nightly timer (`runner-disk-guard.timer` → `.service`): at 04:00 local, prune any `_work/*/target/` untouched for ≥ 7 days.

Emergency recovery (2026-04-22) freed 1.3 TB on intel by stopping all 16 runners, `rm -rf _work`, restarting. Without this automation the next fill-up would repeat that outage.

Test plan

- … `online` on GitHub after recovery + script install
- `systemctl list-timers` shows `runner-disk-guard.timer` next fire at 04:15 UTC (04:00 + randomized delay)
- `.env` already references `/usr/local/bin/runner-pre-job.sh` (verified on all 16)
- … `logger -t runner-disk-guard` in syslog)

🤖 Generated with Claude Code