chore(infra): extend disk-guard to cover bind-mount target roots #1026

Merged
noahgift merged 35 commits into main from chore/disk-guard-bind-mount-coverage
May 14, 2026
@noahgift

Contributor

Summary

Root cause timeline

Fix

New helper `prune_bind_mount_target_roots` walks each root in `$BIND_MOUNT_ROOTS` (default `/mnt/nvme-raid0/targets/aprender-ci`):

  • Always removes `debug/` subdir (orphan from pre-isolation; no current workflow mounts it)
  • In nightly mode, removes PR# subdirs older than `STALE_DAYS` days
  • In pre-job mode, removes PR# subdirs older than 60 min (aggressive disk recovery), so fresh in-flight dirs survive
  • Always preserves `main` subdir (push-to-main CI reuses it)

Space-separated `BIND_MOUNT_ROOTS` env var lets sibling fleets (sovereign-ci-paiml-mcp-agent-toolkit etc.) extend coverage via config only.
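
The pruning rules above can be illustrated with a minimal sketch. This is not the helper merged in this PR: the `mode` argument, the `DRY_RUN` flag, the `STALE_DAYS` default of 7, and the use of each directory's own mtime as the staleness signal are all assumptions made for illustration. It also treats every subdir other than `main` and `debug` as a PR# dir.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the pruning rules; NOT the script merged in this PR.

BIND_MOUNT_ROOTS="${BIND_MOUNT_ROOTS:-/mnt/nvme-raid0/targets/aprender-ci}"
STALE_DAYS="${STALE_DAYS:-7}"   # assumed default; the PR does not state one
DRY_RUN="${DRY_RUN:-1}"         # assumed flag; 1 = only print candidates

prune_bind_mount_target_roots() {
  local mode="$1"               # "nightly" or "pre-job" (assumed argument)
  local threshold_min root dir name

  if [ "$mode" = "nightly" ]; then
    threshold_min=$(( STALE_DAYS * 24 * 60 ))
  else
    threshold_min=60            # pre-job floor: spare fresh in-flight dirs
  fi

  # Unquoted on purpose: BIND_MOUNT_ROOTS is a space-separated list.
  for root in $BIND_MOUNT_ROOTS; do
    [ -d "$root" ] || continue
    for dir in "$root"/*/; do
      [ -d "$dir" ] || continue          # also skips an unmatched glob
      dir="${dir%/}"
      name="${dir##*/}"
      case "$name" in
        main)  continue ;;               # always preserved
        debug) ;;                        # always pruned, regardless of age
        *)
          # Keep the dir unless its mtime is older than the threshold.
          if [ -z "$(find "$dir" -maxdepth 0 -mmin +"$threshold_min")" ]; then
            continue
          fi
          ;;
      esac
      if [ "$DRY_RUN" = "1" ]; then
        echo "would prune: $dir"
      else
        rm -rf -- "$dir"
        echo "pruned: $dir"
      fi
    done
  done
}
```

Calling `prune_bind_mount_target_roots nightly` with `DRY_RUN=1` only lists candidates, which mirrors the dry-run check described under Deployment. Covering a second fleet would then be a matter of appending its root to `BIND_MOUNT_ROOTS`.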

Deployment

Deployed to intel 2026-04-23T12:58Z alongside the PR #1001 version update (intel was still on an older build). Nightly dry-run confirmed no unexpected candidates under the new path after manual cleanup.

```
$ sudo md5sum /usr/local/bin/runner-disk-guard.sh
921e055c55a2c8f1838aac6809d60840 /usr/local/bin/runner-disk-guard.sh
$ md5sum scripts/runner-infra/runner-disk-guard.sh
921e055c55a2c8f1838aac6809d60840 scripts/runner-infra/runner-disk-guard.sh
```
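
The manual checksum comparison above could be scripted so that a deploy step fails on a mismatch. The following is a minimal sketch; the function name `same_md5` is hypothetical and not part of this PR.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if both files are readable and
# have identical MD5 digests.
same_md5() {
  local a b
  [ -r "$1" ] && [ -r "$2" ] || return 2   # unreadable or missing file
  a=$(md5sum < "$1" | cut -d' ' -f1)
  b=$(md5sum < "$2" | cut -d' ' -f1)
  [ "$a" = "$b" ]
}

# Example use (the deployed path typically needs sudo to read):
#   same_md5 /usr/local/bin/runner-disk-guard.sh \
#            scripts/runner-infra/runner-disk-guard.sh \
#     && echo "deployed copy matches repo"
```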

Test plan

  • `bash -n` syntax-check passes
  • Manual nightly dry-run on intel emits expected log lines ("nightly: / at 61% …") with no unintended prunes
  • CI must pass (`ci / gate` + `workspace-test`)
  • Next full-disk recovery cycle should keep intel online without manual intervention
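
The dry-run log line quoted above reports usage of the root filesystem. As a rough sketch of how such a percentage could be read (the function name `disk_usage_pct` is hypothetical, and the actual script may compute it differently), POSIX `df -P` output can be parsed like this:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the used-space percentage (digits only)
# of the filesystem that contains the given path.
disk_usage_pct() {
  # -P forces the single-line POSIX format; column 5 is "Capacity", e.g. "61%".
  df -P "$1" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Example use:
#   echo "nightly: / at $(disk_usage_pct /)%"
```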

🤖 Generated with Claude Code

The disk-guard added in #1001 walked only /home/noah/data/actions-runner*/_work/*/target/
— runner-workspace target dirs totalling ~75G across 8 runners. The actual runner-disk-
fill source that took intel offline on 2026-04-23 was /mnt/nvme-raid0/targets/aprender-ci/*:
per-PR bind-mount target dirs from ci.yml's task-#134 isolation, holding 1.9T including a
359G orphan `debug/` dir from pre-isolation era. Disk-guard never touched them.

Adds new BIND_MOUNT_ROOTS (default `/mnt/nvme-raid0/targets/aprender-ci`) and a
prune_bind_mount_target_roots() helper:

- Always prunes `debug/` subdir (orphan, no current workflow bind-mounts it).
- Prunes PR# subdirs stale past a minute threshold (nightly: STALE_DAYS×24×60 min;
  pre-job: 60-min floor so fresh in-flight dirs survive full-disk recovery).
- Preserves `main` (push-to-main CI reuses it).

Space-separated BIND_MOUNT_ROOTS env var lets the same script cover sibling fleets
(sovereign-ci-paiml-mcp-agent-toolkit etc.) without code changes.

Deployed to intel 2026-04-23T12:58Z; nightly dry-run confirmed no unexpected prune
candidates under the new path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) April 23, 2026 13:01
noahgift added 28 commits May 12, 2026 10:04
noahgift merged commit 51c5b43 into main May 14, 2026
10 checks passed
noahgift deleted the chore/disk-guard-bind-mount-coverage branch May 14, 2026 02:00