Skip to content

feat(checkpoints): auto-prune orphan and stale shadow repos at startup#16303

Merged
teknium1 merged 1 commit into
mainfrom
hermes/hermes-5de18fdc
Apr 27, 2026
Merged

feat(checkpoints): auto-prune orphan and stale shadow repos at startup#16303
teknium1 merged 1 commit into
mainfrom
hermes/hermes-5de18fdc

Conversation

@teknium1

Copy link
Copy Markdown
Contributor

Stops ~/.hermes/checkpoints/ from growing forever (~12 GB / 1000+ repos typical on active machines) by adding opt-in startup cleanup of orphan and stale shadow git repos. Follow-up to #16286.

Why this is needed

Every working directory the agent ever touches gets its own shadow git repo at ~/.hermes/checkpoints/{sha256(abs_dir)[:16]}/ (one-per-project, for /rollback). The per-repo _prune() in CheckpointManager is a literal no-op — the code comment says so: "For simplicity, we don't actually prune — git's pack mechanism handles this efficiently." Pack dedupe is great inside a repo but doesn't help with repos whose working dirs were deleted, moved, or were one-off /tmp directories. They sit on disk forever.

This was the biggest offender called out in #8685's field report (the NanoClaw port we closed during #3015 triage) but intentionally left out of #16286's scope. Now wired through the same auto_prune opt-in pattern.

Design

Rides on the same contract the sessions.auto_prune block uses (#13861 / #16286):

  • Opt-in default off. Users who rely on /rollback against long-ago sessions never lose data silently.
  • Idempotency marker. ~/.hermes/checkpoints/.last_prune holds the last-run epoch; subsequent calls within min_interval_hours short-circuit.
  • Never raises. Maintenance must not block interactive startup.
  • Two deletion criteria, in priority order:
    1. OrphanHERMES_WORKDIR marker inside the repo points to a path that no longer exists on disk.
    2. Stale — repo's newest in-repo mtime is older than retention_days (walks refs/objects/HEAD because git pack ops can leave the top-level dir mtime stale).

Wired into both CLI (HermesCLI.__init__) and gateway (GatewayRunner.__init__) startup hooks, right next to the existing session-maintenance block.

Changes

  • tools/checkpoint_manager.py (+201): prune_checkpoints(), maybe_auto_prune_checkpoints(), _read_workdir_marker(), _shadow_repo_newest_mtime() helpers. Returns {scanned, deleted_orphan, deleted_stale, errors, bytes_freed}.
  • hermes_cli/config.py (+13): checkpoints.auto_prune: false, retention_days: 30, delete_orphans: true, min_interval_hours: 24 in DEFAULT_CONFIG. No version bump needed (nested new keys are picked up by _deep_merge).
  • cli.py (+28): _run_checkpoint_auto_maintenance() helper + startup call.
  • gateway/run.py (+16): parallel startup block, shares the existing config loader.
  • website/docs/user-guide/checkpoints-and-rollback.md (+10): documents the new knobs.
  • tests/tools/test_checkpoint_manager.py (+190): 11 new tests — TestPruneCheckpoints (7) + TestMaybeAutoPruneCheckpoints (4).

Validation

Before After
Default behavior ~/.hermes/checkpoints/ grows forever Unchanged (opt-in)
checkpoints.auto_prune: true startup No cleanup Orphan + stale repos deleted, min_interval_hours gated
Targeted tests 11/11 pass (64/64 full file)
Related suites 245/245 pass (test_checkpoint_manager.py + test_hermes_state.py + test_checkpoint_resumption.py)
E2E with real import + temp HERMES_HOME Alive repo preserved, orphan + stale deleted, 100 KB reclaimed, interval gate honored, forced re-run works

Refs #3015. Closes the remaining disk-growth gap from #8685 that we left out of scope in #16286.

@github-actions

Copy link
Copy Markdown
Contributor

🚨 CRITICAL Supply Chain Risk Detected

This PR contains a pattern that has been used in real supply chain attacks. A maintainer must review the flagged code carefully before merging.

🚨 CRITICAL: Install-hook file added or modified

These files can execute code during package installation or interpreter startup.

Files:

hermes_cli/setup.py

Scanner only fires on high-signal indicators: .pth files, base64+exec/eval combos, subprocess with encoded commands, or install-hook files. Low-signal warnings were removed intentionally — if you're seeing this comment, the finding is worth inspecting.

Every working dir hermes ever touches gets its own shadow git repo under
~/.hermes/checkpoints/{sha256(abs_dir)[:16]}/.  The per-repo _prune is a
no-op (comment in CheckpointManager._prune says so), so abandoned repos
from deleted/moved projects or one-off tmp dirs pile up forever.  Field
reports put the typical offender at 1000+ repos / ~12 GB on active
contributor machines.

Adds an opt-in startup sweep that mirrors the sessions.auto_prune
pattern from #13861 / #16286:

- tools/checkpoint_manager.py: new prune_checkpoints() and
  maybe_auto_prune_checkpoints() helpers.  Deletes shadow repos that
  are orphan (HERMES_WORKDIR marker points to a path that no longer
  exists) or stale (newest in-repo mtime older than retention_days).
  Idempotent via a CHECKPOINT_BASE/.last_prune marker file so it only
  runs once per min_interval_hours regardless of how many hermes
  processes start up.
- hermes_cli/config.py: new checkpoints.auto_prune /
  retention_days / delete_orphans / min_interval_hours knobs.
  Default auto_prune: false so users who rely on /rollback against
  long-ago sessions never lose data silently.
- cli.py / gateway/run.py: startup hooks gated on checkpoints.auto_prune,
  called right next to the existing state.db maintenance block.
- Docs updated with the new config knobs.
- 11 regression tests: orphan/stale deletion, precedence, byte-freed
  tracking, non-shadow dir skip, interval gating, corrupt marker
  recovery.

Refs #3015 (session-file disk growth was fixed in #16286; this covers
the checkpoint side noted out-of-scope there).
@teknium1 teknium1 force-pushed the hermes/hermes-5de18fdc branch from a082462 to ecae599 Compare April 27, 2026 02:05
@github-actions

Copy link
Copy Markdown
Contributor

🚨 CRITICAL Supply Chain Risk Detected

This PR contains a pattern that has been used in real supply chain attacks. A maintainer must review the flagged code carefully before merging.

🚨 CRITICAL: Install-hook file added or modified

These files can execute code during package installation or interpreter startup.

Files:

hermes_cli/setup.py

Scanner only fires on high-signal indicators: .pth files, base64+exec/eval combos, subprocess with encoded commands, or install-hook files. Low-signal warnings were removed intentionally — if you're seeing this comment, the finding is worth inspecting.

@teknium1 teknium1 merged commit 478444c into main Apr 27, 2026
10 of 12 checks passed
@teknium1 teknium1 deleted the hermes/hermes-5de18fdc branch April 27, 2026 02:05
@alt-glitch alt-glitch added type/feature New feature or request P3 Low — cosmetic, nice to have comp/cli CLI entry point, hermes_cli/, setup wizard comp/gateway Gateway runner, session dispatch, delivery labels Apr 27, 2026
ulasbilgen pushed a commit to ulasbilgen/hermes-adhd-agent that referenced this pull request May 1, 2026
NousResearch#16303)

Every working dir hermes ever touches gets its own shadow git repo under
~/.hermes/checkpoints/{sha256(abs_dir)[:16]}/.  The per-repo _prune is a
no-op (comment in CheckpointManager._prune says so), so abandoned repos
from deleted/moved projects or one-off tmp dirs pile up forever.  Field
reports put the typical offender at 1000+ repos / ~12 GB on active
contributor machines.

Adds an opt-in startup sweep that mirrors the sessions.auto_prune
pattern from NousResearch#13861 / NousResearch#16286:

- tools/checkpoint_manager.py: new prune_checkpoints() and
  maybe_auto_prune_checkpoints() helpers.  Deletes shadow repos that
  are orphan (HERMES_WORKDIR marker points to a path that no longer
  exists) or stale (newest in-repo mtime older than retention_days).
  Idempotent via a CHECKPOINT_BASE/.last_prune marker file so it only
  runs once per min_interval_hours regardless of how many hermes
  processes start up.
- hermes_cli/config.py: new checkpoints.auto_prune /
  retention_days / delete_orphans / min_interval_hours knobs.
  Default auto_prune: false so users who rely on /rollback against
  long-ago sessions never lose data silently.
- cli.py / gateway/run.py: startup hooks gated on checkpoints.auto_prune,
  called right next to the existing state.db maintenance block.
- Docs updated with the new config knobs.
- 11 regression tests: orphan/stale deletion, precedence, byte-freed
  tracking, non-shadow dir skip, interval gating, corrupt marker
  recovery.

Refs NousResearch#3015 (session-file disk growth was fixed in NousResearch#16286; this covers
the checkpoint side noted out-of-scope there).
donald131 pushed a commit to donald131/hermes-agent that referenced this pull request May 2, 2026
NousResearch#16303)

Every working dir hermes ever touches gets its own shadow git repo under
~/.hermes/checkpoints/{sha256(abs_dir)[:16]}/.  The per-repo _prune is a
no-op (comment in CheckpointManager._prune says so), so abandoned repos
from deleted/moved projects or one-off tmp dirs pile up forever.  Field
reports put the typical offender at 1000+ repos / ~12 GB on active
contributor machines.

Adds an opt-in startup sweep that mirrors the sessions.auto_prune
pattern from NousResearch#13861 / NousResearch#16286:

- tools/checkpoint_manager.py: new prune_checkpoints() and
  maybe_auto_prune_checkpoints() helpers.  Deletes shadow repos that
  are orphan (HERMES_WORKDIR marker points to a path that no longer
  exists) or stale (newest in-repo mtime older than retention_days).
  Idempotent via a CHECKPOINT_BASE/.last_prune marker file so it only
  runs once per min_interval_hours regardless of how many hermes
  processes start up.
- hermes_cli/config.py: new checkpoints.auto_prune /
  retention_days / delete_orphans / min_interval_hours knobs.
  Default auto_prune: false so users who rely on /rollback against
  long-ago sessions never lose data silently.
- cli.py / gateway/run.py: startup hooks gated on checkpoints.auto_prune,
  called right next to the existing state.db maintenance block.
- Docs updated with the new config knobs.
- 11 regression tests: orphan/stale deletion, precedence, byte-freed
  tracking, non-shadow dir skip, interval gating, corrupt marker
  recovery.

Refs NousResearch#3015 (session-file disk growth was fixed in NousResearch#16286; this covers
the checkpoint side noted out-of-scope there).
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
NousResearch#16303)

Every working dir hermes ever touches gets its own shadow git repo under
~/.hermes/checkpoints/{sha256(abs_dir)[:16]}/.  The per-repo _prune is a
no-op (comment in CheckpointManager._prune says so), so abandoned repos
from deleted/moved projects or one-off tmp dirs pile up forever.  Field
reports put the typical offender at 1000+ repos / ~12 GB on active
contributor machines.

Adds an opt-in startup sweep that mirrors the sessions.auto_prune
pattern from NousResearch#13861 / NousResearch#16286:

- tools/checkpoint_manager.py: new prune_checkpoints() and
  maybe_auto_prune_checkpoints() helpers.  Deletes shadow repos that
  are orphan (HERMES_WORKDIR marker points to a path that no longer
  exists) or stale (newest in-repo mtime older than retention_days).
  Idempotent via a CHECKPOINT_BASE/.last_prune marker file so it only
  runs once per min_interval_hours regardless of how many hermes
  processes start up.
- hermes_cli/config.py: new checkpoints.auto_prune /
  retention_days / delete_orphans / min_interval_hours knobs.
  Default auto_prune: false so users who rely on /rollback against
  long-ago sessions never lose data silently.
- cli.py / gateway/run.py: startup hooks gated on checkpoints.auto_prune,
  called right next to the existing state.db maintenance block.
- Docs updated with the new config knobs.
- 11 regression tests: orphan/stale deletion, precedence, byte-freed
  tracking, non-shadow dir skip, interval gating, corrupt marker
  recovery.

Refs NousResearch#3015 (session-file disk growth was fixed in NousResearch#16286; this covers
the checkpoint side noted out-of-scope there).
dannyJ848 pushed a commit to dannyJ848/hermes-agent that referenced this pull request May 17, 2026
NousResearch#16303)

Every working dir hermes ever touches gets its own shadow git repo under
~/.hermes/checkpoints/{sha256(abs_dir)[:16]}/.  The per-repo _prune is a
no-op (comment in CheckpointManager._prune says so), so abandoned repos
from deleted/moved projects or one-off tmp dirs pile up forever.  Field
reports put the typical offender at 1000+ repos / ~12 GB on active
contributor machines.

Adds an opt-in startup sweep that mirrors the sessions.auto_prune
pattern from NousResearch#13861 / NousResearch#16286:

- tools/checkpoint_manager.py: new prune_checkpoints() and
  maybe_auto_prune_checkpoints() helpers.  Deletes shadow repos that
  are orphan (HERMES_WORKDIR marker points to a path that no longer
  exists) or stale (newest in-repo mtime older than retention_days).
  Idempotent via a CHECKPOINT_BASE/.last_prune marker file so it only
  runs once per min_interval_hours regardless of how many hermes
  processes start up.
- hermes_cli/config.py: new checkpoints.auto_prune /
  retention_days / delete_orphans / min_interval_hours knobs.
  Default auto_prune: false so users who rely on /rollback against
  long-ago sessions never lose data silently.
- cli.py / gateway/run.py: startup hooks gated on checkpoints.auto_prune,
  called right next to the existing state.db maintenance block.
- Docs updated with the new config knobs.
- 11 regression tests: orphan/stale deletion, precedence, byte-freed
  tracking, non-shadow dir skip, interval gating, corrupt marker
  recovery.

Refs NousResearch#3015 (session-file disk growth was fixed in NousResearch#16286; this covers
the checkpoint side noted out-of-scope there).
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
NousResearch#16303)

Every working dir hermes ever touches gets its own shadow git repo under
~/.hermes/checkpoints/{sha256(abs_dir)[:16]}/.  The per-repo _prune is a
no-op (comment in CheckpointManager._prune says so), so abandoned repos
from deleted/moved projects or one-off tmp dirs pile up forever.  Field
reports put the typical offender at 1000+ repos / ~12 GB on active
contributor machines.

Adds an opt-in startup sweep that mirrors the sessions.auto_prune
pattern from NousResearch#13861 / NousResearch#16286:

- tools/checkpoint_manager.py: new prune_checkpoints() and
  maybe_auto_prune_checkpoints() helpers.  Deletes shadow repos that
  are orphan (HERMES_WORKDIR marker points to a path that no longer
  exists) or stale (newest in-repo mtime older than retention_days).
  Idempotent via a CHECKPOINT_BASE/.last_prune marker file so it only
  runs once per min_interval_hours regardless of how many hermes
  processes start up.
- hermes_cli/config.py: new checkpoints.auto_prune /
  retention_days / delete_orphans / min_interval_hours knobs.
  Default auto_prune: false so users who rely on /rollback against
  long-ago sessions never lose data silently.
- cli.py / gateway/run.py: startup hooks gated on checkpoints.auto_prune,
  called right next to the existing state.db maintenance block.
- Docs updated with the new config knobs.
- 11 regression tests: orphan/stale deletion, precedence, byte-freed
  tracking, non-shadow dir skip, interval gating, corrupt marker
  recovery.

Refs NousResearch#3015 (session-file disk growth was fixed in NousResearch#16286; this covers
the checkpoint side noted out-of-scope there).
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
NousResearch#16303)

Every working dir hermes ever touches gets its own shadow git repo under
~/.hermes/checkpoints/{sha256(abs_dir)[:16]}/.  The per-repo _prune is a
no-op (comment in CheckpointManager._prune says so), so abandoned repos
from deleted/moved projects or one-off tmp dirs pile up forever.  Field
reports put the typical offender at 1000+ repos / ~12 GB on active
contributor machines.

Adds an opt-in startup sweep that mirrors the sessions.auto_prune
pattern from NousResearch#13861 / NousResearch#16286:

- tools/checkpoint_manager.py: new prune_checkpoints() and
  maybe_auto_prune_checkpoints() helpers.  Deletes shadow repos that
  are orphan (HERMES_WORKDIR marker points to a path that no longer
  exists) or stale (newest in-repo mtime older than retention_days).
  Idempotent via a CHECKPOINT_BASE/.last_prune marker file so it only
  runs once per min_interval_hours regardless of how many hermes
  processes start up.
- hermes_cli/config.py: new checkpoints.auto_prune /
  retention_days / delete_orphans / min_interval_hours knobs.
  Default auto_prune: false so users who rely on /rollback against
  long-ago sessions never lose data silently.
- cli.py / gateway/run.py: startup hooks gated on checkpoints.auto_prune,
  called right next to the existing state.db maintenance block.
- Docs updated with the new config knobs.
- 11 regression tests: orphan/stale deletion, precedence, byte-freed
  tracking, non-shadow dir skip, interval gating, corrupt marker
  recovery.

Refs NousResearch#3015 (session-file disk growth was fixed in NousResearch#16286; this covers
the checkpoint side noted out-of-scope there).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cli CLI entry point, hermes_cli/, setup wizard comp/gateway Gateway runner, session dispatch, delivery P3 Low — cosmetic, nice to have type/feature New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants