Skip to content

fix(#25821): refuse to chmod top-level dirs from 5 token-storage call sites#27384

Closed
YonganZhang wants to merge 1 commit into
NousResearch:mainfrom
YonganZhang:yongan/fix-25821-chmod-bricks-host
Closed

fix(#25821): refuse to chmod top-level dirs from 5 token-storage call sites#27384
YonganZhang wants to merge 1 commit into
NousResearch:mainfrom
YonganZhang:yongan/fix-25821-chmod-bricks-host

Conversation

@YonganZhang

Copy link
Copy Markdown

Summary

Fixes #25821.

The original issue describes a production outage caused by os.chmod(path.parent, 0o700) succeeding silently against / and stripping traversal permission from the root inode — bricking every non-root user on the host. Root cause took 5+ hours to isolate.

There are five call sites with the same shape:

  • tools/mcp_oauth.py — MCP OAuth token storage
  • agent/google_oauth.py — Google / Gemini OAuth
  • hermes_cli/auth.py ×3 — Hermes auth store, Qwen CLI tokens, HERMES_SHARED_AUTH_DIR shared token

All five become unsafe the moment a derived path resolves to a top-level directory (empty HERMES_HOME, env-var concat bug, etc.).

What this PR does

  1. Adds _secure_dir_safe(path) to hermes_cli/config.py, next to the existing _secure_dir. It:

    • Resolves the path first so symlink games can't smuggle / past a string check.
    • Refuses single-component paths (len(parts) < 2), the filesystem anchor (Path("/"), C:\), and a blocklist of known system roots: /etc /var /usr /home /root /opt /tmp /sys /proc /dev /boot /lib /lib64 /run /srv /mnt /media.
    • Always uses 0o700 — no HERMES_HOME_MODE override, because token stores must not be group-readable regardless of deployment profile.
    • Honors is_managed() for parity with the existing helper (NixOS module sets group-readable perms intentionally).
  2. Switches all five call sites to use the new helper. They each lose the surrounding try / except OSError / pass boilerplate because _secure_dir_safe already swallows OSError/NotImplementedError internally.

Behavior trip-tests (inline Python)

path.parent resolves to Before After
/home/alice/.hermes/auth chmod 0o700 ✅ chmod 0o700 ✅
/ chmod 0o700 — bricks host 🔴 silently refused
/etc chmod 0o700 — bricks /etc traversal 🔴 silently refused
symlink: /tmp/x/ chmod 0o700 — bricks host 🔴 resolve sees /, refused
/home (single mid-level) chmod 0o700 — disables /home/* for everyone 🔴 silently refused
C:\ (Windows anchor) no-op (chmod no-op on Windows) explicitly refused before chmod ✅
Non-existent ~/.hermes/auth OSError swallowed resolve raises → return without chmod ✅

Scope

cron/jobs.py has its own local _secure_dir that should probably be unified with the new safe variant, but is intentionally out of scope for this PR — its callers operate on CRON_DIR / OUTPUT_DIR, neither of which is derived from user-controlled env vars, so the bricking trigger doesn't apply there today. Happy to follow up with a unify-helpers PR if maintainers want it cleaned in one direction.

Related

The issue author offered to send a PR — opening this one in case it's still useful. Happy to iterate on the helper signature or scope as needed.

…storage call sites

Fixes NousResearch#25821.

`os.chmod(path.parent, 0o700)` is called on a derived path at five
token-storage sites without checking that `path.parent` is a sane
directory. If anything makes the resolution land at `/` (e.g.
`HERMES_HOME=/`, an env-var concat bug, or a path whose
`.parent.parent == .parent`), chmod silently succeeds and strips
traversal permission from the root inode, bricking the host: every
non-root user fails any path lookup with `EACCES`, taking out
systemd-resolved, systemd-networkd, syslog, every Docker container
that drops privileges, and graceful reboot. Root keeps working
(CAP_DAC_OVERRIDE) so the symptom cascades for hours before recovery
via rescue mode. (See the issue body for the full incident timeline.)

This adds `_secure_dir_safe(path)` to `hermes_cli/config.py` (next to the
existing `_secure_dir`) that:

  * Resolves the path first (so symlink games can't smuggle `/` past
    a string check).
  * Refuses single-component paths (`len(parts) < 2`), the filesystem
    anchor (`Path("/")`, `C:\\`), and a blocklist of known system
    roots (`/etc`, `/var`, `/usr`, `/home`, `/root`, `/opt`, `/tmp`,
    `/sys`, `/proc`, `/dev`, `/boot`, `/lib`, `/lib64`, `/run`,
    `/srv`, `/mnt`, `/media`).
  * Always uses 0o700, no `HERMES_HOME_MODE` override — token stores
    must not be traversable regardless of deployment profile.
  * Honors `is_managed()` for parity with the existing `_secure_dir`
    (NixOS module sets group-readable permissions).

Switches all five identified call sites to use the new helper:

  * `tools/mcp_oauth.py` (MCP OAuth token storage)
  * `agent/google_oauth.py` (Google/Gemini OAuth)
  * `hermes_cli/auth.py` ×3 (Hermes auth store, Qwen CLI tokens,
    `HERMES_SHARED_AUTH_DIR` nous shared token)

A misresolved `HERMES_HOME` is still a bug worth diagnosing — strict-mode
logger.error on top-level resolution is left as a follow-up. The
priority here is to make the catastrophic outcome ("brick the host
silently") unreachable from any of these five sites.

`cron/jobs.py` has its own local `_secure_dir` helper that should
probably be unified with this one, but is intentionally out of scope
for this PR (its callers operate on `CRON_DIR`/`OUTPUT_DIR` which are
not derived from user-controlled env vars, so the bricking trigger
does not apply there today).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@alt-glitch alt-glitch added type/security Security vulnerability or hardening P1 High — major feature broken, no workaround area/auth Authentication, OAuth, credential pools comp/cli CLI entry point, hermes_cli/, setup wizard comp/plugins Plugin system and bundled plugins tool/mcp MCP client and OAuth labels May 17, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Duplicate of #25831 and #25841 — both are already open PRs addressing the same #25821 os.chmod(path.parent) vulnerability with a very similar approach (safe guard function + blocklist of system dirs at the same 5 call sites). See also #26003 which was already closed as a duplicate of these.

@YonganZhang

Copy link
Copy Markdown
Author

Closing as duplicate. @alt-glitch correctly flagged this overlaps with #25831 and #25841 — both already addressing #25821 with the same 5-site coverage and (in #25831's case) a tests/test_hermes_constants.py. My version put the helper in hermes_cli/config.py alongside the existing _secure_dir; the canonical PRs use hermes_constants.py instead, which is the more conservative module-placement choice.

Apologies for not running gh pr list --search "fixes #25821" before opening — would have caught this in 30 seconds.

I'll add an observation on #25841 in case the helper's blocklist is worth widening.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/auth Authentication, OAuth, credential pools comp/cli CLI entry point, hermes_cli/, setup wizard comp/plugins Plugin system and bundled plugins P1 High — major feature broken, no workaround tool/mcp MCP client and OAuth type/security Security vulnerability or hardening

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: os.chmod(path.parent, 0o700) bricks host when path.parent resolves to /

2 participants