Summary
Five call sites run `os.chmod(path.parent, 0o700)` on a derived path without checking that `path.parent` is a sane directory. If anything makes the resolution land at `/` (e.g. `HERMES_HOME=/`, an env-var concat bug, or a path whose `.parent.parent == .parent`), the call strips traversal permission from the root inode and bricks the entire host: every non-root user (`systemd-resolve`, `systemd-network`, `syslog`, `nobody`, …) fails any path lookup with `EACCES`, taking out DNS, networking, journald, rsyslog and every Docker container that drops privileges.
We hit this in production today. Root cause took 5+ hours to isolate because root keeps working (`CAP_DAC_OVERRIDE`) and the symptom is a cascade: systemd-resolved watchdogs out → restarts fail with `200/CHDIR` → systemd-networkd and timesyncd follow → SSH+ICMP keep working → graceful reboot hangs → recovery via Hetzner rescue. Fix: `chmod 755 /` on the mounted FS.
We could not isolate the exact triggering call (no auditd was running), but the pattern is the same across all five sites, and the catastrophic failure mode (`chmod("/", 0o700)`) is reachable from any of them under the right env.
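Because root bypasses the permission check, this failure mode is easier to confirm by inspecting the mode bits of `/` directly than by testing access as root. A minimal sketch, assuming a POSIX host; the function names are hypothetical and not part of Hermes:

```python
import os
import stat


def others_can_traverse(mode: int) -> bool:
    """Return True if the 'other' execute (search) bit is set in a mode."""
    return bool(mode & stat.S_IXOTH)


def dir_mode(path: str = "/") -> int:
    """Return the permission bits of a directory (defaults to the root inode)."""
    return stat.S_IMODE(os.stat(path).st_mode)


if __name__ == "__main__":
    mode = dir_mode()
    status = "ok" if others_can_traverse(mode) else "BROKEN: non-root users cannot traverse /"
    print(f"/ has mode {oct(mode)}: {status}")
```

From a rescue system, pass the mount point of the broken filesystem to `dir_mode` instead of `/`.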
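Affected call sites (HEAD 26933c2)
- `tools/mcp_oauth.py:179`: `_write_json` (MCP OAuth tokens)
- `hermes_cli/auth.py:991`: `_save_auth_store` (Hermes auth store)
- `hermes_cli/auth.py:1573`: `_save_qwen_cli_tokens`
- `hermes_cli/auth.py:2991`: `_save_nous_shared_token` (uses `HERMES_SHARED_AUTH_DIR`)
- `agent/google_oauth.py:495`: `save_credentials` (Google OAuth)

All share this shape: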

    path.parent.mkdir(parents=True, exist_ok=True)
    try:
        os.chmod(path.parent, 0o700)
    except OSError:
        pass

`os.chmod` raises no exception when chmodding `/` as root; the call succeeds, so the `except OSError: pass` never comes into play.
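The silent success can be reproduced safely on a scratch directory. The sketch below runs the same shape against a temporary directory (never against `/`); `save_token` is a hypothetical stand-in for the affected functions, and POSIX permission semantics are assumed:

```python
import os
import stat
import tempfile
from pathlib import Path
from typing import Tuple


def save_token(path: Path) -> None:
    """Hypothetical stand-in that mirrors the shape used at the affected call sites."""
    path.parent.mkdir(parents=True, exist_ok=True)
    try:
        os.chmod(path.parent, 0o700)  # applied to whatever the parent happens to be
    except OSError:
        pass
    path.write_text("{}")


def demo() -> Tuple[int, int]:
    """Return the parent directory's mode before and after save_token runs."""
    with tempfile.TemporaryDirectory() as tmp:
        parent = Path(tmp) / "shared"
        parent.mkdir()
        os.chmod(parent, 0o755)  # a directory other users are meant to traverse
        before = stat.S_IMODE(parent.stat().st_mode)
        save_token(parent / "auth.json")
        after = stat.S_IMODE(parent.stat().st_mode)
        return before, after
```

Nothing is raised or logged: the directory simply ends up at `0o700`, which is the same outcome the root inode would get if `path.parent` were `/`.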
Trigger
Any of:
- `HERMES_HOME=/` (or other env vars consumed by `_qwen_cli_auth_path` / `_credentials_path` / `_nous_shared_auth_dir` set to `/`)
- A bug elsewhere that resolves a token storage path to a top-level filename (e.g. `Path("/auth.json")`)

The path filter `_safe_filename` in `tools/mcp_oauth.py` sanitises the filename but not the directory, so a malformed `HERMES_HOME` is not caught.
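Both triggers reduce to the same path arithmetic, which can be checked without touching the filesystem. A small sketch using `PurePosixPath`; `parent_to_chmod` and its default filename are illustrative assumptions, not Hermes code:

```python
from pathlib import PurePosixPath


def parent_to_chmod(home: str, filename: str = "auth.json") -> PurePosixPath:
    """Return the directory the affected shape would chmod for a given home."""
    return (PurePosixPath(home) / filename).parent


# A sane home resolves to itself.
assert parent_to_chmod("/home/alice/.hermes") == PurePosixPath("/home/alice/.hermes")
# HERMES_HOME=/ makes the parent of the token file the root directory.
assert parent_to_chmod("/") == PurePosixPath("/")
# The root is its own parent, so walking further up never leaves it.
assert PurePosixPath("/").parent == PurePosixPath("/")
# An empty prefix in a string concatenation produces the same result.
assert PurePosixPath("" + "/auth.json").parent == PurePosixPath("/")
```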
Proposed fix
A single helper, used at every site:

    import os
    from pathlib import Path


    def _secure_dir_safe(d: Path) -> None:
        """chmod 0o700 a dir if and only if it's safe to do so."""
        d = d.resolve()
        if d == Path("/") or d == Path(d.anchor) or len(d.parts) < 2:
            # Refuse to chmod the filesystem root.
            return
        # Also refuse common system roots if anyone got here.
        if d in (Path("/etc"), Path("/var"), Path("/usr"), Path("/home"), Path("/root"), Path("/opt"), Path("/tmp")):
            return
        try:
            os.chmod(d, 0o700)
        except OSError:
            pass

In addition, a startup assert or a loud `logger.error` if `get_hermes_home()` returns `Path("/")` or any of the system directories listed in the helper; strict mode would be better than silent corruption.
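One possible shape for that startup check is sketched below. `validate_hermes_home`, `UnsafeHermesHomeError` and the denylist constant are hypothetical names; the sketch takes the home directory as a parameter rather than assuming how `get_hermes_home()` is implemented:

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

# Assumed denylist: the root plus the system directories named in the helper.
_UNSAFE_HOMES = frozenset(
    Path(p) for p in ("/", "/etc", "/var", "/usr", "/home", "/root", "/opt", "/tmp")
)


class UnsafeHermesHomeError(RuntimeError):
    """Raised when the configured home is the root or a shared system directory."""


def validate_hermes_home(home: Path) -> Path:
    """Return the resolved home, or log and raise if it is an unsafe location."""
    resolved = home.resolve()
    if resolved in _UNSAFE_HOMES:
        logger.error("Refusing to use %s as HERMES_HOME", resolved)
        raise UnsafeHermesHomeError(f"unsafe HERMES_HOME: {resolved}")
    return resolved
```

Failing at startup keeps the bad value from ever reaching a write path. On hosts where one of the listed directories is a symlink, the denylist entries would need to be resolved as well.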
Related
- `_secure_dir()` resets `~/.hermes` to 0700 (closed by fix(config): use 0o701 for HERMES_HOME to allow web server traversal #7003, but addressed only `~/.hermes`, not the five sites above).

Happy to send a PR if there's interest in this exact helper signature.