Managed/shared Hermes runtime: atomic writes recreate SKILL.md and .bundled_manifest as 0600, causing permission-denied failures #14181

@givani30

Summary
In a managed/shared-runtime deployment, Hermes can recreate files inside HERMES_HOME with owner-private permissions (0600) even when the deployment expects group-shared access. This shows up most clearly for:

  • ~/.hermes/skills/**/SKILL.md written via skill_manage
  • ~/.hermes/skills/.bundled_manifest written via skills sync

On a system where the gateway runs as one user and interactive sessions may touch the same HERMES_HOME via another user in the same group, this leads to intermittent permission-denied failures when Hermes later scans or loads skills.

This is not just old drift: files are actively recreated with 0600 after being normalized.

Observed symptoms

  • Repeated permission-denied warnings when loading skills, e.g.:
    • Failed to parse skill file .../skills/smart-home/homeassistant-on-this-box/SKILL.md: [Errno 13] Permission denied
  • .bundled_manifest reappears as 0600 after normalization
  • SKILL.md files created/edited through Hermes can end up 0600

Environment

  • NixOS managed deployment
  • Shared HERMES_HOME under /var/lib/hermes/.hermes
  • Gateway/service runs as user hermes
  • Interactive SSH sessions may run Hermes as a different user but point at the same HERMES_HOME
  • Group-sharing expected via hermes:hermes ownership and group-readable/group-writable runtime files

Why this seems upstream, not just local policy
The local deployment shape is specific, but the file-creation bug is general:

  • atomic write helpers use tempfile.mkstemp(...)
  • mkstemp creates temp files with 0600
  • os.replace() preserves the temp file mode
  • result: target files silently collapse to 0600 unless chmod is explicitly restored after replace
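The chain above can be checked in isolation. The following is a minimal sketch, not the Hermes code itself: `naive_atomic_write` is a hypothetical stand-in for any mkstemp + os.replace writer, and the demo assumes a POSIX filesystem.

```python
import os
import stat
import tempfile
from pathlib import Path
from typing import Tuple


def naive_atomic_write(file_path: Path, content: str) -> None:
    """Hypothetical stand-in for a mkstemp + os.replace writer."""
    fd, temp_path = tempfile.mkstemp(dir=str(file_path.parent))
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(content)
    # The rename swaps in the temp file's inode, including its 0600 mode.
    os.replace(temp_path, file_path)


def mode_of(path: Path) -> int:
    """Return only the permission bits of a file."""
    return stat.S_IMODE(path.stat().st_mode)


def demo() -> Tuple[int, int]:
    """Normalize a file to 0660, rewrite it, and report both modes."""
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "SKILL.md"
        target.write_text("v1", encoding="utf-8")
        os.chmod(target, 0o660)  # simulate the normalized, group-shared state
        before = mode_of(target)
        naive_atomic_write(target, "v2")
        after = mode_of(target)
        return before, after


before, after = demo()
print(oct(before), oct(after))  # on POSIX: 0o660 0o600
```

The target was explicitly 0660 before the rewrite, so the drop to 0600 comes from the write path alone and not from umask or earlier drift.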

Affected code paths

  1. tools/skill_manager_tool.py

Current atomic writer:

def _atomic_write_text(file_path: Path, content: str, encoding: str = "utf-8") -> None:
    ...
    file_path.parent.mkdir(parents=True, exist_ok=True)
    fd, temp_path = tempfile.mkstemp(
        dir=str(file_path.parent),
        prefix=f".{file_path.name}.tmp.",
        suffix="",
    )
    try:
        with os.fdopen(fd, "w", encoding=encoding) as f:
            f.write(content)
        os.replace(temp_path, file_path)
    ...

This path is used for SKILL.md writes and edits.

  2. tools/skills_sync.py

Current manifest writer:

def _write_manifest(entries: Dict[str, str]):
    ...
        fd, tmp_path = tempfile.mkstemp(
            dir=str(MANIFEST_FILE.parent),
            prefix=".bundled_manifest_",
            suffix=".tmp",
        )
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp_path, MANIFEST_FILE)
    ...

This recreates .bundled_manifest as 0600.

Relevant context in config code
The config layer already acknowledges that managed installs want different permissions:

def _secure_dir(path):
    ...
    Skipped in managed mode -- the NixOS module sets group-readable
    permissions (0750) so interactive users in the hermes group can
    share state with the gateway service.
...
def _secure_file(path):
    ...
    Skipped in managed mode -- the NixOS activation script sets
    group-readable permissions (0640) on config files.
    ...
    if is_managed() or _is_container():
        return

So Hermes already has the concept of managed/shared runtime semantics. The atomic-write paths just do not preserve those semantics.

Reproduction sketch

  1. Use a shared HERMES_HOME with group-based access (e.g. service user + interactive user in same group).
  2. In managed/shared mode, normalize skill files and the manifest to group-readable/writable (e.g. 0660 or 0640, depending on policy).
  3. Trigger one of:
    • create/edit/patch a local skill via skill_manage
    • run bundled skill sync that updates .bundled_manifest
  4. Observe that the rewritten file becomes 0600.
  5. A different process/user sharing the same runtime later fails to read it.

Expected behavior
In managed/shared installations, Hermes should preserve the deployment’s shared-runtime permission model after atomic writes. Rewritten runtime files should not silently fall back to owner-private 0600 unless that is explicitly the intended mode for that file.

Actual behavior
Atomic replacement recreates files with mkstemp’s default 0600 mode.

Suggested fix directions
Option A: make the atomic write helpers preserve target mode if the target already exists

  • stat existing file before replace
  • chmod the temp file (or final file) to match the existing mode

Option B: make atomic writes managed-aware

  • if is_managed():
    • use a managed/shared file mode policy (for example 0660 or 0640, depending on file class)
    • apply chmod after os.replace

Option C: both

  • preserve existing mode when present
  • otherwise use a managed/shared default when in managed mode
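As a rough illustration of Option C, here is a minimal sketch of what a mode-preserving writer could look like. It is not a proposed patch: `atomic_write_text_keep_mode` and its `default_mode` parameter are hypothetical names, and how a caller would derive `default_mode` from is_managed() is left open.

```python
import os
import stat
import tempfile
from pathlib import Path
from typing import Optional


def atomic_write_text_keep_mode(
    file_path: Path,
    content: str,
    default_mode: Optional[int] = None,
    encoding: str = "utf-8",
) -> None:
    """Atomically write text while keeping the target's permission bits.

    default_mode (hypothetical) is applied only when the target does not
    exist yet, e.g. 0o660 in managed mode. None keeps mkstemp's 0600.
    """
    file_path.parent.mkdir(parents=True, exist_ok=True)
    try:
        mode = stat.S_IMODE(file_path.stat().st_mode)  # existing mode wins
    except FileNotFoundError:
        mode = default_mode

    fd, temp_path = tempfile.mkstemp(
        dir=str(file_path.parent),
        prefix=f".{file_path.name}.tmp.",
    )
    try:
        with os.fdopen(fd, "w", encoding=encoding) as f:
            f.write(content)
        if mode is not None:
            # chmod the temp file before the rename so the target never
            # becomes visible with the wrong mode.
            os.chmod(temp_path, mode)
        os.replace(temp_path, file_path)
    except BaseException:
        # Best-effort cleanup of the temp file, then re-raise.
        try:
            os.unlink(temp_path)
        except OSError:
            pass
        raise
```

This only restores permission bits. Group ownership of the new file still depends on the parent directory (for example a setgid directory), which is a separate concern from the mode collapse described here.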

At minimum, the following should probably stop being recreated as 0600 in managed mode:

  • skills/**/SKILL.md
  • skills/.bundled_manifest
  • similar runtime metadata written through atomic temp-file replacement

Why this matters
This breaks a valid deployment model Hermes already partially supports:

  • managed runtime
  • group-shared state
  • service user + interactive/operator access

Even if that deployment is not the default, Hermes already has managed-mode permission branches, so preserving file modes during atomic writes seems like the right invariant.

Local workaround used here
A local NixOS activation step was added to re-normalize:

  • /var/lib/hermes/.hermes/skills/**/SKILL.md -> 0660
  • /var/lib/hermes/.hermes/skills/.bundled_manifest -> 0660
  • runtime dirs -> 2770
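The actual workaround is a NixOS activation step and is not reproduced here. For illustration only, the same normalization could be sketched in Python roughly as follows; `normalize_skills_tree` is a hypothetical helper, and treating every directory under skills/ as a "runtime dir" is an assumption.

```python
import os
import stat
from pathlib import Path

FILE_MODE = 0o660  # modes taken from the list above
DIR_MODE = 0o2770  # setgid, so new entries inherit the directory's group


def normalize_skills_tree(skills_dir: Path) -> int:
    """Re-apply shared modes under skills_dir; return how many paths were chmod-ed."""
    targets = [(skills_dir, DIR_MODE)]
    for path in skills_dir.rglob("*"):
        if path.is_symlink():
            continue  # never chmod through symlinks
        if path.is_dir():
            targets.append((path, DIR_MODE))
        elif path.name == "SKILL.md" or path == skills_dir / ".bundled_manifest":
            targets.append((path, FILE_MODE))

    changed = 0
    for path, mode in targets:
        if stat.S_IMODE(path.stat().st_mode) != mode:
            os.chmod(path, mode)
            changed += 1
    return changed
```

Like the activation step, this only repairs modes after the fact; any later atomic write puts the file back at 0600.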

That mitigates drift, but it is policy cleanup after the fact, not a source-level fix.

Potential follow-up
If useful, I can turn this into a PR by patching:

  • tools/skill_manager_tool.py::_atomic_write_text
  • tools/skills_sync.py::_write_manifest

so that they preserve the existing mode and/or apply managed-mode-safe permissions after replace.

Labels: P2 (Medium: degraded but workaround exists), area/nix (Nix flake, NixOS module, container packaging), tool/skills (Skills system: list, view, manage), type/bug (Something isn't working)