fix: tui in container needs write permission when not running as root by deas · Pull Request #28851 · NousResearch/hermes-agent

deas · 2026-05-19T17:21:38Z

What does this PR do?

When the container does not run as root, the user running the JavaScript (esbuild) process needs write permission on ui-tui/dist. Without it, the TUI build fails:

docker run -it --rm -e HERMES_GID=1000 -e HERMES_UID=1000 docker.io/nousresearch/hermes-agent:v2026.5.16 --tui
...
/opt/hermes/ui-tui/node_modules/esbuild/lib/main.js:1748
  let error = new Error(text);
              ^

Error: Build failed with 1 error:
error: Failed to write to output file: open /opt/hermes/ui-tui/dist/entry.js: permission denied
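
To make the failure mode concrete, here is a minimal sketch of the classic Unix owner/group/other write check. It is an illustration only: the 10000:10000 owner and the 0755 mode are assumptions about the image, and the helper `can_write` is hypothetical. It ignores root's bypass, supplementary groups, and ACLs.

```python
import stat


def can_write(st_uid: int, st_gid: int, st_mode: int, uid: int, gid: int) -> bool:
    """Approximate the write check for a non-root process.

    Exactly one permission class applies: owner, else group, else other.
    """
    if uid == st_uid:
        return bool(st_mode & stat.S_IWUSR)
    if gid == st_gid:
        return bool(st_mode & stat.S_IWGRP)
    return bool(st_mode & stat.S_IWOTH)


# Assumed layout: ui-tui/dist owned by 10000:10000 with mode 0755.
remapped_user = can_write(10000, 10000, 0o755, uid=1000, gid=1000)
build_time_user = can_write(10000, 10000, 0o755, uid=10000, gid=10000)
print(remapped_user)    # False
print(build_time_user)  # True
```

Under those assumptions a process running as 1000:1000 falls into the "other" class, which has no write bit, and that matches the `permission denied` above.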

Related Issue

Fixes #

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
🔒 Security fix
📝 Documentation update
✅ Tests (adding or improving test coverage)
♻️ Refactor (no behavior change)
🎯 New skill (bundled or hub)

Changes Made

Set proper permission on ui-tui/dist

How to Test

docker run -it --rm \
  -e HERMES_GID=1000 -e HERMES_UID=1000 \
  -v "$(pwd)/docker/entrypoint.sh:/opt/hermes/docker/entrypoint.sh" \
  docker.io/nousresearch/hermes-agent:v2026.5.16 --tui

Checklist

Code

I've read the Contributing Guide
My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
I searched for existing PRs to make sure this isn't a duplicate
My PR contains only changes related to this fix/feature (no unrelated commits)
I've run pytest tests/ -q and all tests pass
I've added tests for my changes (required for bug fixes, strongly encouraged for features)
I've tested on my platform:

Documentation & Housekeeping

I've updated relevant documentation (README, docs/, docstrings) — or N/A
I've updated cli-config.yaml.example if I added/changed config keys — or N/A
I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
I've updated tool descriptions/schemas if I changed tool behavior — or N/A

For New Skills

This skill is broadly useful to most users (if bundled) — see Contributing Guide
SKILL.md follows the standard format (frontmatter, trigger conditions, steps, pitfalls)
No external dependencies that aren't already available (prefer stdlib, curl, existing Hermes tools)
I've tested the skill end-to-end: hermes --toolsets skills -q "Use the X skill to do Y"

Screenshots / Logs

…d works (#28851)

When HERMES_UID remaps the hermes user from 10000 to another UID (e.g. matching the host user's UID for bind-mount ergonomics), the TUI launcher's esbuild step fails:

✘ [ERROR] Failed to write to output file: open /opt/hermes/ui-tui/dist/entry.js: permission denied
TUI build failed.

This is because the Dockerfile's build-time `chown -R hermes:hermes` on `/opt/hermes/{.venv,ui-tui,node_modules}` (line 154) wrote UID 10000, and stage2-hook.sh only re-chowned `.venv` on UID remap, leaving the TUI build trees still owned by the old UID.

Extend the stage2 re-chown to include the same set as the build-time chown: `.venv`, `ui-tui`, `node_modules`. These are the runtime-writable trees under $INSTALL_DIR; everything else under /opt/hermes is read-only at runtime so keeping it root-owned is fine.

Original fix targeted docker/entrypoint.sh which is now a deprecated shim; retargeted to docker/stage2-hook.sh where the .venv chown moved during the s6-overlay rework.

Co-authored-by: Andreas Steffan <623481+deas@users.noreply.github.com>
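
The re-chown set described in the commit message can be sketched as a read-only check that reports which runtime-writable trees are still owned by a different UID. This is not the code in docker/stage2-hook.sh (which is a shell script); the function name `trees_needing_rechown` is hypothetical, and the demo uses a temporary directory in place of /opt/hermes.

```python
import os
import tempfile
from pathlib import Path

# Trees the commit message names as runtime-writable under the install dir.
RUNTIME_WRITABLE = (".venv", "ui-tui", "node_modules")


def trees_needing_rechown(install_dir: str, target_uid: int) -> list[str]:
    """Return the runtime-writable trees whose top-level owner differs from target_uid."""
    stale = []
    for name in RUNTIME_WRITABLE:
        tree = Path(install_dir) / name
        if tree.exists() and tree.stat().st_uid != target_uid:
            stale.append(str(tree))
    return stale


# Demo on a throwaway directory (POSIX only): the trees are owned by the current user.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in RUNTIME_WRITABLE:
        (Path(demo_dir) / name).mkdir()
    current_uid = os.getuid()
    same_owner = trees_needing_rechown(demo_dir, current_uid)
    after_remap = trees_needing_rechown(demo_dir, current_uid + 1)  # simulated UID remap

print(same_owner)        # []
print(len(after_remap))  # 3
```

The sketch only inspects the top-level directory of each tree; an actual fix would still need a recursive chown run as root.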

benbarclay · 2026-05-27T05:42:21Z

Salvaged onto current main in #33045 — thanks @deas.

Retargeted to docker/stage2-hook.sh (the .venv chown moved there during the s6-overlay rework — docker/entrypoint.sh is now a deprecated shim). Extended the chown set to mirror the Dockerfile's build-time chown (line 154): .venv, ui-tui, node_modules. Co-authored-by: preserves your attribution.

E2E validation — built an isolated TUI UID-remap harness that forces the build path (HERMES_TUI_FORCE_BUILD=1) and runs docker run -e HERMES_UID=1000 --tui:

| Check | baseline (`origin/main`) | salvage |
| --- | --- | --- |
| esbuild EACCES on `ui-tui/dist/entry.js` | ✘ reproduced | ✓ absent |
| `TUI build failed.` | ✘ present | ✓ absent |
| Container reaches TTY-check (expected non-interactive exit) | ✘ crashes earlier | ✓ yes |
| Container exit code | 1 | 0 |

Bug reproduces cleanly on baseline and is gone on the salvage. Image smoke battery still passes 6/6, and the chown E2E from #19788 still passes 7/7.

Landed as commit 22eb4d1 on main.
#33045

…kworm-slim (#4977)

Debian trixie's bundled `nodejs` package is pinned to 20.19.2, which reached LTS EOL in April 2026. Trixie won't upgrade in place; Debian 14 (forky), where the apt nodejs is 24.x, isn't released until ~mid-2027.

To stay on a supported LTS without waiting for Debian 14, copy node + npm + corepack from the upstream `node:22-bookworm-slim` image as a multi-stage source, matching the existing `uv_source` and `gosu_source` patterns in the Dockerfile. Bookworm-based slim image is used so the produced binary links against glibc 2.36, which runs cleanly on Debian 13 (trixie, glibc 2.41).

Changes:
- Add `FROM node:22-bookworm-slim@sha256:... AS node_source` stage
- Remove `nodejs npm` from `apt-get install` (now sourced from node_source)
- Add `ca-certificates` explicitly to apt install (was a transitive of the apt nodejs package; removing nodejs broke the chain and curl inside the build failed with "error setting certificate file")
- COPY node binary + npm + corepack from node_source; recreate the symlinks at /usr/local/bin/{npm,npx,corepack}
- Update the npm_config_install_links=false comment block: npm 10's default is already `install-links=false`, but we keep the env as defense-in-depth against future Node-source-version regressions

Future bumps to Node 24/26 are a one-line ARG change.

Validation:
- Built --no-cache against current origin/main; build succeeds in 1m42s
- Image size: 3.27 GB (pre-salvage-1 baseline) → 3.14 GB (this PR); net 130 MiB savings (60 MiB from this change alone vs current main: removing apt nodejs+transitive deps that duplicated what node bundles)
- Node 22.22.3 / npm 10.9.8 / esbuild 0.27.7 all run cleanly under trixie's glibc 2.41
- Standard image smoke (6/6), Node-version E2E (8/8), chown E2E from #19788 (6/6), TUI UID-remap E2E from #28851 (4/4), 24 checks total

Co-authored-by: Prithvi Monangi <8312237+Prithvi1994@users.noreply.github.com>
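
The glibc reasoning above (a binary linked against 2.36 running on 2.41) relies on glibc keeping older symbol versions available in newer releases. A minimal sketch of that rule of thumb follows; the helper names are hypothetical, and comparing whole version numbers is a simplification, since a binary only needs the newest symbol version it actually references.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn '2.36' into (2, 36) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def runs_on(built_against: str, host_glibc: str) -> bool:
    """True when the host glibc is at least as new as the one linked against."""
    return parse_version(host_glibc) >= parse_version(built_against)


bookworm_binary_on_trixie = runs_on("2.36", "2.41")
trixie_binary_on_bookworm = runs_on("2.41", "2.36")
print(bookworm_binary_on_trixie)  # True
print(trixie_binary_on_bookworm)  # False
```

Building on the older base and running on the newer one is the safe direction; the reverse is not guaranteed to work.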

* fix(skills-hub): show every catalog source on /docs/skills (skills.sh, ClawHub, browse.sh, OpenAI, …) (#32336)

The Skills Hub page was stuck on a stale Feb 25 snapshot, showing only Built-in + Optional + Anthropic + LobeHub. The unified index already has 2078 skills from skills.sh / ClawHub / LobeHub / GitHub taps / Claude Marketplace, and BrowseShSource adds another ~330; none of it was reaching the page.

Changes:
- website/scripts/extract-skills.py: read website/static/api/skills-index.json (the unified multi-source catalog, rebuilt twice daily) as the canonical external source. Keep the legacy skills/index-cache/ fallback for offline builds. Add friendly per-source labels (skills.sh, ClawHub, browse.sh, OpenAI, HuggingFace, Anthropic, LobeHub, etc.) and per-entry installCmd.
- website/src/pages/skills/index.tsx: add source pills + ordering for the 11 new sources; render installCmd from the index entry.
- website/scripts/prebuild.mjs: when no local skills-index.json exists, fetch the live one from hermes-agent.nousresearch.com so local 'npm run build' matches production without burning GitHub API quota.
- scripts/build_skills_index.py: crawl BrowseShSource so browse.sh entries land in the unified index. Adjust source_order.
- tools/skills_hub.py: GitHubSource.DEFAULT_TAPS - openai/skills moved its skills into skills/.curated/ and skills/.system/, so add both as explicit taps (the listing code skips dotted dirs by design). Drop VoltAgent/awesome-agent-skills (README-only, no SKILL.md files) and MiniMax-AI/cli (singular skill, not a tap directory). Net effect: github source jumps from 83 → 143 skills, with OpenAI properly included.
- .github/workflows/deploy-site.yml: build the unified index BEFORE running extract-skills.py; previous order meant extract-skills always fell back to the legacy cache. Drop the 'skip if file exists' guard; the file is gitignored and must be rebuilt every deploy.
- .github/workflows/skills-index.yml: drop the broken 'deploy-with-index' job (it cp'd 'landingpage/*' which no longer exists, failing every cron run since the landingpage move). Replace it with a workflow_dispatch trigger of deploy-site.yml so the index refresh still reaches production on schedule.
- website/docs/user-guide/features/skills.md: drop VoltAgent from the default-taps doc list to match the code.

Before: 695 skills (Built-in 90, Optional 84, Anthropic 16, LobeHub 505).
After: 2168 skills across 9 source pills, including the 1212 skills.sh entries the user expected to see.

* fix(docker): propagate container env through s6 to cont-init and main CMD

s6-overlay's /init scrubs the environment before invoking both /etc/cont-init.d/* scripts and the container's CMD wrapper. As a result, ENV directives from the Dockerfile (HERMES_HOME=/opt/data, HERMES_WEB_DIST, …) and compose-time `environment:` entries (HERMES_UID, HERMES_GID) never reached the scripts that actually use them.

Three concrete failures observed on macOS Docker Desktop with `~/.hermes:/opt/data`:
  * stage2-hook.sh ran with HERMES_UID unset → no UID remap, hermes user stayed at UID 10000 instead of the host user's UID.
  * skills_sync.py (invoked from stage2-hook) ran with HERMES_HOME unset → get_hermes_home() fell back to Path.home()/.hermes, populating a shadow $HERMES_HOME/.hermes/skills tree on the mounted volume (visible on the host as ~/.hermes/.hermes/skills).
  * The main `hermes gateway run` process inherited HOME=/root from the /init context (s6-setuidgid doesn't update HOME), so libraries resolving XDG_STATE_HOME via $HOME tried to write to /root/.local/state/hermes/gateway-locks/ and failed with EACCES, preventing the Discord adapter from acquiring its bot-token lock.

Three surgical changes restore correct env flow:
1. The auto-generated /etc/cont-init.d/01-hermes-setup wrapper now uses `#!/command/with-contenv sh`, matching the pattern already used by docker/cont-init.d/02-reconcile-profiles. The container env (Dockerfile ENV + compose `environment:`) now reaches stage2-hook.sh and the skills_sync.py subprocess it spawns.
2. docker/main-wrapper.sh also switches to `#!/command/with-contenv sh`. The container CMD (`gateway run`, `chat`, `setup`, …) now sees HERMES_HOME and the other container-level env vars.
3. docker/main-wrapper.sh exports HOME=/opt/data before `s6-setuidgid hermes`. with-contenv populates HOME from the /init context (/root); s6-setuidgid drops privileges but does not update HOME. The hermes user's home per /etc/passwd is /opt/data, so the explicit override matches passwd.

No behavior change for the non-buggy paths: the s6-supervised services already used with-contenv, and HOME=/opt/data only affects processes that resolved $HOME-based paths to /root (silently broken).

* feat(skills-hub): health checks, freshness badge, and a watchdog cron (#32345)

Layered safety so the Skills Hub at /docs/skills stays in sync without silent rot. Three pieces:
1. build_skills_index.py: refuses to ship a degenerate index. EXPECTED_FLOORS per source (skills.sh ≥100, lobehub ≥100, clawhub ≥50, official ≥50, github ≥30, browse-sh ≥50) and MIN_TOTAL=1500. Any source collapsing to zero (the silent OpenAI breakage that hid for weeks) now fails the workflow loud: broken index never reaches the live site.
2. extract-skills.py + the React page: visible freshness signal. Sidecar website/src/data/skills-meta.json carries the index's generated_at timestamp, plus per-source counts. Skills Hub renders a 'Catalog refreshed N hours ago · auto-rebuilt twice daily' line under the hero copy. If the cron stalls, users see the staleness immediately.
3. .github/workflows/skills-index-freshness.yml: watchdog cron. Every 4 hours, fetches the live /docs/api/skills-index.json, validates shape, checks age (>26h is stale), checks the same per-source floors, and opens (or appends to) a GitHub issue when anything is off. The issue is title-prefixed [skills-index-watchdog] so subsequent failures append a comment instead of spamming new issues.

Net effect:
- A silent regression like 'OpenAI tap moved its skills' now fails the build instead of shipping a quietly broken catalog.
- A stuck cron (like the landingpage breakage that ran red for weeks) now files an issue within 4 hours.
- Users see how fresh the catalog is on the page itself.

Test plan:
- Local: built skills-meta.json from the live index → 'Catalog refreshed N minutes ago' rendered correctly in the static HTML.
- Probe logic dry-run against the live index: total=2456, all 6 sources above floor, age 0.1h, issues=NONE.
- Triggered skills-index.yml manually; both jobs green, deploy-site.yml dispatch fired.

* chore: add krislidimo to AUTHOR_MAP for PR #29775 (#32434)

* fix(telegram): tighten table row-group spacing and drop redundant first bullet

The GFM → Telegram-row-group rewriter previously joined every line in every row with a blank line ("\n\n".join(rendered_rows)), which made multi-column tables explode into one-bullet-per-paragraph walls on mobile. It also emitted the row heading twice when the table had no row-label column: once as the standalone bold heading and once again as the first labeled bullet (heading == headers[0] == data_cells[0]).

This commit:
  * Uses single newlines between the heading and its bullets within a row-group, and a blank line only BETWEEN row-groups.
  * Skips any bullet whose value duplicates the heading text when the table has no row-label column (the heading already carries that information).

Tables WITH a row-label column are unaffected since the heading comes from the label cell and never duplicates a header.

Updated existing test assertions accordingly and added two regression tests: one that reproduces the screenshot bug (wide five-column "Plays" comparison table) and one that pins the row-label-column behavior so the dedup logic doesn't accidentally swallow real data.

tests/gateway/test_telegram_format.py: 101 passed

* fix(subdirectory_hints): prevent loading AGENTS.md outside workspace

SubdirectoryHintTracker was scanning directories outside the active working directory, allowing files like ~/.codex/AGENTS.md or ~/.claude/CLAUDE.md to be loaded and injected into the agent context. This causes cross-agent context contamination and instruction mixup.

Add _is_ancestor_or_same() helper and a path boundary check in _is_valid_subdir(): only directories within the working directory tree (i.e. path.is_relative_to(working_dir)) are allowed.

Also add exist_ok=True to mkdir() calls in new tests to prevent pytest-xdist race conditions when workers share the same tmp_path parent.

Tests added:
- test_outside_working_dir_rejected: verifies sibling dirs are blocked
- test_outside_working_dir_absolute_path_rejected: verifies ~/.codex paths blocked
- test_inside_workspace_subdir_allowed: verifies normal subdir access unaffected
- test_sibling_repo_not_loaded_via_ancestor_walk: ancestor walk stays within workspace

* harden: restrict markdown link schemes; parse untrusted XML with defusedxml

Two small defensive-hardening changes:
- web/src/components/Markdown.tsx: render links only for http(s)/mailto schemes; other schemes (javascript:, data:, vbscript:) are dropped to plain text so a crafted link in rendered content can't execute on click.
- gateway/platforms/wecom_callback.py: parse the untrusted, pre-auth WeCom callback request body with defusedxml instead of xml.etree, blocking entity-expansion / billion-laughs (and XXE) on the parse path. defusedxml is already a dependency (uv.lock); response-building XML in wecom_crypto.py is unchanged (it is not parsed from untrusted input).

Verified: dashboard typechecks and builds; defusedxml blocks an entity-expansion payload while valid WeCom envelopes still parse.

* chore(wecom): make defusedxml dep acquireable and tolerant of absence

Follow-up on top of @TheOnlyMika's #32155 cherry-pick. The defusedxml hardening import was unconditional, which would break the gateway for anyone running a WeComCallback adapter without the (transitive-only) defusedxml present.

- Wrap the import in the same try/except pattern as aiohttp/httpx in the same file. Sets DEFUSEDXML_AVAILABLE flag.
- Extend check_wecom_callback_requirements() to gate on the flag, so the gateway logs the actual missing dep and skips the adapter instead of crashing.
- Add [wecom] extra to pyproject.toml with defusedxml==0.7.1.
- Register platform.wecom_callback in tools/lazy_deps.py so users get prompted to install it on first WeComCallback configuration, same pattern as discord/slack/matrix.

defusedxml is still the right call for pre-auth XML parsing; this commit just makes the dep declarative and recoverable instead of a hard import-time crash.

* fix(cli): restore fallback paste collapse + handle long single-line pastes (#32447)

Follow-up to #32087 after community report from @ethernet that 8000-char single-line pastes get dumped raw into the input box.

A) Fallback regression revert
   paste_collapse_threshold_fallback default: 0 -> 5
   #32087 disabled the fallback handler by default. The fallback path has been always-on with line_count >= 5 since #3065 (March 2026); the previous shape was the salvaged contributor's design and didn't match pre-existing behavior for terminals without bracketed paste support (Windows terminals, some SSH setups). Restoring the original on-by-default.

B) Long single-line paste guard
   New config key: paste_collapse_char_threshold (default 2000)
   Bracketed-paste handler and fallback handler now BOTH collapse when line count >= line threshold OR total char length >= char threshold. Catches the case ethernet hit: ~8000 chars of minified JSON / log output on a single line dumped raw into the buffer. TUI mirrors the same config via uiStore.pasteCollapseChars. Set 0 to disable.

Defaults verified:
    paste_collapse_threshold: 5
    paste_collapse_threshold_fallback: 5
    paste_collapse_char_threshold: 2000

Tests:
    tests/hermes_cli/test_config.py: 87/87 pass
    ui-tui useConfigSync.test.ts: 34/34 pass
    ui-tui useComposerState.test.ts: 9/9 pass
    tsc: 0 new errors in touched files

* feat(mcp): Nous-approved MCP catalog with interactive picker (#30870)

* feat(mcp): Nous-approved MCP catalog with interactive picker

Adds an optional-mcps/ directory mirroring optional-skills/: curated, Nous-approved MCP servers shipped with the repo but disabled by default. Presence in optional-mcps/ = approval. No community tier, no trust signals. Entries are added by merging a PR.

New surface:
    hermes mcp                  Interactive catalog picker (default)
    hermes mcp catalog          Plain-text list, scriptable
    hermes mcp install <name>   Install a catalog entry

Picker behavior:
    not installed -> install (clone/bootstrap if needed, prompt for creds)
    installed/off -> enable
    installed/on  -> menu (disable / uninstall / reinstall)

Manifest schema (manifest_version: 1) supports:
- transport: stdio (command/args, ${INSTALL_DIR} substitution) or http (url)
- install: optional git clone + bootstrap commands (for repos that need local venv setup, like the n8n bridge); omit for npx/uvx servers
- auth: api_key (prompts -> ~/.hermes/.env), oauth (provider-mediated or native MCP), or none

Catalog entries are never auto-updated. Users re-run `hermes mcp install` to refresh. Credentials always go to ~/.hermes/.env (the .env-is-for-secrets rule), never to per-server env blocks.

Ships n8n as the reference manifest (https://github.com/CyberSamuraiX/hermes-n8n-mcp).

Tests: 19 catalog tests + E2E install/uninstall round-trip via the shipped manifest.
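
The picker behavior listed in the catalog commit above maps an entry's state to one action. A minimal sketch of that mapping, assuming two booleans describe the state; the function name `picker_action` and the returned strings are hypothetical and are not taken from mcp_picker.py:

```python
def picker_action(installed: bool, enabled: bool) -> str:
    """Map a catalog entry's state to the picker action described in the commit."""
    if not installed:
        return "install"  # clone/bootstrap if needed, prompt for creds
    if not enabled:
        return "enable"   # installed/off
    return "menu"         # installed/on: disable / uninstall / reinstall


for state in [(False, False), (True, False), (True, True)]:
    print(state, "->", picker_action(*state))
```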
* feat(mcp): tool-selection checklist + Linear catalog entry

Adds install-time tool selection so users only enable the MCP tools they actually want, and ships Linear as a second reference catalog entry to demonstrate the http+oauth path alongside n8n's stdio+api_key+git-bootstrap.

Tool selection flow:
    install (clone/auth/credentials)
    -> probe server for available tools
    -> curses checklist with pre-checked rows
    -> write mcp_servers.<name>.tools.include

Pre-check priority:
1. user's prior tools.include (reinstall preserves selection)
2. manifest's tools.default_enabled (curated subset)
3. all probed tools (default)

Probe-failure fallback (server unreachable, OAuth not yet complete, backing service offline):
- manifest declared default_enabled -> applied directly
- no default declared -> no filter written (all-on when reachable)
- both cases point user at hermes mcp configure <name>

Manifest schema additions:
    tools:
      default_enabled: [list, of, tool, names]  # optional

Updates:
- optional-mcps/linear/manifest.yaml -- new reference entry (http+oauth)
- optional-mcps/n8n/manifest.yaml -- tools.default_enabled set to the 8 read-mostly tools; mutating tools (activate/deactivate, container_logs) pruned by default
- docs: new 'Tool selection at install time' section in features/mcp.md

Tests: 7 new tests in TestToolSelection covering probe-success / probe-fail matrix, manifest-default filtering, reinstall-preserves-selection, and invalid-default-enabled rejection. 26 catalog tests + 32 existing mcp_config tests passing.

* feat(mcp): polish - picker unification, include-mode convergence, hardening

Addresses review findings on PR #30870. Lands all improvements that belong in this PR before merge; defers separate cleanup (consolidating two probe implementations, change-detector tests) to follow-ups.
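
The pre-check priority from the tool-selection commit above (prior tools.include, then the manifest's tools.default_enabled, then all probed tools) can be sketched as a small function. The name `prechecked_tools`, its parameters, and the tool names are hypothetical. Restricting the result to tools the probe reported is an assumption, and it may differ from the real behavior, since the commit mentions rejecting invalid default_enabled values.

```python
from typing import Optional


def prechecked_tools(
    probed: list[str],
    prior_include: Optional[list[str]] = None,
    manifest_default: Optional[list[str]] = None,
) -> list[str]:
    """Choose which probed tools start checked, using the first non-empty source."""
    for candidate in (prior_include, manifest_default):
        if candidate:
            wanted = set(candidate)
            # Assumption: names the server did not report are dropped.
            return [tool for tool in probed if tool in wanted]
    # No prior selection and no manifest default: everything starts checked.
    return list(probed)


probed_tools = ["tool_a", "tool_b", "tool_c"]  # hypothetical tool names
print(prechecked_tools(probed_tools, ["tool_b"], ["tool_a"]))      # ['tool_b']
print(prechecked_tools(probed_tools, None, ["tool_a", "tool_c"]))  # ['tool_a', 'tool_c']
print(prechecked_tools(probed_tools))                              # ['tool_a', 'tool_b', 'tool_c']
```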
Picker UX (mcp_picker.py)
- Unifies catalog + custom (user-added) MCPs in one view with distinct status badges (available / enabled / installed (disabled) / custom — enabled / custom — disabled)
- Adds 'Configure tools (probe server + re-pick)' action to both the catalog-installed and custom-row submenus; the existing hermes mcp configure flow was previously unreachable from the picker
- Loops until ESC/q so the user can manage several entries in one session instead of having to re-launch
- Uninstall message now mentions .env credentials are preserved with a pointer to clean them up manually if no longer needed
- Surfaces a 'requires a newer Hermes' warning per future-manifest entry instead of silently hiding it

Catalog (mcp_catalog.py)
- catalog_diagnostics() exposes which manifests were skipped and why (future_manifest vs invalid) so UIs can give actionable feedback
- _do_git_install detects SHA-shaped refs (regex /[0-9a-f]{7,40}/) and skips the doomed 'git clone --branch <sha>' attempt; clone --branch only accepts branches/tags, so SHAs always failed noisily before falling back to the full-clone path
- Probe-success all-tools-enabled message now mentions that new tools the server adds later will be auto-enabled (no-filter mode)

Convergence (tools_config.py)
- _configure_mcp_tools_interactive now writes tools.include (whitelist) instead of tools.exclude (blacklist), matching the catalog flow and hermes mcp configure. The on-disk config shape no longer depends on which UI the user touched last
- Two existing tests updated to assert the new include-mode contract

Discoverability
- Setup wizard final step now prints 'Browse curated MCPs: hermes mcp'
- Three tip-corpus entries pointing at the new catalog
- Docs updated with: trust model (manifests run code locally, gated by PR review, but read before installing), runtime ${ENV_VAR} substitution semantics, and the manifest_version forward-compat behavior

Tests
- 7 new tests covering future-manifest diagnostics, custom MCP picker rows, SHA-ref git-install path, branch-ref git-install path, and the tools_config include-mode write contract
- 80 MCP-related tests passing across test_mcp_catalog.py, test_mcp_config.py, test_mcp_tools_config.py

* fix(mcp): drop setup-wizard catalog hint to satisfy supply-chain scanner

The wizard line 'Browse curated MCPs: hermes mcp' triggered the CI supply-chain scanner because it pattern-matches on edits to any file named hermes_cli/setup.py; that filename matches the Python 'install-hook file' heuristic even though this setup.py is the user-facing 'hermes setup' wizard, not a packaging install hook.

The catalog is already surfaced via three tip-corpus entries in hermes_cli/tips.py (which the scanner doesn't flag), so dropping the wizard mention loses no discoverability. Worth revisiting after a scanner allowlist for this specific file lands.

* chore(models): swap qwen3.6-plus → qwen3.7-max in openrouter+nous lists (#32809)

Updates curated picker lists for both the OpenRouter fallback snapshot (`OPENROUTER_MODELS`) and the Nous Portal list (`_PROVIDER_MODELS['nous']`). Regenerates website/static/api/model-catalog.json via `scripts/build_model_catalog.py` to keep the docs-hosted manifest in sync (drift guard in `test_in_repo_lists_match_manifest`).

tests/hermes_cli/test_models.py fixtures updated; they pinned the old model id as their live-fetch sample.
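
The SHA-shaped ref detection mentioned in the polish commit above (regex `/[0-9a-f]{7,40}/`) might look roughly like this. It is a sketch, not the code in `_do_git_install`: matching the whole ref with `fullmatch` is an assumption, and `looks_like_sha` is a hypothetical name.

```python
import re

# Pattern quoted in the commit message; anchoring to the whole ref is an assumption.
SHA_REF = re.compile(r"[0-9a-f]{7,40}")


def looks_like_sha(ref: str) -> bool:
    """True when ref consists only of 7 to 40 lowercase hex characters."""
    return SHA_REF.fullmatch(ref) is not None


for ref in ["22eb4d1", "main", "v1.2.0", "deadbeef"]:
    print(ref, looks_like_sha(ref))
```

A branch or tag that happens to be all hex (such as `deadbeef`) would also match, so a check like this is a heuristic rather than a guarantee.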
* fix(cron): clarify schedule is required for create in tool schema

Grok models (and other LLMs) sometimes omit the schedule parameter when calling the cronjob tool with action=create because the schema only listed 'action' in required[] and the schedule description did not explicitly state it was mandatory (issue #32427).

Fix: update schema descriptions to clearly state schedule is REQUIRED for action=create, making this explicit for models that rely on description text for parameter compliance.

Fixes #32427

* test(cron): guard schedule-required description text on CRONJOB_SCHEMA

* fix(gateway): refresh cached agent tools on /reload-mcp

When the gateway processes /reload-mcp, it reconnects MCP servers and updates the global _servers registry, but cached AIAgent instances in _agent_cache keep the tools list they were built with. The user had to also run /new (discarding conversation history) before the agent could see the new tools, even though /reload-mcp had succeeded.

This patch refreshes each cached agent's .tools and .valid_tool_names in _execute_mcp_reload after discovery returns, so existing sessions pick up new MCP tools on their next turn. The slash-confirm gate in _handle_reload_mcp_command already obtains user consent for the implied prompt-cache invalidation before this code runs.

Mirrors the equivalent behaviour the CLI already does in cli.py _reload_mcp. Per-agent enabled_toolsets and disabled_toolsets are preserved so an agent that was scoped to a subset of toolsets does not silently gain disabled tools after the reload.

Original diagnosis + initial implementation in #23812 from @fujinice. The auto-reload watcher half of that PR is intentionally dropped; users want /reload-mcp to remain explicit.

Co-authored-by: fujinice <45688690+fujinice@users.noreply.github.com>

* docs(auth): replace stale 'hermes login' references with 'hermes auth add'

'hermes login' was removed (the command now just prints a deprecation message and exits). The bundled hermes-agent SKILL.md, in-code error messages, the tip rotation, the proxy adapters, and the docs site still pointed agents and users at the dead command, so models loading the skill kept running 'hermes login --provider openai-codex' and getting a dead-end print.

Replacements use the canonical 'hermes auth add <provider>' surface (or bare 'hermes auth' for the interactive manager).

Files:
- skills/autonomous-ai-agents/hermes-agent/SKILL.md (+ regenerated docs page)
- hermes_cli/tips.py (tip rotation)
- agent/google_oauth.py (gemini-cli error message)
- agent/conversation_loop.py (nous re-auth troubleshooting line)
- agent/credential_sources.py (docstring)
- hermes_cli/proxy/cli.py + hermes_cli/proxy/adapters/nous_portal.py (proxy auth hints)
- tests/hermes_cli/test_proxy.py (updated assertions)
- website/docs/reference/faq.md, website/docs/user-guide/features/subscription-proxy.md
- zh-Hans i18n mirrors for the above

'hermes logout' is still a live command and is left untouched. The 'hermes login' stub in hermes_cli/auth.py:login_command() and the cli-commands.md 'Deprecated' rows are intentionally kept as the discoverable deprecation surface.

* fix(agent): recover Codex streams with null output

* chore(release): map carltonawong noreply to GitHub login

Added AUTHOR_MAP entry for the cherry-picked fix in the preceding commit so the release contributor audit can resolve Carlton's noreply email.

* chore(release): map wangpuv contributor email for #32933 (#33005)

Pre-stages the AUTHOR_MAP entry so the contributor-check workflow passes when Will Falcon's image-gen SSE fix lands.

* fix: parse Codex image generation SSE directly

* feat(opencode-go): route qwen3.7-max via anthropic_messages

qwen3.7-max on OpenCode Go rejects the OpenAI-compatible (oa-compat) format with HTTP 401 but works correctly via the Anthropic Messages endpoint (/v1/messages with x-api-key auth). Route it the same way MiniMax models are routed: anthropic_messages api_mode.

Changes:
- hermes_cli/models.py: add qwen3.7-max routing + curated list
- hermes_cli/setup.py: add to setup wizard model list
- hermes_cli/auth.py: update provider comment
- tests: add assertions for qwen3.7-max api_mode routing

* feat: add TUI session orchestrator

Add a first-class active-session orchestrator for the Ink TUI:
- list, activate, close, and launch live process-local TUI sessions
- hydrate committed and in-flight output when switching sessions
- dispatch a new prompt session from the +new row with session-scoped model picks
- expose a clickable live-session count in the status chrome
- preserve stable row order while initially focusing the current session
- support mouse hit-testing for floating orchestrator overlays
- add backend and frontend regression coverage for the lifecycle and UI helpers

* chore(release): map ticketclosed-wontfix noreply to GitHub login

* refactor(docker): drop build-essential from apt install (#27507)

build-essential is a Debian metapackage (libc6-dev + gcc + g++ + make + dpkg-dev). The Dockerfile already installs gcc + python3-dev + libffi-dev explicitly, which covers the C-ext compile cases lazy_deps may hit at first boot. g++/make/dpkg-dev aren't reached by the resolved [all]+[messaging] tree on current main (verified via uv sync --dry-run on cp313-linux).

Co-authored-by: Monty Taylor <mordred@inaugust.com>

* fix(codex-responses): gracefully recover from invalid_encrypted_content (salvage #10144) (#33035)

* fix(codex-responses): gracefully recover from invalid_encrypted_content (salvage #10144)

When an OpenAI-compatible Responses API surface accepts an initial request but later rejects the replayed `codex_reasoning_items` encrypted blob with HTTP 400 `invalid_encrypted_content`, the session previously got stuck retrying the same poisoned payload. Recovery: classify the error as a dedicated FailoverReason, and on the first hit disable encrypted reasoning replay for the rest of the session, strip cached items from message history, and retry once.

Changes:
  * error_classifier: add FailoverReason.invalid_encrypted_content branch in _classify_400 (before context_overflow so the messages that mention 'encrypted content … could not be verified' don't trip context heuristics), in _classify_by_error_code, and extend _extract_error_code to peek inside wrapped JSON in error.message and ignore the bare '400' as a code.
  * agent_init: initialize `_codex_reasoning_replay_enabled = True` on every agent.
  * run_agent: add AIAgent._disable_codex_reasoning_replay() helper that flips the flag and pops cached items.
  * codex_responses_adapter: thread a `replay_encrypted_reasoning` kwarg through _chat_messages_to_responses_input so that when the flag is False we don't replay codex_reasoning_items.
  * transports/codex.py: read `replay_encrypted_reasoning` from params, thread it into the adapter, and gate the `include=['reasoning.encrypted_content']` request hint on it.
  * chat_completion_helpers: pass the agent's replay flag through to the transport.
  * conversation_loop: in the retry loop, add an invalid_encrypted_content recovery branch that fires once per session, only when api_mode == codex_responses, only when replay is still enabled, and only when at least one assistant message in history actually carries cached reasoning items (otherwise the 400 has nothing to do with our cache and the normal retry path handles it).

Tests:
  * test_error_classifier: new wrapped-JSON _extract_error_code case; new TestClassifyApiError cases proving the 400 is retryable with no fallback, that the broad message match doesn't catch a generic 'parsed' message, and that the error code match is case-insensitive.
  * test_run_agent_codex_responses: end-to-end test of the recovery branch firing once and disabling replay, plus a sibling test that proves the branch does *not* fire (and the flag stays True) when history has no cached reasoning items.

Salvages PR #10144 onto the post-refactor module layout (error_classifier / codex_responses_adapter / transports/codex / conversation_loop / agent_init) since the original diff was written against the pre-refactor monolithic run_agent.py.

* chore(release): map victorGPT in AUTHOR_MAP for #10144 salvage

---------

Co-authored-by: victorGPT <wuxuebin1993@gmail.com>

* fix(docker): targeted chown to preserve host file ownership in HERMES_HOME (#19795)

Replaces the recursive chown of $HERMES_HOME in stage2-hook.sh with a targeted approach: chown the top-level dir (so hermes can create new subdirs) plus the specific hermes-owned subdirectories (cron/, sessions/, logs/, hooks/, memories/, skills/, skins/, plans/, workspace/, home/, profiles/), the same canonical list seeded by the s6-setuidgid mkdir -p block below. Avoids clobbering host-side file ownership when $HERMES_HOME is a bind mount that contains user-owned files not managed by hermes (issue #19788).

Original fix targeted docker/entrypoint.sh which is now a deprecated shim; retargeted to docker/stage2-hook.sh where the recursive chown moved during the s6-overlay rework.

Co-authored-by: Ptichalouf <1809721+ptichalouf@users.noreply.github.com>

* fix(docker): chown ui-tui and node_modules on UID remap so TUI esbuild works (#28851)

When HERMES_UID remaps the hermes user from 10000 to another UID (e.g. matching the host user's UID for bind-mount ergonomics), the TUI launcher's esbuild step fails:

✘ [ERROR] Failed to write to output file: open /opt/hermes/ui-tui/dist/entry.js: permission denied
TUI build failed.

This is because the Dockerfile's build-time `chown -R hermes:hermes` on `/opt/hermes/{.venv,ui-tui,node_modules}` (line 154) wrote UID 10000, and stage2-hook.sh only re-chowned `.venv` on UID remap, leaving the TUI build trees still owned by the old UID.

Extend the stage2 re-chown to include the same set as the build-time chown: `.venv`, `ui-tui`, `node_modules`. These are the runtime-writable trees under $INSTALL_DIR; everything else under /opt/hermes is read-only at runtime so keeping it root-owned is fine.

Original fix targeted docker/entrypoint.sh which is now a deprecated shim; retargeted to docker/stage2-hook.sh where the .venv chown moved during the s6-overlay rework.

Co-authored-by: Andreas Steffan <623481+deas@users.noreply.github.com>

* feat(docker): upgrade Node to 22 LTS via multi-stage from node:22-bookworm-slim (#4977)

Debian trixie's bundled `nodejs` package is pinned to 20.19.2, which reached LTS EOL in April 2026. Trixie won't upgrade in place; Debian 14 (forky), where the apt nodejs is 24.x, isn't released until ~mid-2027.

To stay on a supported LTS without waiting for Debian 14, copy node + npm + corepack from the upstream `node:22-bookworm-slim` image as a multi-stage source, matching the existing `uv_source` and `gosu_source` patterns in the Dockerfile. Bookworm-based slim image is used so the produced binary links against glibc 2.36, which runs cleanly on Debian 13 (trixie, glibc 2.41).

Changes:
- Add `FROM node:22-bookworm-slim@sha256:...
AS node_source` stage - Remove `nodejs npm` from `apt-get install` (now sourced from node_source) - Add `ca-certificates` explicitly to apt install (was a transitive of the apt nodejs package; removing nodejs broke the chain and curl inside the build failed with "error setting certificate file") - COPY node binary + npm + corepack from node_source; recreate the symlinks at /usr/local/bin/{npm,npx,corepack} - Update the npm_config_install_links=false comment block — npm 10's default is already `install-links=false`, but we keep the env as defense-in-depth against future Node-source-version regressions Future bumps to Node 24/26 are a one-line ARG change. Validation: - Built --no-cache against current origin/main; build succeeds in 1m42s - Image size: 3.27 GB (pre-salvage-1 baseline) → 3.14 GB (this PR); net 130 MiB savings (60 MiB from this change alone vs current main — removing apt nodejs+transitive deps that duplicated what node bundles) - Node 22.22.3 / npm 10.9.8 / esbuild 0.27.7 all run cleanly under trixie's glibc 2.41 - Standard image smoke (6/6), Node-version E2E (8/8), chown E2E from #19788 (6/6), TUI UID-remap E2E from #28851 (4/4) — 24 checks total Co-authored-by: Prithvi Monangi <8312237+Prithvi1994@users.noreply.github.com> * ci(docker): add shellcheck shell=sh directive to main-wrapper.sh shellcheck doesn't recognize the s6-overlay `#!/command/with-contenv sh` shebang and aborts with SC1008 ("This shebang was unrecognized. ShellCheck only supports sh/bash/dash/ksh/'busybox sh'. Add a 'shell' directive to specify."). The error fires at --severity=error too, so it fails the "Docker / shell lint" CI job on every PR that touches docker/. Add the canonical `# shellcheck shell=sh` directive — same fix already applied to the sibling cont-init.d scripts (`02-reconcile-profiles` and `015-supervise-perms`) when they adopted the with-contenv shebang. 
The shebang was changed from `#!/bin/sh` → `#!/command/with-contenv sh` in PR #32412 (commit 29c71e9) to fix env-propagation through s6's PID 1. The shellcheck-directive line was missed in that PR; this patches it.

Reproduces locally:

  docker run --rm -v "$PWD:/mnt" -w /mnt koalaman/shellcheck:stable \
    --severity=error --format=gcc docker/main-wrapper.sh

  Before: docker/main-wrapper.sh:1:1: error: [SC1008] (rc=1)
  After: (no output) (rc=0)

Script behavior is unchanged: the directive is a comment, and `sh -n` / `bash -n` parse the file cleanly either way.

* fix(docker): mkdir HERMES_HOME as root in stage2 before chown / privilege drop (#18488)

When HERMES_HOME points at a custom path whose parent directories only root can create (e.g. HERMES_HOME=/home/hermes/.hermes in a Compose file, or any path under a fresh / not pre-populated by the image), stage2-hook.sh fails on first boot:

  [stage2] Warning: chown failed (rootless container?) - continuing
  mkdir: cannot create directory '/custom': Permission denied
  mkdir: cannot create directory '/custom': Permission denied
  ... (one per s6-setuidgid hermes mkdir invocation)
  cont-init: info: /etc/cont-init.d/01-hermes-setup exited 1

The mkdirs fail because s6-setuidgid drops to hermes (UID 10000) before invoking mkdir -p, and the runtime user has no permission to create root-owned ancestor directories. 02-reconcile-profiles then crashes with FileNotFoundError, .install_method never lands, and the container limps on in a half-initialized state.

Bootstrap HERMES_HOME with mkdir -p while still root, before the ownership normalization. Idempotent on the default /opt/data path (directory already exists from the Dockerfile RUN mkdir -p) and on any subsequent restart. (#18482)

Retargeted from the original PR's docker/entrypoint.sh (now a deprecated shim) to docker/stage2-hook.sh where the related chown logic moved during the s6-overlay rework.
Co-authored-by: wpengpeng168 <133926080+wpengpeng168@users.noreply.github.com> * refactor(codex): drop SDK responses.stream() helper; consume events directly (#33042) * refactor(codex): drop SDK responses.stream() helper; consume events directly The OpenAI Python SDK's high-level `client.responses.stream(...)` helper does post-hoc typed reconstruction from the terminal `response.completed.response.output` field. The chatgpt.com Codex backend has been observed (today, gpt-5.5) to ship `response.output = null` on terminal frames, which crashes the SDK with `TypeError: 'NoneType' object is not iterable` mid-iteration. Carlton's #32963 patched the symptom by wrapping the helper in try/except and recovering from the same per-event accumulator the SDK was supposed to populate. This PR removes the helper from the call path entirely: we now use `client.responses.create(stream=True)` (raw AsyncIterable of SSE events) and assemble the final response object ourselves from `response.output_item.done` events as they arrive. The terminal event's `output` field is never read for content. Same strategy OpenClaw uses for the same backend. This makes Hermes structurally immune to the bug class, not patched. The next time OpenAI ships a shape change to chatgpt.com's terminal frame, our consumer keeps working because it doesn't read that frame for content — only for usage/status/id. Changes - `agent/codex_runtime.py`: new `_consume_codex_event_stream()` shared consumer; `run_codex_stream()` uses `responses.create(stream=True)`; `run_codex_create_stream_fallback()` collapses into a thin alias since the primary path now does what the fallback used to do. - `agent/auxiliary_client.py`: `_CodexCompletionsAdapter` uses the same consumer; old null-output recovery helpers deleted as unreferenced. - Tests migrated: fixtures that mocked `responses.stream` now mock `responses.create` returning a raw iterable. 
New regression test asserts the auxiliary path returns streamed items even when the terminal event's `output` is literally `null`. Validation - Live: tested against fresh OAuth on `chatgpt.com/backend-api/codex` with `gpt-5.5` — response built correctly with `response.output=null` on the terminal frame, all events consumed, usage/reasoning tokens propagated. - `tests/run_agent/test_run_agent_codex_responses.py` + `tests/agent/test_auxiliary_client.py`: 242 passed. * test+fix(codex): migrate streaming tests, raise on truncated streams CI surfaced 10 test failures across tests/run_agent/test_streaming.py and tests/run_agent/test_codex_xai_oauth_recovery.py — both files had their own `responses.stream(...)` mocks I missed in the first sweep. agent/codex_runtime.py: _consume_codex_event_stream() now raises "Codex Responses stream did not emit a terminal response" when the stream ends without any terminal frame AND no usable content. This preserves the signal callers used to get from the SDK's high-level helper, which they distinguished from "completed with empty body" in error handling. Tests migrated: - test_streaming.py: text-delta callback, activity-touch, and remote-protocol-error tests all switch from mocking responses.stream to responses.create returning an iterable of events. - test_codex_xai_oauth_recovery.py: prelude-error tests are recast as wire-error-event tests (the new path raises _StreamErrorEvent directly when the wire emits type=error, which is strictly better than the old two-phase "SDK RuntimeError → retry → fallback"). The retry-on-transport-error test moves from responses.stream side-effect to responses.create side-effect. Verified live against chatgpt.com Codex with gpt-5.5 — AIAgent.chat() through the full codex_responses path returns correctly, 319/319 targeted tests passing. 
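
The event-driven assembly described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in `agent/codex_runtime.py`: the function name `consume_event_stream`, the `SimpleNamespace` return shape, and the assumption that `response.completed` is the only terminal event type are all hypothetical. The idea it shows is that output items are collected from `response.output_item.done` events as they arrive, and the terminal frame is read only for id, status, and usage.

```python
from types import SimpleNamespace


def consume_event_stream(events):
    """Assemble a response from raw stream events (hypothetical sketch).

    Content comes only from `response.output_item.done` events; the
    terminal frame's `output` field is never read.
    """
    items = []
    terminal = None
    for event in events:
        event_type = getattr(event, "type", None)
        if event_type == "response.output_item.done":
            items.append(event.item)
        elif event_type == "response.completed":
            # Keep the terminal frame for metadata only.
            terminal = event.response

    # Truncated stream: no terminal frame and nothing usable collected.
    if terminal is None and not items:
        raise RuntimeError(
            "Codex Responses stream did not emit a terminal response"
        )

    return SimpleNamespace(
        id=getattr(terminal, "id", None),
        status=getattr(terminal, "status", None),
        usage=getattr(terminal, "usage", None),
        output=items,
    )
```

Because `terminal.output` is never touched, a terminal frame carrying `output = null` has no effect on the assembled result.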
* remove Vercel AI Gateway and Vercel Sandbox (#33067) * remove Vercel AI Gateway provider and Vercel Sandbox terminal backend Both Vercel-hosted integrations are removed end-to-end. Users on the AI Gateway should switch to OpenRouter or one of the other aggregators (Nous Portal, Kilo Code). Users on the Vercel Sandbox backend should switch to Docker, Modal, Daytona, or SSH. What's removed: - `plugins/model-providers/ai-gateway/` provider plugin - `hermes_cli/vercel_auth.py` Vercel-Sandbox auth helper - `tools/environments/vercel_sandbox.py` terminal backend - `ai-gateway` provider wiring across auth, doctor, setup, models, config, status, providers, main, web_server, model_normalize, dump - `vercel_sandbox` backend wiring across terminal_tool, file_tools, code_execution_tool, file_operations, approval, skills_tool, environments/local, credential_files, lazy_deps, prompt_builder, cli, gateway/run - `AI_GATEWAY_BASE_URL` constant, `_AI_GATEWAY_HEADERS` auxiliary-client header set, run_agent base-URL header/reasoning special-cases - `[vercel]` pyproject extra and `vercel`/`vercel-workers` from uv.lock - env vars: `AI_GATEWAY_API_KEY`, `AI_GATEWAY_BASE_URL`, `VERCEL_TOKEN`, `VERCEL_PROJECT_ID`, `VERCEL_TEAM_ID`, `VERCEL_OIDC_TOKEN`, `TERMINAL_VERCEL_RUNTIME` - Tests: deletes test_ai_gateway_models.py and test_vercel_sandbox_environment.py; scrubs references across 23 surviving test files (no entire tests deleted unless they were dedicated to AI Gateway / Sandbox) - Docs: provider tables, env-var reference, setup guides, security notes, tool config, terminal-backend tables — English plus zh-Hans i18n parity - `hermes-agent` skill: provider table entry and remote-backend list What stays (intentional): - `popular-web-designs/templates/vercel.md` — CSS design reference, unrelated to Vercel-the-AI-product - `x-vercel-id` in `stream_diag.py` headers — generic Vercel CDN response header, useful diag signal on any Vercel-hosted endpoint - `vercel-labs/agent-browser` URL in 
browser config — lightpanda browser project, different OSS effort - `userStories.json` historical contributor entry mentioning Vercel Sandbox — archive, not active docs Validation: - 1153 tests in the 22 targeted files pass (`scripts/run_tests.sh`) - Full repo `py_compile` clean - Live import of every touched module + invariant check (no `ai-gateway` in `PROVIDER_REGISTRY`, no `_AI_GATEWAY_HEADERS`, no `vercel_sandbox` in `_REMOTE_TERMINAL_BACKENDS`) * test: convert profile-count check from change-detector to invariant The hardcoded "== 34" assertion broke when ai-gateway was removed. Per AGENTS.md change-detector-test guidance, assert the relationship (registry count >= number of plugin dirs) instead of a literal count. Counts shift when providers are added/removed; that's expected. * feat(api-server): add GET /v1/skills and /v1/toolsets (#33016) Lets external clients enumerate the agent's skills and resolved toolsets deterministically over the OpenAI-compatible API server, without standing up the dashboard web server or sending a chat message and asking the model to list them. - GET /v1/skills — list installed skills (name, description, category) - GET /v1/toolsets — list toolsets resolved for the api_server platform, with enabled/configured state and the concrete tool names each expands to - Both gated by API_SERVER_KEY (same Bearer scheme as every other /v1/* endpoint) - /v1/capabilities advertises both new endpoints Closes the gap a community user just hit asking how to list skills over REST when only the OpenAI-compatible server is running. 
Test plan
- python -m pytest tests/gateway/test_api_server.py -k "Skills or Toolsets or Capabilities" -o 'addopts=' -q → 9/9 pass
- python -m pytest tests/gateway/test_api_server.py -o 'addopts=' -q → 156/156 pass, no regressions
- E2E: started a real adapter on an isolated HERMES_HOME with a fake skill installed; curl-equivalent calls to /v1/capabilities, /v1/skills, /v1/toolsets returned the expected JSON; with API_SERVER_KEY configured, unauthenticated calls returned 401.

* feat(nix): add #messaging and #full package variants (#33108)

* fix(plugins/discord): correct install_hint extra to [messaging]

The Discord platform registered install_hint pointing at 'hermes-agent[discord]', but pyproject.toml has no [discord] extra; the deps live in [messaging] alongside Telegram and Slack. Users hitting "Platform 'Discord' requirements not met" were directed at a pip command that installs nothing.

* feat(nix): add #messaging and #full package variants

Make Discord/Telegram/Slack work out of the box for `nix profile install` users. Messaging deps were dropped from [all] on 2026-05-12 in favor of lazy-install, but lazy-install can't write to the read-only /nix/store, so users hit "No adapter available for discord" with no actionable guidance.

- #messaging: pre-built with discord.py/telegram/slack (+33 MB venv)
- #full: all 18 platform-portable extras + matrix on Linux only (python-olm lacks Darwin PyPI wheels) (+738 MB venv)

Also adds a `messaging-variant` flake check that verifies `import discord` succeeds in the sealed venv, as a regression guard for the lazy-install migration.

Docs updated: Quick Start callout, extraDependencyGroups rewrite with messaging as primary example + full extras table, troubleshooting row, cheatsheet row.
Closure size deltas (measured x86_64-linux):

  default    1792 MB pkg /  512 MB venv
  messaging  1826 MB pkg /  546 MB venv (+33 MB)
  full       2530 MB pkg / 1250 MB venv (+738 MB)

* chore(nix): trim variant comments + alphabetize full extras

Drop the date-stamped changelog from messaging-variant's comment and the "+33 MB / +704 MB" numbers from the variant defs; those drift and belong in the PR description, not source. Alphabetize the 18-extra list in #full so future additions produce clean one-line diffs.

No semantic change. messaging-variant check still passes.

* fix(codex): update silent-hang workaround hint

* chore(release): map EvilHumphrey noreply for #33034 salvage

* feat: add API server session controls

* Support media in session chat API

* chore(api-server): mark skills_api capability True now that /v1/skills shipped

#33016 added GET /v1/skills + /v1/toolsets on the API server; the capability flag introduced in this branch was placeholder-False. Flip to True so capability probers see the truth.

* feat(catalog): add qwen3.7-max to alibaba + alibaba-coding-plan model lists

Alibaba's latest flagship Qwen model is released but not yet present in the DashScope (alibaba) or Alibaba Coding Plan curated catalogs. Add it so it shows up in the /model picker and setup wizard for those providers.

OpenCode Go routing for qwen3.7-max already landed via #32780 (commit 2fc77c53f). OpenRouter + Nous catalog entries already landed via #32809 (commit ccd3d04fc). This salvage picks up the remaining alibaba / alibaba-coding-plan entries from #32806; the AI Gateway entry is dropped because Vercel AI Gateway was removed in #33067.
* test(codex): cover null output stream terminal events * chore(release): map superearn-fisher noreply for #33122 salvage * plugins: add security-guidance — pattern-matched warnings on dangerous code writes (#33131) New opt-in plugin that scans the content passed to write_file / patch / skill_manage for 25 known-dangerous code patterns — pickle.load, yaml.load, eval(, os.system, subprocess(shell=True), child_process.exec, dangerouslySetInnerHTML, innerHTML/outerHTML/document.write/ insertAdjacentHTML, crypto.createCipher (no IV), AES ECB, TLS verification disabled, XXE-prone xml.etree/minidom parsers, <script src=//...> without SRI, torch.load without weights_only=True, GitHub Actions ${{ github.event.* }} injection — and appends a "Security guidance" warning block to the tool result via the transform_tool_result hook. Default behaviour is non-blocking: the file is written and the warning rides back to the model in the next turn so it can self-correct or document why the construct is safe. SECURITY_GUIDANCE_BLOCK=1 upgrades to refusing the write entirely; SECURITY_GUIDANCE_DISABLE=1 is the kill switch. Pattern data (patterns.py) is a verbatim Apache-2.0 fork of Anthropic's claude-plugins-official/plugins/security-guidance/hooks/ patterns.py at commit 0bde168 (2026-05-26). LICENSE and NOTICE preserve attribution. The Hermes-side plugin glue (__init__.py, plugin.yaml, README.md, tests) is original work. Plugin is opt-in like all bundled plugins: hermes plugins enable security-guidance Inspired by https://x.com/ClaudeDevs/status/1927108527247... — Anthropic shipped this as their security-guidance plugin for Claude Code on 2026-05-26 with a measured 30-40% reduction in security-related PR comments on internal rollout. What's NOT ported (deferred): * Layer 2 (LLM diff review on turn end) — would route through main model by default on Hermes, real money on reasoning models. A follow-up can wire it to a cheap aux model with explicit opt-in. 
* Layer 3 (agentic commit-time review) — agent can run this on demand via delegate_task today. * .hermes/security-guidance.md project-rules file — only used by layers 2/3 upstream. * test(dashboard): pin current loopback auth behavior as regression harness Phase 0, Task 0.1 of the dashboard-oauth plan. Establishes a baseline for the loopback dashboard's auth surface so future phases can prove they didn't regress the existing _SESSION_TOKEN flow when adding the OAuth gate. * feat(dashboard): add should_require_auth predicate for OAuth gate Phase 0, Task 0.2. Single source of truth for 'is the auth gate active?'. Reuses the existing _LOOPBACK_HOST_VALUES frozenset so this stays in sync with the DNS-rebinding host-header check. RFC1918/CGNAT/link-local are treated as public — exact threat model the gate exists for. * feat(dashboard): stash auth_required flag on app.state Phase 0, Task 0.3. start_server now computes should_require_auth(host, allow_public) and records it on app.state.auth_required BEFORE the existing legacy SystemExit guard fires. This gives middleware, the SPA token-injection path, and WS endpoints a consistent read source for 'is the gate active'. The flag is set but no one reads it yet — Phase 3 registers the gate middleware. Note: 4 pre-existing test failures in tests/hermes_cli/test_web_server.py (PtyWebSocket) + test_update_hangup_protection.py reproduce on pristine HEAD and are unrelated to this change (starlette TestClient WS regression). * feat(dashboard-auth): define DashboardAuthProvider ABC + Session dataclass Phase 1, Task 1.1. New package hermes_cli/dashboard_auth/ contains: base.py - DashboardAuthProvider ABC with 5 abstract methods (start_login, complete_login, verify_session, refresh_session, revoke_session), Session + LoginStart frozen dataclasses, three exception types (ProviderError / InvalidCodeError / RefreshExpiredError), and assert_protocol_compliance() for plugins to call in their own tests. 
registry.py - Module-level register/get/list/clear with a lock. Nothing reads the registry yet — Phase 2 adds the StubAuthProvider and Phase 3 wires the gate middleware. The plugin hook lands in Task 1.3. * test(dashboard-auth): cover registry register/get/list/clear semantics Phase 1, Task 1.2. Verifies registration order is preserved, duplicate names are rejected with ValueError, and non-compliant providers fail at register time (not later when the middleware tries to dispatch). * feat(plugins): add register_dashboard_auth_provider hook on PluginContext Phase 1, Task 1.3. Mirrors the existing register_image_gen_provider pattern (plugins.py:531) — wrong-type or duplicate-name registrations log at WARNING and silently return rather than raising, so a misbehaving auth plugin cannot crash the host. Deviation from plan: the plan's draft raised TypeError on non-provider input; switched to silent-warn to match the established image_gen convention. Test updated to match. * feat(dashboard-auth): json-lines audit log at $HERMES_HOME/logs/dashboard-auth.log Phase 1, Task 1.4. Records every auth event (login start/success/failure, logout, refresh success/failure, revoke, session verify failure, WS ticket mint) as one JSON object per line. Token-like kwargs (access_token, refresh_token, code, code_verifier, state, ticket, cookie, Authorization) are dropped before serialisation so the log never contains live secrets. Write failures log at WARNING but never raise — auth flows must not fail because the audit logger broke. * test(dashboard-auth): stub auth provider for E2E gate testing Phase 2, Task 2.1. Self-contained fake IDP — start_login redirects straight back to {redirect_uri}?code=stub_code&state=<s> so tests can walk the OAuth round trip in-process. Tokens are HMAC-signed JSON blobs (not real JWTs) — enough structure for verify_session to detect tamper and expiry without pulling in pyjwt. Lives in tests/ only — never registered as a real plugin. 
Phase 3's end-to-end tests import StubAuthProvider directly.

Convention: exp <= now counts as expired (TTL=0 means born-expired), which matches what Phase 6's silent-refresh test will need.

* feat(dashboard-auth): cookie helpers for session_at/session_rt/pkce

Phase 3, Task 3.1. Three cookies:
- hermes_session_at: OAuth access token (HttpOnly, TTL = token TTL)
- hermes_session_rt: OAuth refresh token (HttpOnly, 30d max-age)
- hermes_session_pkce: PKCE state + verifier + provider hint (10min)

All SameSite=Lax + Path=/. Secure flag is set ONLY when the request scheme is https: uvicorn proxy_headers=True (enabled in gated mode at Phase 3.5) rewrites scheme from X-Forwarded-Proto so Fly's TLS terminator works.

* feat(dashboard-auth): auth gate middleware + /auth/* routes + /login HTML

Phase 3, Tasks 3.2 + 3.3 + 3.4. These three pieces are mutually dependent so they land together.

middleware.py - gated_auth_middleware engages when app.state.auth_required is True. Allowlists /login, /auth/*, /api/auth/providers, and static asset paths; everything else demands a valid session_at cookie. Verifies by trying every registered provider's verify_session in turn (multi-provider stack); attaches verified Session to request.state.session. Returns 401 JSON for /api/* and 302 -> /login for HTML. ProviderError during verify -> 503.

routes.py - APIRouter with:
  GET  /login                     server-rendered HTML
  GET  /auth/login?provider=N     302 to IDP + PKCE cookie
  GET  /auth/callback?code,state  completes login, sets session cookies
  POST /auth/logout               clears cookies + best-effort revoke
  GET  /api/auth/providers        public bootstrap endpoint (503 if zero)
  GET  /api/auth/me               verified session as JSON (auth-required)

login_page.py - Inline-CSS HTML template, no React, no JavaScript.

web_server.py - Mounted gated_auth_middleware between host_header and auth_middleware (FastAPI runs middlewares in registration order: host check -> cookie auth -> token auth).
auth_middleware short-circuits when auth_required so cookie auth is authoritative in gated mode. Router is included before mount_spa so the catch-all doesn't swallow /login or /auth/*. 17 new behavioural tests; loopback regression harness still green. * feat(dashboard-auth): fail-closed on no providers; proxy_headers when gated; suppress _SESSION_TOKEN injection Phase 3, Task 3.5. Three changes to web_server.py: 1. start_server replaces the legacy SystemExit-refusing-to-bind guard with: if app.state.auth_required and no providers registered, exit with a clear message; otherwise log the gate-on banner. --insecure keeps its existing behaviour. 2. uvicorn proxy_headers flag is computed from app.state.auth_required. Loopback / --insecure keep it False (so _ws_client_is_allowed sees the real peer for the loopback gate); gated mode flips it True so X-Forwarded-Proto from Fly's TLS terminator is honoured for cookie Secure-flag decisions in detect_https(). 3. _serve_index no longer injects window.__HERMES_SESSION_TOKEN__ when the gate is on — the SPA reads identity from /api/auth/me using cookie auth instead. window.__HERMES_AUTH_REQUIRED__ flag lets the SPA pick between ticket-auth (gated) and token-auth (loopback) for /api/pty + /api/ws (Phase 5 will wire this in the React layer). 4 new behavioural tests; loopback regression harness still green. * docs(dashboard-auth): plan v2 — incorporate Portal OAuth contract (PR #180) Adds a 'Contract Anchor' section at the top of the plan summarizing the 11 material findings from nous-account-service PR #180's published contract. Rewrites Phase 4 (Nous provider) and Phase 6 (re-auth UX) in-place; the v1 drafts are preserved inline marked 'rejected — preserved for archeology' for reviewer context. Phases 0–3 (already shipped) are unaffected — they set up gate engagement and cookie plumbing only. The cookies module's RT cookie becomes dead in Phase 6 task 6.3 and is removed there. 
Key contract-driven reversals: - client_id is per-instance (agent:{id}), env-injected — not static - audience is bare client_id, not 'hermes-cli:' prefixed - scope is 'agent_dashboard:access' only - JWT claims do NOT include email/name — surface user_id instead - no refresh tokens in V1 — 401 → redirect to /login - JWKS-only verification, no userinfo fallback - redirect_uri is exact-match per AgentInstance, not wildcard Phase 7's AuthWidget needs to display user_id (truncated) instead of email; one-line annotation added at the top of that phase. * feat(dashboard-auth): plugins/dashboard_auth/nous — contract-compliant Nous OAuth provider Bundled, kind=backend, auto-loads. Activates ONLY when Portal-injected env vars are present: HERMES_DASHBOARD_OAUTH_CLIENT_ID — agent:{instance_id} HERMES_DASHBOARD_PORTAL_URL — Portal base URL Loopback / --insecure operators leave both unset and never see this plugin register anything. The fail-closed branch in start_server handles the 'public bind + zero providers' case independently. Implementation follows nous-account-service PR #180's published OAuth contract verbatim: - client_id is per-instance (agent:{instance_id}); the suffix is cross-checked against the token's agent_instance_id claim as defense-in-depth (contract C9). - scope is agent_dashboard:access only (contract C3). - aud is the bare client_id, no hermes-cli: prefix (contract C2). - RS256 JWT verification against /.well-known/jwks.json with 5-minute cache (contract C7). - No refresh tokens in V1: refresh_session always raises RefreshExpiredError; revoke_session is a no-op (contract C5). - oauth_contract_version claim: missing → warn + proceed; present and != 1 → refuse (contract C11, OQ-C2 tolerant treatment). - redirect_uri validated client-side as defense before bouncing to Portal; authoritative check is server-side per agent-redirect-uri.ts. 
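
The claim checks listed above (bare-`client_id` audience, `agent_instance_id` cross-check, tolerant `oauth_contract_version` handling) might look something like the sketch below once the JWT signature has already been verified. It is illustrative only: `check_claims` and `ClaimError` are hypothetical names, the claims are taken as a plain dict, and RS256/JWKS verification and the scope check are deliberately left out.

```python
import logging

log = logging.getLogger(__name__)


class ClaimError(Exception):
    """Raised when a verified token's claims violate the contract (hypothetical)."""


def check_claims(claims: dict, client_id: str) -> None:
    """Validate already-signature-verified claims against the per-instance client_id."""
    # aud is the bare client_id, with no 'hermes-cli:' prefix.
    if claims.get("aud") != client_id:
        raise ClaimError("aud does not match client_id")

    # client_id has the form agent:{instance_id}; cross-check the suffix
    # against the agent_instance_id claim as defense-in-depth.
    prefix, _, instance_id = client_id.partition(":")
    if prefix != "agent" or not instance_id:
        raise ClaimError("client_id is not of the form agent:{instance_id}")
    if claims.get("agent_instance_id") != instance_id:
        raise ClaimError("agent_instance_id does not match client_id suffix")

    # Missing version: warn and proceed. Present and != 1: refuse.
    version = claims.get("oauth_contract_version")
    if version is None:
        log.warning("token carries no oauth_contract_version claim; proceeding")
    elif version != 1:
        raise ClaimError(f"unsupported oauth_contract_version: {version!r}")
```

Keeping these checks in one function that raises on the first violation makes it straightforward to drive a claim-check matrix from tests.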
41 new tests covering construction, plugin-entry env gating, start_login shape, complete_login httpx-mocked happy path + error mapping, verify_session JWT verification (RSA keypair fixture, full claim-check matrix), refresh_session always raising, revoke_session no-op. PyJWT + cryptography are already in the venv (jose was previously suggested; switched to pyjwt[crypto] since the latter is already pulled in transitively). * feat(dashboard-auth): single-use WS tickets + POST /api/auth/ws-ticket Phase 5 task 5.1. Browsers cannot set Authorization on a WebSocket upgrade, so in gated mode the SPA needs an alternative way to bind the upgrade to its authenticated session. hermes_cli/dashboard_auth/ws_tickets.py — in-memory single-use ticket store with 30s TTL. Thread-safe (threading.Lock), token_urlsafe(32) values, ticket value truncated to 8 chars in error messages for log hygiene. Module-level state with _reset_for_tests() helper. hermes_cli/dashboard_auth/routes.py — adds POST /api/auth/ws-ticket. Auth-required (the gate middleware already attaches Session to request.state.session). Returns {ticket, ttl_seconds}; emits WS_TICKET_MINTED audit event with user_id + provider + ip. hermes_cli/dashboard_auth/audit.py — adds WS_TICKET_REJECTED enum value for the consume-side rejection event (wired into the WS endpoints in task 5.2). 11 new tests covering round-trip, single-use, TTL boundary, unknown ticket rejection, secret-hygiene truncation in error messages, and concurrent mint+consume from 20 threads. * feat(dashboard-auth): _ws_auth_ok helper + ticket auth on all 4 WS endpoints Phase 5 task 5.2. Four WebSocket endpoints — /api/pty, /api/ws, /api/pub, /api/events — previously authed with the same constant-time check against `_SESSION_TOKEN`. Replaced with a single helper that branches on `app.state.auth_required`: Loopback / --insecure: legacy ?token=<_SESSION_TOKEN> path (unchanged). Gated: ?ticket=<single-use> consumed against the dashboard-auth ticket store. 
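
For orientation, the single-use ticket store that the gated branch consumes from could be sketched along these lines. This is a hedged approximation built only from the properties stated in the ws_tickets commit (in-memory, 30s TTL, `threading.Lock`, `token_urlsafe(32)`, ticket truncated to 8 chars in error messages); the class name `TicketStore`, the injectable `clock` parameter, the `ValueError` exception type, and treating the exact TTL boundary as expired are assumptions, and the real module uses module-level state rather than a class.

```python
import secrets
import threading
import time

TICKET_TTL_SECONDS = 30


class TicketStore:
    """In-memory single-use ticket store (hypothetical sketch)."""

    def __init__(self, ttl=TICKET_TTL_SECONDS, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock  # injectable so tests can control time
        self._lock = threading.Lock()
        self._tickets = {}

    def mint(self, user_id, provider):
        """Create a ticket bound to a user and provider."""
        ticket = secrets.token_urlsafe(32)
        with self._lock:
            self._tickets[ticket] = (self._clock() + self._ttl, user_id, provider)
        return ticket

    def consume(self, ticket):
        """Redeem a ticket exactly once; raise ValueError if unknown or expired."""
        with self._lock:
            # pop() makes the ticket single-use even under concurrent consumers.
            entry = self._tickets.pop(ticket, None)
        if entry is None:
            # Only a truncated prefix is included, for log hygiene.
            raise ValueError(f"unknown ticket {ticket[:8]}...")
        expires_at, user_id, provider = entry
        if self._clock() >= expires_at:
            raise ValueError(f"expired ticket {ticket[:8]}...")
        return user_id, provider
```

Removing the entry under the lock before checking expiry means an expired ticket is also discarded on its first presentation.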
Critical security property: gated mode UNCONDITIONALLY rejects the ?token= path. A leaked _SESSION_TOKEN value from a log line is not replayable for WS access in gated deployments. `_build_sidecar_url` now branches too: loopback uses the legacy token; gated mode mints a server-internal ticket via mint_ticket() with pseudo-user 'pty-sidecar' / provider 'server-internal' so audit logs can distinguish PTY-internal sidecar tickets from browser tickets. PTY children open /api/pub exactly once at startup so single-use suffices. Ticket rejections audit-log as WS_TICKET_REJECTED with truncated reason + client IP + WS path. Operators debugging 'WS keeps closing' issues see which endpoint and why. 17 new tests: - POST /api/auth/ws-ticket: 200 with cookie, 401/302 without, distinct per call, GET-not-allowed. - _ws_auth_ok loopback: token accept/reject, missing-token reject, ticket-param-ignored. - _ws_auth_ok gated: ticket accept, single-use rejection, unknown reject, legacy-token-rejected-in-gated assertion, audit-log emission. - _build_sidecar_url: loopback uses token=, gated uses ticket=, no-bound returns None. * feat(dashboard-auth): SPA WS auth — getWsTicket() + buildWsAuthParam() Phase 5 task 5.3. The dashboard's three WS-using surfaces (ChatPage, gatewayClient, ChatSidebar) previously hardcoded ?token=<session>. In gated mode the server rejects that path; the SPA must mint a single-use ticket via POST /api/auth/ws-ticket and pass ?ticket= on the upgrade. web/src/lib/api.ts: adds getWsTicket() (POST /api/auth/ws-ticket with credentials: 'include') and buildWsAuthParam() — a helper that returns ['ticket', <minted>] in gated mode and ['token', <session>] in loopback. Window.__HERMES_AUTH_REQUIRED__ is read from the server-injected bootstrap script and toggles the path. Documented as the bridge from cookie auth (REST) to WS auth. web/src/pages/ChatPage.tsx: buildWsUrl() now takes an [authName, authValue] pair instead of a bare token. 
The WS construct is wrapped in an IIFE so the outer effect can stay synchronous (the cleanup returns the effect's disposer at top level). onDataDisposable + onResizeDisposable hoisted to `let` bindings the cleanup closes over. web/src/lib/gatewayClient.ts: connect() branches on window.__HERMES_AUTH_REQUIRED__ before opening /api/ws. Explicit token overrides win (test-only path); otherwise gated → fetch ticket, loopback → use injected session token. web/src/components/ChatSidebar.tsx: events-feed WS opens through the same IIFE pattern as ChatPage. The ws local is hoisted so the cleanup's ws?.close() works after the async mint resolves. Server side already injects window.__HERMES_AUTH_REQUIRED__ in _serve_index (Phase 3.5). * feat(dashboard-auth): Phase 6 — 401 re-auth envelope + next= propagation Contract V1 of nous-account-service PR #180 ships no refresh tokens, so the original Phase 6 silent-refresh design is replaced with a thinner '401 → redirect to /login' UX. The dashboard's gated middleware now emits a structured envelope on any auth failure; the SPA's fetch wrapper sees it and full-page-navigates the user through re-auth. hermes_cli/dashboard_auth/cookies.py: set_session_cookies(refresh_token='') SKIPS writing the hermes_session_rt cookie. Forward-compat: a non-empty refresh_token still emits the cookie unchanged, so a future Portal contract that starts issuing RTs flips the persistence on with no other change. clear_session_cookies still emits a Max-Age=0 deletion for the RT cookie so stale cookies from earlier deployments get flushed on logout / session expiry. Deprecation marker + rationale in module docstring per the user's docstring-only deprecation pattern. 
hermes_cli/dashboard_auth/middleware.py: _unauth_response now builds a structured JSON envelope for API 401s: { error: 'session_expired' | 'unauthenticated', detail: 'Unauthorized', reason: <internal>, login_url: '/login?next=<safe-path>' } HTML redirects also carry next= so a user landing on /sessions without a cookie bounces back to /sessions after re-auth. _safe_next_target validates same-origin: drops protocol-relative paths (//evil.com), absolute URLs, and any /login or /auth/* loop. Dead cookies are cleared on the 401 path so the browser stops replaying invalid tokens. hermes_cli/dashboard_auth/routes.py: /auth/callback accepts next= query param and validates via _validate_post_login_target (same rules as the gate's _safe_next_target — defence-in-depth because next= survived a full IDP round trip and attacker-controlled state can re-enter via the callback URL). Open-redirect attempts land at '/' instead. web/src/lib/api.ts: fetchJSON parses the 401 envelope and full-page-navigates to body.login_url ONLY on the known session-expiry error codes. Domain-level 401s (e.g. permission errors) bubble up as regular errors. credentials: 'include' added so cookie auth works for all fetches routed through this wrapper. sessionStorage.lastLocation is preserved for future use by AuthWidget / hermes_status. Test files marked with pytest.mark.xdist_group so the four files that mutate web_server.app.state.auth_required serialize onto the same xdist worker — eliminates 'works locally, fails in CI' app-state bleed. 
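The same-origin rules for next= can be sketched roughly as below. This is an illustrative approximation, not the actual `_safe_next_target`: the function names, the fallback to `/`, and the extra backslash check are assumptions that the commit message does not spell out.

```python
from urllib.parse import quote, urlsplit


def safe_next_target(target, default="/"):
    """Return target only if it is a same-origin path; otherwise default."""
    if not target or not target.startswith("/"):
        return default  # absolute URLs and anything that is not a path
    if target.startswith("//") or "\\" in target:
        # Protocol-relative (//evil.com). The backslash check is an extra
        # precaution added for this sketch, not taken from the source.
        return default
    parts = urlsplit(target)
    if parts.scheme or parts.netloc:
        return default
    path = parts.path
    if path in ("/login", "/auth") or path.startswith(("/login/", "/auth/")):
        return default  # avoid bouncing back into the login flow
    return target


def login_url(target):
    """Build the login_url value carried in the 401 envelope."""
    return "/login?next=" + quote(safe_next_target(target), safe="")
```

For example, `login_url("/")` yields `/login?next=%2F`, the shape the updated tests accept.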
20 new tests in test_dashboard_auth_401_reauth.py: - set_session_cookies(refresh_token='') skips RT cookie - clear_session_cookies still emits RT deletion - 401 envelope shape (unauthenticated vs session_expired) - dead cookie cleared on invalid-token 401 - login_url carries next= for deep paths - login loop avoided when path is /login/auth/api-auth - protocol-relative URL rejected - _safe_next_target unit tests (accept same-origin, reject loops/abs) - /auth/callback respects safe next= but rejects open redirects 2 pre-existing tests updated to accept the new /login?next=%2F shape. Full dashboard-auth suite: 168 passed, 1 skipped (Phase 0 pre-existing). * feat(dashboard-auth): Phase 7 — SPA AuthWidget + /api/status auth fields Phase 7 surfaces the OAuth gate state to users. web/src/components/AuthWidget.tsx (new): Sidebar widget that fetches /api/auth/me on mount and renders a …

…d works (NousResearch#28851)

When HERMES_UID remaps the hermes user from 10000 to another UID (e.g. matching the host user's UID for bind-mount ergonomics), the TUI launcher's esbuild step fails:

✘ [ERROR] Failed to write to output file: open /opt/hermes/ui-tui/dist/entry.js: permission denied
TUI build failed.

This is because the Dockerfile's build-time `chown -R hermes:hermes` on `/opt/hermes/{.venv,ui-tui,node_modules}` (line 154) wrote UID 10000, and stage2-hook.sh only re-chowned `.venv` on UID remap, leaving the TUI build trees still owned by the old UID.

Extend the stage2 re-chown to include the same set as the build-time chown: `.venv`, `ui-tui`, `node_modules`. These are the runtime-writable trees under $INSTALL_DIR; everything else under /opt/hermes is read-only at runtime so keeping it root-owned is fine.

Original fix targeted docker/entrypoint.sh which is now a deprecated shim; retargeted to docker/stage2-hook.sh where the .venv chown moved during the s6-overlay rework.

Co-authored-by: Andreas Steffan <623481+deas@users.noreply.github.com>
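The real fix lives in the shell script docker/stage2-hook.sh; the Python sketch below only illustrates the idea of re-owning every runtime-writable tree after a UID remap. The helper names are hypothetical, and checking only the top-level directory's owner is a simplification chosen for the example.

```python
import os
from pathlib import Path

# Same set as the build-time chown named in the commit message.
RUNTIME_WRITABLE = (".venv", "ui-tui", "node_modules")


def trees_to_rechown(install_dir, target_uid):
    """Return the runtime-writable trees whose top directory has another owner."""
    stale = []
    for name in RUNTIME_WRITABLE:
        tree = Path(install_dir) / name
        if tree.is_dir() and tree.stat().st_uid != target_uid:
            stale.append(tree)
    return stale


def rechown(tree, uid, gid):
    """Recursive chown, similar in spirit to `chown -R uid:gid tree`."""
    os.chown(tree, uid, gid, follow_symlinks=False)
    for root, dirs, files in os.walk(tree):
        for entry in dirs + files:
            os.chown(os.path.join(root, entry), uid, gid, follow_symlinks=False)
```

Before the fix, only `.venv` was handled on remap, so `ui-tui/dist` stayed owned by UID 10000 and esbuild could not write `entry.js`.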

…kworm-slim (NousResearch#4977)

Debian trixie's bundled `nodejs` package is pinned to 20.19.2, which reached LTS EOL in April 2026. Trixie won't upgrade in place; Debian 14 (forky), where the apt nodejs is 24.x, isn't released until ~mid-2027.

To stay on a supported LTS without waiting for Debian 14, copy node + npm + corepack from the upstream `node:22-bookworm-slim` image as a multi-stage source, matching the existing `uv_source` and `gosu_source` patterns in the Dockerfile. Bookworm-based slim image is used so the produced binary links against glibc 2.36, which runs cleanly on Debian 13 (trixie, glibc 2.41).

Changes:
- Add `FROM node:22-bookworm-slim@sha256:... AS node_source` stage
- Remove `nodejs npm` from `apt-get install` (now sourced from node_source)
- Add `ca-certificates` explicitly to apt install (was a transitive of the apt nodejs package; removing nodejs broke the chain and curl inside the build failed with "error setting certificate file")
- COPY node binary + npm + corepack from node_source; recreate the symlinks at /usr/local/bin/{npm,npx,corepack}
- Update the npm_config_install_links=false comment block: npm 10's default is already `install-links=false`, but we keep the env as defense-in-depth against future Node-source-version regressions

Future bumps to Node 24/26 are a one-line ARG change.

Validation:
- Built --no-cache against current origin/main; build succeeds in 1m42s
- Image size: 3.27 GB (pre-salvage-1 baseline) → 3.14 GB (this PR); net 130 MiB savings (60 MiB from this change alone vs current main, from removing apt nodejs+transitive deps that duplicated what node bundles)
- Node 22.22.3 / npm 10.9.8 / esbuild 0.27.7 all run cleanly under trixie's glibc 2.41
- Standard image smoke (6/6), Node-version E2E (8/8), chown E2E from NousResearch#19788 (6/6), TUI UID-remap E2E from NousResearch#28851 (4/4); 24 checks total

Co-authored-by: Prithvi Monangi <8312237+Prithvi1994@users.noreply.github.com>


* fix(skills): reject symlinks in skill bundles before install * fix(skills-hub): show every catalog source on /docs/skills (skills.sh, ClawHub, browse.sh, OpenAI, …) (#32336) The Skills Hub page was stuck on a stale Feb 25 snapshot, showing only Built-in + Optional + Anthropic + LobeHub. The unified index already has 2078 skills from skills.sh / ClawHub / LobeHub / GitHub taps / Claude Marketplace, and BrowseShSource adds another ~330 — none of it was reaching the page. Changes: - website/scripts/extract-skills.py: read website/static/api/skills-index.json (the unified multi-source catalog, rebuilt twice daily) as the canonical external source. Keep the legacy skills/index-cache/ fallback for offline builds. Add friendly per-source labels (skills.sh, ClawHub, browse.sh, OpenAI, HuggingFace, Anthropic, LobeHub, etc.) and per-entry installCmd. - website/src/pages/skills/index.tsx: add source pills + ordering for the 11 new sources; render installCmd from the index entry. - website/scripts/prebuild.mjs: when no local skills-index.json exists, fetch the live one from hermes-agent.nousresearch.com so local 'npm run build' matches production without burning GitHub API quota. - scripts/build_skills_index.py: crawl BrowseShSource so browse.sh entries land in the unified index. Adjust source_order. - tools/skills_hub.py: GitHubSource.DEFAULT_TAPS — openai/skills moved its skills into skills/.curated/ and skills/.system/, so add both as explicit taps (the listing code skips dotted dirs by design). Drop VoltAgent/awesome-agent-skills (README-only, no SKILL.md files) and MiniMax-AI/cli (singular skill, not a tap directory). Net effect: github source jumps from 83 → 143 skills, with OpenAI properly included. - .github/workflows/deploy-site.yml: build the unified index BEFORE running extract-skills.py — previous order meant extract-skills always fell back to the legacy cache. Drop the 'skip if file exists' guard; the file is gitignored and must be rebuilt every deploy. 
- .github/workflows/skills-index.yml: drop the broken 'deploy-with-index' job (it cp'd 'landingpage/\*' which no longer exists, failing every cron run since the landingpage move). Replace it with a workflow_dispatch trigger of deploy-site.yml so the index refresh still reaches production on schedule. - website/docs/user-guide/features/skills.md: drop VoltAgent from the default-taps doc list to match the code. Before: 695 skills (Built-in 90, Optional 84, Anthropic 16, LobeHub 505). After: 2168 skills across 9 source pills, including the 1212 skills.sh entries the user expected to see. * fix(docker): propagate container env through s6 to cont-init and main CMD s6-overlay's /init scrubs the environment before invoking both /etc/cont-init.d/* scripts and the container's CMD wrapper. As a result, ENV directives from the Dockerfile (HERMES_HOME=/opt/data, HERMES_WEB_DIST, …) and compose-time `environment:` entries (HERMES_UID, HERMES_GID) never reached the scripts that actually use them. Three concrete failures observed on macOS Docker Desktop with `~/.hermes:/opt/data`: * stage2-hook.sh ran with HERMES_UID unset → no UID remap, hermes user stayed at UID 10000 instead of the host user's UID. * skills_sync.py (invoked from stage2-hook) ran with HERMES_HOME unset → get_hermes_home() fell back to Path.home()/.hermes, populating a shadow $HERMES_HOME/.hermes/skills tree on the mounted volume (visible on the host as ~/.hermes/.hermes/skills). * The main `hermes gateway run` process inherited HOME=/root from the /init context (s6-setuidgid doesn't update HOME), so libraries resolving XDG_STATE_HOME via $HOME tried to write to /root/.local/state/hermes/gateway-locks/ and failed with EACCES, preventing the Discord adapter from acquiring its bot-token lock. Three surgical changes restore correct env flow: 1. 
The auto-generated /etc/cont-init.d/01-hermes-setup wrapper now uses `#!/command/with-contenv sh`, matching the pattern already used by docker/cont-init.d/02-reconcile-profiles. The container env (Dockerfile ENV + compose `environment:`) now reaches stage2-hook.sh and the skills_sync.py subprocess it spawns. 2. docker/main-wrapper.sh also switches to `#!/command/with-contenv sh`. The container CMD (`gateway run`, `chat`, `setup`, …) now sees HERMES_HOME and the other container-level env vars. 3. docker/main-wrapper.sh exports HOME=/opt/data before `s6-setuidgid hermes`. with-contenv populates HOME from the /init context (/root); s6-setuidgid drops privileges but does not update HOME. The hermes user's home per /etc/passwd is /opt/data, so the explicit override matches passwd. No behavior change for the non-buggy paths: the s6-supervised services already used with-contenv, and HOME=/opt/data only affects processes that resolved $HOME-based paths to /root (silently broken). * feat(skills-hub): health checks, freshness badge, and a watchdog cron (#32345) Layered safety so the Skills Hub at /docs/skills stays in sync without silent rot. Three pieces: 1. build_skills_index.py — refuses to ship a degenerate index. EXPECTED_FLOORS per source (skills.sh ≥100, lobehub ≥100, clawhub ≥50, official ≥50, github ≥30, browse-sh ≥50) and MIN_TOTAL=1500. Any source collapsing to zero (the silent OpenAI breakage that hid for weeks) now fails the workflow loud — broken index never reaches the live site. 2. extract-skills.py + the React page — visible freshness signal. Sidecar website/src/data/skills-meta.json carries the index's generated_at timestamp, plus per-source counts. Skills Hub renders a 'Catalog refreshed N hours ago · auto-rebuilt twice daily' line under the hero copy. If the cron stalls, users see the staleness immediately. 3. .github/workflows/skills-index-freshness.yml — watchdog cron. 
Every 4 hours, fetches the live /docs/api/skills-index.json, validates shape, checks age (>26h is stale), checks the same per-source floors, and opens (or appends to) a GitHub issue when anything is off. The issue is title-prefixed [skills-index-watchdog] so subsequent failures append a comment instead of spamming new issues. Net effect: - A silent regression like 'OpenAI tap moved its skills' now fails the build instead of shipping a quietly broken catalog. - A stuck cron (like the landingpage breakage that ran red for weeks) now files an issue within 4 hours. - Users see how fresh the catalog is on the page itself. Test plan: - Local: built skills-meta.json from the live index → 'Catalog refreshed N minutes ago' rendered correctly in the static HTML. - Probe logic dry-run against the live index: total=2456, all 6 sources above floor, age 0.1h — issues=NONE. - Triggered skills-index.yml manually; both jobs green, deploy-site.yml dispatch fired. * chore: add krislidimo to AUTHOR_MAP for PR #29775 (#32434) * fix(telegram): tighten table row-group spacing and drop redundant first bullet The GFM → Telegram-row-group rewriter previously joined every line in every row with a blank line ("\n\n".join(rendered_rows)), which made multi-column tables explode into one-bullet-per-paragraph walls on mobile. It also emitted the row heading twice when the table had no row-label column: once as the standalone bold heading and once again as the first labeled bullet (heading == headers[0] == data_cells[0]). This commit: * Uses single newlines between the heading and its bullets within a row-group, and a blank line only BETWEEN row-groups. * Skips any bullet whose value duplicates the heading text when the table has no row-label column (the heading already carries that information). Tables WITH a row-label column are unaffected since the heading comes from the label cell and never duplicates a header. 
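A rough sketch of the row-group spacing and dedup rules just described, under stated assumptions: the function name, the `*bold*` heading markup, the `• header: value` bullet format, and taking the heading from the first cell are illustrative choices, not the gateway's actual rewriter.

```python
def render_row_groups(headers, rows, has_label_column=False):
    """Render table rows as heading-plus-bullets groups (illustrative only)."""
    groups = []
    for cells in rows:
        heading = cells[0]  # assumption: the first cell supplies the heading
        lines = [f"*{heading}*"]
        start = 1 if has_label_column else 0
        for header, value in list(zip(headers, cells))[start:]:
            # Without a row-label column, skip the bullet that would
            # repeat the heading text.
            if not has_label_column and value == heading:
                continue
            lines.append(f"• {header}: {value}")
        # Single newlines inside a row-group...
        groups.append("\n".join(lines))
    # ...and a blank line only between row-groups.
    return "\n\n".join(groups)
```

With a row-label column the dedup branch never runs, so a data cell that happens to equal the label is still rendered.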
Updated existing test assertions accordingly and added two regression tests: one that reproduces the screenshot bug (wide five-column "Plays" comparison table) and one that pins the row-label-column behavior so the dedup logic doesn't accidentally swallow real data. tests/gateway/test_telegram_format.py: 101 passed * fix(subdirectory_hints): prevent loading AGENTS.md outside workspace SubdirectoryHintTracker was scanning directories outside the active working directory, allowing files like ~/.codex/AGENTS.md or ~/.claude/CLAUDE.md to be loaded and injected into the agent context. This causes cross-agent context contamination and instruction mixup. Add _is_ancestor_or_same() helper and a path boundary check in _is_valid_subdir(): only directories within the working directory tree (i.e. path.is_relative_to(working_dir)) are allowed. Also add exist_ok=True to mkdir() calls in new tests to prevent pytest-xdist race conditions when workers share the same tmp_path parent. Tests added: - test_outside_working_dir_rejected: verifies sibling dirs are blocked - test_outside_working_dir_absolute_path_rejected: verifies ~/.codex paths blocked - test_inside_workspace_subdir_allowed: verifies normal subdir access unaffected - test_sibling_repo_not_loaded_via_ancestor_walk: ancestor walk stays within workspace * harden: restrict markdown link schemes; parse untrusted XML with defusedxml Two small defensive-hardening changes: - web/src/components/Markdown.tsx: render links only for http(s)/mailto schemes; other schemes (javascript:, data:, vbscript:) are dropped to plain text so a crafted link in rendered content can't execute on click. - gateway/platforms/wecom_callback.py: parse the untrusted, pre-auth WeCom callback request body with defusedxml instead of xml.etree, blocking entity-expansion / billion-laughs (and XXE) on the parse path. defusedxml is already a dependency (uv.lock); response-building XML in wecom_crypto.py is unchanged (it is not parsed from untrusted input). 
Verified: dashboard typechecks and builds; defusedxml blocks an entity-expansion payload while valid WeCom envelopes still parse. * chore(wecom): make defusedxml dep acquireable and tolerant of absence Follow-up on top of @TheOnlyMika's #32155 cherry-pick. The defusedxml hardening import was unconditional, which would break the gateway for anyone running a WeComCallback adapter without the (transitive-only) defusedxml present. - Wrap the import in the same try/except pattern as aiohttp/httpx in the same file. Sets DEFUSEDXML_AVAILABLE flag. - Extend check_wecom_callback_requirements() to gate on the flag, so the gateway logs the actual missing dep and skips the adapter instead of crashing. - Add [wecom] extra to pyproject.toml with defusedxml==0.7.1. - Register platform.wecom_callback in tools/lazy_deps.py so users get prompted to install it on first WeComCallback configuration, same pattern as discord/slack/matrix. defusedxml is still the right call for pre-auth XML parsing — this commit just makes the dep declarative and recoverable instead of a hard import-time crash. * fix(cli): restore fallback paste collapse + handle long single-line pastes (#32447) Follow-up to #32087 after community report from @ethernet that 8000-char single-line pastes get dumped raw into the input box. A) Fallback regression revert paste_collapse_threshold_fallback default: 0 -> 5 #32087 disabled the fallback handler by default. The fallback path has been always-on with line_count >= 5 since #3065 (March 2026); the previous shape was the salvaged contributor's design and didn't match pre-existing behavior for terminals without bracketed paste support (Windows terminals, some SSH setups). Restoring the original on-by-default. B) Long single-line paste guard New config key: paste_collapse_char_threshold (default 2000) Bracketed-paste handler and fallback handler now BOTH collapse when line count >= line threshold OR total char length >= char threshold. 
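The combined collapse rule can be sketched as a single predicate. This is a hypothetical helper, not the CLI's handler: the function name and the way lines are counted are assumptions, and treating a threshold of 0 as "disabled" follows the commit's note for the char threshold and is assumed here for the line threshold as well.

```python
def should_collapse_paste(text, line_threshold=5, char_threshold=2000):
    """Collapse when either the line count or the total length is large."""
    line_count = text.count("\n") + 1  # assumption: newline-delimited lines
    if line_threshold > 0 and line_count >= line_threshold:
        return True
    # A threshold of 0 disables the corresponding check.
    return char_threshold > 0 and len(text) >= char_threshold
```

Under this rule a single 8000-character line collapses on length alone, even though its line count is 1.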
Catches the case ethernet hit: ~8000 chars of minified JSON / log output on a single line dumped raw into the buffer. TUI mirrors the same config via uiStore.pasteCollapseChars. Set 0 to disable. Defaults verified: paste_collapse_threshold: 5 paste_collapse_threshold_fallback: 5 paste_collapse_char_threshold: 2000 Tests: tests/hermes_cli/test_config.py: 87/87 pass ui-tui useConfigSync.test.ts: 34/34 pass ui-tui useComposerState.test.ts: 9/9 pass tsc: 0 new errors in touched files * feat(mcp): Nous-approved MCP catalog with interactive picker (#30870) * feat(mcp): Nous-approved MCP catalog with interactive picker Adds an optional-mcps/ directory mirroring optional-skills/: curated, Nous-approved MCP servers shipped with the repo but disabled by default. Presence in optional-mcps/ = approval. No community tier, no trust signals. Entries are added by merging a PR. New surface: hermes mcp Interactive catalog picker (default) hermes mcp catalog Plain-text list, scriptable hermes mcp install <name> Install a catalog entry Picker behavior: not installed -> install (clone/bootstrap if needed, prompt for creds) installed/off -> enable installed/on -> menu (disable / uninstall / reinstall) Manifest schema (manifest_version: 1) supports: - transport: stdio (command/args, ${INSTALL_DIR} substitution) or http (url) - install: optional git clone + bootstrap commands (for repos that need local venv setup, like the n8n bridge); omit for npx/uvx servers - auth: api_key (prompts -> ~/.hermes/.env), oauth (provider-mediated or native MCP), or none Catalog entries are never auto-updated. Users re-run `hermes mcp install` to refresh. Credentials always go to ~/.hermes/.env (the .env-is-for-secrets rule), never to per-server env blocks. Ships n8n as the reference manifest (https://github.com/CyberSamuraiX/hermes-n8n-mcp). Tests: 19 catalog tests + E2E install/uninstall round-trip via the shipped manifest. 
* feat(mcp): tool-selection checklist + Linear catalog entry Adds install-time tool selection so users only enable the MCP tools they actually want, and ships Linear as a second reference catalog entry to demonstrate the http+oauth path alongside n8n's stdio+api_key+git-bootstrap. Tool selection flow: install (clone/auth/credentials) -> probe server for available tools -> curses checklist with pre-checked rows -> write mcp_servers.<name>.tools.include Pre-check priority: 1. user's prior tools.include (reinstall preserves selection) 2. manifest's tools.default_enabled (curated subset) 3. all probed tools (default) Probe-failure fallback (server unreachable, OAuth not yet complete, backing service offline): - manifest declared default_enabled -> applied directly - no default declared -> no filter written (all-on when reachable) - both cases point user at hermes mcp configure <name> Manifest schema additions: tools: default_enabled: [list, of, tool, names] # optional Updates: - optional-mcps/linear/manifest.yaml -- new reference entry (http+oauth) - optional-mcps/n8n/manifest.yaml -- tools.default_enabled set to the 8 read-mostly tools; mutating tools (activate/deactivate, container_logs) pruned by default - docs: new 'Tool selection at install time' section in features/mcp.md Tests: 7 new tests in TestToolSelection covering probe-success / probe-fail matrix, manifest-default filtering, reinstall-preserves-selection, and invalid-default-enabled rejection. 26 catalog tests + 32 existing mcp_config tests passing. * feat(mcp): polish — picker unification, include-mode convergence, hardening Addresses review findings on PR #30870. Lands all improvements that belong in this PR before merge; defers separate cleanup (consolidating two probe implementations, change-detector tests) to follow-ups. 
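For reference, the pre-check priority and probe-failure fallback from the tool-selection commit above can be sketched as follows. The function names are hypothetical, and falling through to the next source when a list shares no tools with the probe result is an assumption the commit message does not address.

```python
def prechecked_tools(probed, prior_include=None, manifest_default=None):
    """Pick which probed tools start checked in the install-time checklist.

    Priority: the user's prior tools.include, then the manifest's
    tools.default_enabled, then every probed tool.
    """
    for source in (prior_include, manifest_default):
        wanted = set(source or ())
        chosen = [tool for tool in probed if tool in wanted]
        if chosen:  # assumption: an empty overlap falls through
            return chosen
    return list(probed)


def include_on_probe_failure(manifest_default=None):
    """Fallback when the server cannot be probed.

    Returns the manifest default if one is declared; otherwise None,
    meaning no filter is written (all tools on once reachable).
    """
    return list(manifest_default) if manifest_default else None
```

Keeping the result in probe order means a reinstall preserves the user's selection without depending on the order stored in the config.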
Picker UX (mcp_picker.py) - Unifies catalog + custom (user-added) MCPs in one view with distinct status badges (available / enabled / installed (disabled) / custom — enabled / custom — disabled) - Adds 'Configure tools (probe server + re-pick)' action to both the catalog-installed and custom-row submenus — the existing hermes mcp configure flow was previously unreachable from the picker - Loops until ESC/q so the user can manage several entries in one session instead of having to re-launch - Uninstall message now mentions .env credentials are preserved with a pointer to clean them up manually if no longer needed - Surfaces a 'requires a newer Hermes' warning per future-manifest entry instead of silently hiding it Catalog (mcp_catalog.py) - catalog_diagnostics() exposes which manifests were skipped and why (future_manifest vs invalid) so UIs can give actionable feedback - _do_git_install detects SHA-shaped refs (regex /[0-9a-f]{7,40}/) and skips the doomed 'git clone --branch <sha>' attempt — clone --branch only accepts branches/tags, so SHAs always failed noisily before falling back to the full-clone path - Probe-success all-tools-enabled message now mentions that new tools the server adds later will be auto-enabled (no-filter mode) Convergence (tools_config.py) - _configure_mcp_tools_interactive now writes tools.include (whitelist) instead of tools.exclude (blacklist), matching the catalog flow and hermes mcp configure. 
The on-disk config shape no longer depends on which UI the user touched last
- Two existing tests updated to assert the new include-mode contract

Discoverability
- Setup wizard final step now prints 'Browse curated MCPs: hermes mcp'
- Three tip-corpus entries pointing at the new catalog
- Docs updated with: trust model (manifests run code locally, gated by PR review, but read before installing), runtime ${ENV_VAR} substitution semantics, and the manifest_version forward-compat behavior

Tests
- 7 new tests covering future-manifest diagnostics, custom MCP picker rows, SHA-ref git-install path, branch-ref git-install path, and the tools_config include-mode write contract
- 80 MCP-related tests passing across test_mcp_catalog.py, test_mcp_config.py, test_mcp_tools_config.py

* fix(mcp): drop setup-wizard catalog hint to satisfy supply-chain scanner

The wizard line 'Browse curated MCPs: hermes mcp' triggered the CI supply-chain scanner because it pattern-matches on edits to any file named hermes_cli/setup.py. That filename matches the Python 'install-hook file' heuristic even though this setup.py is the user-facing 'hermes setup' wizard, not a packaging install hook.

The catalog is already surfaced via three tip-corpus entries in hermes_cli/tips.py (which the scanner doesn't flag), so dropping the wizard mention loses no discoverability. Worth revisiting after a scanner allowlist for this specific file lands.

* chore(models): swap qwen3.6-plus → qwen3.7-max in openrouter+nous lists (#32809)

Updates curated picker lists for both the OpenRouter fallback snapshot (`OPENROUTER_MODELS`) and the Nous Portal list (`_PROVIDER_MODELS['nous']`). Regenerates website/static/api/model-catalog.json via `scripts/build_model_catalog.py` to keep the docs-hosted manifest in sync (drift guard in `test_in_repo_lists_match_manifest`).

tests/hermes_cli/test_models.py fixtures updated; they pinned the old model id as their live-fetch sample.
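
The SHA-ref check described in the mcp polish commit above (refs matching `[0-9a-f]{7,40}` skip `git clone --branch`) can be sketched roughly as follows. This is a minimal illustration, not the code in `_do_git_install`: the helper names, the full-string match, the `--depth 1` flag on the branch path, and the `repo` target directory are all assumptions.

```python
import re

# Pattern quoted in the commit message. Matching the whole ref is an assumption.
_SHA_REF = re.compile(r"[0-9a-f]{7,40}")


def looks_like_sha(ref: str) -> bool:
    """Return True when ref is shaped like an abbreviated or full commit SHA."""
    return _SHA_REF.fullmatch(ref) is not None


def clone_plan(url: str, ref: str) -> list[list[str]]:
    """Return the git commands to run for a ref (hypothetical helper)."""
    if looks_like_sha(ref):
        # `git clone --branch` accepts only branches and tags, so go straight
        # to a full clone followed by a checkout of the commit.
        return [
            ["git", "clone", url, "repo"],
            ["git", "-C", "repo", "checkout", ref],
        ]
    # Branch or tag: a single clone can select the ref directly.
    return [["git", "clone", "--depth", "1", "--branch", ref, url, "repo"]]
```

One caveat of any shape-based check: a branch or tag whose name happens to be 7 to 40 lowercase hex characters is treated as a SHA. In this sketch that still works, because `git checkout` resolves branch and tag names too.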
* fix(cron): clarify schedule is required for create in tool schema

Grok models (and other LLMs) sometimes omit the schedule parameter when calling the cronjob tool with action=create because the schema only listed 'action' in required[] and the schedule description did not explicitly state it was mandatory (issue #32427).

Fix: update schema descriptions to clearly state schedule is REQUIRED for action=create, making this explicit for models that rely on description text for parameter compliance.

Fixes #32427

* test(cron): guard schedule-required description text on CRONJOB_SCHEMA

* fix(gateway): refresh cached agent tools on /reload-mcp

When the gateway processes /reload-mcp, it reconnects MCP servers and updates the global _servers registry, but cached AIAgent instances in _agent_cache keep the tools list they were built with. The user had to also run /new (discarding conversation history) before the agent could see the new tools, even though /reload-mcp had succeeded.

This patch refreshes each cached agent's .tools and .valid_tool_names in _execute_mcp_reload after discovery returns, so existing sessions pick up new MCP tools on their next turn. The slash-confirm gate in _handle_reload_mcp_command already obtains user consent for the implied prompt-cache invalidation before this code runs.

Mirrors the equivalent behaviour the CLI already does in cli.py _reload_mcp. Per-agent enabled_toolsets and disabled_toolsets are preserved so an agent that was scoped to a subset of toolsets does not silently gain disabled tools after the reload.

Original diagnosis + initial implementation in #23812 from @fujinice. The auto-reload watcher half of that PR is intentionally dropped; users want /reload-mcp to remain explicit.

Co-authored-by: fujinice <45688690+fujinice@users.noreply.github.com>

* docs(auth): replace stale 'hermes login' references with 'hermes auth add'

'hermes login' was removed (the command now just prints a deprecation message and exits).
The bundled hermes-agent SKILL.md, in-code error messages, the tip rotation, the proxy adapters, and the docs site still pointed agents and users at the dead command, so models loading the skill kept running 'hermes login --provider openai-codex' and getting a dead-end print.

Replacements use the canonical 'hermes auth add <provider>' surface (or bare 'hermes auth' for the interactive manager).

Files:
- skills/autonomous-ai-agents/hermes-agent/SKILL.md (+ regenerated docs page)
- hermes_cli/tips.py (tip rotation)
- agent/google_oauth.py (gemini-cli error message)
- agent/conversation_loop.py (nous re-auth troubleshooting line)
- agent/credential_sources.py (docstring)
- hermes_cli/proxy/cli.py + hermes_cli/proxy/adapters/nous_portal.py (proxy auth hints)
- tests/hermes_cli/test_proxy.py (updated assertions)
- website/docs/reference/faq.md, website/docs/user-guide/features/subscription-proxy.md
- zh-Hans i18n mirrors for the above

'hermes logout' is still a live command and is left untouched. The 'hermes login' stub in hermes_cli/auth.py:login_command() and the cli-commands.md 'Deprecated' rows are intentionally kept as the discoverable deprecation surface.

* fix(agent): recover Codex streams with null output

* chore(release): map carltonawong noreply to GitHub login

Added AUTHOR_MAP entry for the cherry-picked fix in the preceding commit so the release contributor audit can resolve Carlton's noreply email.

* chore(release): map wangpuv contributor email for #32933 (#33005)

Pre-stages the AUTHOR_MAP entry so the contributor-check workflow passes when Will Falcon's image-gen SSE fix lands.

* fix: parse Codex image generation SSE directly

* feat(opencode-go): route qwen3.7-max via anthropic_messages

qwen3.7-max on OpenCode Go rejects the OpenAI-compatible (oa-compat) format with HTTP 401 but works correctly via the Anthropic Messages endpoint (/v1/messages with x-api-key auth). Route it the same way MiniMax models are routed: anthropic_messages api_mode.
Changes:
- hermes_cli/models.py: add qwen3.7-max routing + curated list
- hermes_cli/setup.py: add to setup wizard model list
- hermes_cli/auth.py: update provider comment
- tests: add assertions for qwen3.7-max api_mode routing

* feat: add TUI session orchestrator

Add a first-class active-session orchestrator for the Ink TUI:
- list, activate, close, and launch live process-local TUI sessions
- hydrate committed and in-flight output when switching sessions
- dispatch a new prompt session from the +new row with session-scoped model picks
- expose a clickable live-session count in the status chrome
- preserve stable row order while initially focusing the current session
- support mouse hit-testing for floating orchestrator overlays
- add backend and frontend regression coverage for the lifecycle and UI helpers

* chore(release): map ticketclosed-wontfix noreply to GitHub login

* refactor(docker): drop build-essential from apt install (#27507)

build-essential is a Debian metapackage (libc6-dev + gcc + g++ + make + dpkg-dev). The Dockerfile already installs gcc + python3-dev + libffi-dev explicitly, which covers the C-ext compile cases lazy_deps may hit at first boot. g++/make/dpkg-dev aren't reached by the resolved [all]+[messaging] tree on current main (verified via uv sync --dry-run on cp313-linux).

Co-authored-by: Monty Taylor <mordred@inaugust.com>

* fix(codex-responses): gracefully recover from invalid_encrypted_content (salvage #10144) (#33035)

* fix(codex-responses): gracefully recover from invalid_encrypted_content (salvage #10144)

When an OpenAI-compatible Responses API surface accepts an initial request but later rejects the replayed `codex_reasoning_items` encrypted blob with HTTP 400 `invalid_encrypted_content`, the session previously got stuck retrying the same poisoned payload.
Recovery: classify the error as a dedicated FailoverReason, and on the first hit disable encrypted reasoning replay for the rest of the session, strip cached items from message history, and retry once.

Changes:
  * error_classifier: add FailoverReason.invalid_encrypted_content branch in _classify_400 (before context_overflow so the messages that mention 'encrypted content … could not be verified' don't trip context heuristics), in _classify_by_error_code, and extend _extract_error_code to peek inside wrapped JSON in error.message and ignore the bare '400' as a code.
  * agent_init: initialize `_codex_reasoning_replay_enabled = True` on every agent.
  * run_agent: add AIAgent._disable_codex_reasoning_replay() helper that flips the flag and pops cached items.
  * codex_responses_adapter: thread a `replay_encrypted_reasoning` kwarg through _chat_messages_to_responses_input so that when the flag is False we don't replay codex_reasoning_items.
  * transports/codex.py: read `replay_encrypted_reasoning` from params, thread it into the adapter, and gate the `include=['reasoning.encrypted_content']` request hint on it.
  * chat_completion_helpers: pass the agent's replay flag through to the transport.
  * conversation_loop: in the retry loop, add an invalid_encrypted_content recovery branch that fires once per session, only when api_mode == codex_responses, only when replay is still enabled, and only when at least one assistant message in history actually carries cached reasoning items (otherwise the 400 has nothing to do with our cache and the normal retry path handles it).

Tests:
  * test_error_classifier: new wrapped-JSON _extract_error_code case; new TestClassifyApiError cases proving the 400 is retryable with no fallback, that the broad message match doesn't catch a generic 'parsed' message, and that the error code match is case-insensitive.
  * test_run_agent_codex_responses: end-to-end test of the recovery branch firing once and disabling replay, plus a sibling test that proves the branch does *not* fire (and the flag stays True) when history has no cached reasoning items.

Salvages PR #10144 onto the post-refactor module layout (error_classifier / codex_responses_adapter / transports/codex / conversation_loop / agent_init) since the original diff was written against the pre-refactor monolithic run_agent.py.

* chore(release): map victorGPT in AUTHOR_MAP for #10144 salvage

---------

Co-authored-by: victorGPT <wuxuebin1993@gmail.com>

* fix(docker): targeted chown to preserve host file ownership in HERMES_HOME (#19795)

Replaces the recursive chown of $HERMES_HOME in stage2-hook.sh with a targeted approach: chown the top-level dir (so hermes can create new subdirs) plus the specific hermes-owned subdirectories (cron/, sessions/, logs/, hooks/, memories/, skills/, skins/, plans/, workspace/, home/, profiles/), the same canonical list seeded by the s6-setuidgid mkdir -p block below.

Avoids clobbering host-side file ownership when $HERMES_HOME is a bind mount that contains user-owned files not managed by hermes (issue #19788).

Original fix targeted docker/entrypoint.sh which is now a deprecated shim; retargeted to docker/stage2-hook.sh where the recursive chown moved during the s6-overlay rework.

Co-authored-by: Ptichalouf <1809721+ptichalouf@users.noreply.github.com>

* fix(docker): chown ui-tui and node_modules on UID remap so TUI esbuild works (#28851)

When HERMES_UID remaps the hermes user from 10000 to another UID (e.g. matching the host user's UID for bind-mount ergonomics), the TUI launcher's esbuild step fails:

  ✘ [ERROR] Failed to write to output file: open /opt/hermes/ui-tui/dist/entry.js: permission denied
  TUI build failed.
This is because the Dockerfile's build-time `chown -R hermes:hermes` on `/opt/hermes/{.venv,ui-tui,node_modules}` (line 154) wrote UID 10000, and stage2-hook.sh only re-chowned `.venv` on UID remap, leaving the TUI build trees still owned by the old UID.

Extend the stage2 re-chown to include the same set as the build-time chown: `.venv`, `ui-tui`, `node_modules`. These are the runtime-writable trees under $INSTALL_DIR; everything else under /opt/hermes is read-only at runtime so keeping it root-owned is fine.

Original fix targeted docker/entrypoint.sh which is now a deprecated shim; retargeted to docker/stage2-hook.sh where the .venv chown moved during the s6-overlay rework.

Co-authored-by: Andreas Steffan <623481+deas@users.noreply.github.com>

* feat(docker): upgrade Node to 22 LTS via multi-stage from node:22-bookworm-slim (#4977)

Debian trixie's bundled `nodejs` package is pinned to 20.19.2, which reached LTS EOL in April 2026. Trixie won't upgrade in place; Debian 14 (forky), where the apt nodejs is 24.x, isn't released until ~mid-2027.

To stay on a supported LTS without waiting for Debian 14, copy node + npm + corepack from the upstream `node:22-bookworm-slim` image as a multi-stage source, matching the existing `uv_source` and `gosu_source` patterns in the Dockerfile. Bookworm-based slim image is used so the produced binary links against glibc 2.36, which runs cleanly on Debian 13 (trixie, glibc 2.41).

Changes:
- Add `FROM node:22-bookworm-slim@sha256:... AS node_source` stage
- Remove `nodejs npm` from `apt-get install` (now sourced from node_source)
- Add `ca-certificates` explicitly to apt install (was a transitive of the apt nodejs package; removing nodejs broke the chain and curl inside the build failed with "error setting certificate file")
- COPY node binary + npm + corepack from node_source; recreate the symlinks at /usr/local/bin/{npm,npx,corepack}
- Update the npm_config_install_links=false comment block: npm 10's default is already `install-links=false`, but we keep the env as defense-in-depth against future Node-source-version regressions

Future bumps to Node 24/26 are a one-line ARG change.

Validation:
- Built --no-cache against current origin/main; build succeeds in 1m42s
- Image size: 3.27 GB (pre-salvage-1 baseline) → 3.14 GB (this PR); net 130 MiB savings (60 MiB from this change alone vs current main, from removing apt nodejs+transitive deps that duplicated what node bundles)
- Node 22.22.3 / npm 10.9.8 / esbuild 0.27.7 all run cleanly under trixie's glibc 2.41
- Standard image smoke (6/6), Node-version E2E (8/8), chown E2E from #19788 (6/6), TUI UID-remap E2E from #28851 (4/4): 24 checks total

Co-authored-by: Prithvi Monangi <8312237+Prithvi1994@users.noreply.github.com>

* ci(docker): add shellcheck shell=sh directive to main-wrapper.sh

shellcheck doesn't recognize the s6-overlay `#!/command/with-contenv sh` shebang and aborts with SC1008 ("This shebang was unrecognized. ShellCheck only supports sh/bash/dash/ksh/'busybox sh'. Add a 'shell' directive to specify."). The error fires at --severity=error too, so it fails the "Docker / shell lint" CI job on every PR that touches docker/.

Add the canonical `# shellcheck shell=sh` directive, the same fix already applied to the sibling cont-init.d scripts (`02-reconcile-profiles` and `015-supervise-perms`) when they adopted the with-contenv shebang.
The shebang was changed from `#!/bin/sh` → `#!/command/with-contenv sh` in PR #32412 (commit 29c71e9) to fix env-propagation through s6's PID 1. The shellcheck-directive line was missed in that PR; this patches it.

Reproduces locally:

  docker run --rm -v "$PWD:/mnt" -w /mnt koalaman/shellcheck:stable \
    --severity=error --format=gcc docker/main-wrapper.sh

Before: docker/main-wrapper.sh:1:1: error: [SC1008] (rc=1)
After: (no output) (rc=0)

Script behavior is unchanged: the directive is a comment, and `sh -n` / `bash -n` parse the file cleanly either way.

* fix(docker): mkdir HERMES_HOME as root in stage2 before chown / privilege drop (#18488)

When HERMES_HOME points at a custom path whose parent directories only root can create (e.g. HERMES_HOME=/home/hermes/.hermes in a Compose file, or any path under a fresh / not pre-populated by the image), stage2-hook.sh fails on first boot:

  [stage2] Warning: chown failed (rootless container?) - continuing
  mkdir: cannot create directory '/custom': Permission denied
  mkdir: cannot create directory '/custom': Permission denied
  ... (one per s6-setuidgid hermes mkdir invocation)
  cont-init: info: /etc/cont-init.d/01-hermes-setup exited 1

The mkdirs fail because s6-setuidgid drops to hermes (UID 10000) before invoking mkdir -p, and the runtime user has no permission to create root-owned ancestor directories. 02-reconcile-profiles then crashes with FileNotFoundError, .install_method never lands, and the container limps on in a half-initialized state.

Bootstrap HERMES_HOME with mkdir -p while still root, before the ownership normalization. Idempotent on the default /opt/data path (directory already exists from the Dockerfile RUN mkdir -p) and on any subsequent restart. (#18482)

Retargeted from the original PR's docker/entrypoint.sh (now a deprecated shim) to docker/stage2-hook.sh where the related chown logic moved during the s6-overlay rework.
Co-authored-by: wpengpeng168 <133926080+wpengpeng168@users.noreply.github.com>

* refactor(codex): drop SDK responses.stream() helper; consume events directly (#33042)

* refactor(codex): drop SDK responses.stream() helper; consume events directly

The OpenAI Python SDK's high-level `client.responses.stream(...)` helper does post-hoc typed reconstruction from the terminal `response.completed.response.output` field. The chatgpt.com Codex backend has been observed (today, gpt-5.5) to ship `response.output = null` on terminal frames, which crashes the SDK with `TypeError: 'NoneType' object is not iterable` mid-iteration.

Carlton's #32963 patched the symptom by wrapping the helper in try/except and recovering from the same per-event accumulator the SDK was supposed to populate. This PR removes the helper from the call path entirely: we now use `client.responses.create(stream=True)` (raw AsyncIterable of SSE events) and assemble the final response object ourselves from `response.output_item.done` events as they arrive. The terminal event's `output` field is never read for content.

Same strategy OpenClaw uses for the same backend. This makes Hermes structurally immune to the bug class, not patched. The next time OpenAI ships a shape change to chatgpt.com's terminal frame, our consumer keeps working because it doesn't read that frame for content, only for usage/status/id.

Changes
- `agent/codex_runtime.py`: new `_consume_codex_event_stream()` shared consumer; `run_codex_stream()` uses `responses.create(stream=True)`; `run_codex_create_stream_fallback()` collapses into a thin alias since the primary path now does what the fallback used to do.
- `agent/auxiliary_client.py`: `_CodexCompletionsAdapter` uses the same consumer; old null-output recovery helpers deleted as unreferenced.
- Tests migrated: fixtures that mocked `responses.stream` now mock `responses.create` returning a raw iterable.
  New regression test asserts the auxiliary path returns streamed items even when the terminal event's `output` is literally `null`.

Validation
- Live: tested against fresh OAuth on `chatgpt.com/backend-api/codex` with `gpt-5.5`. Response built correctly with `response.output=null` on the terminal frame, all events consumed, usage/reasoning tokens propagated.
- `tests/run_agent/test_run_agent_codex_responses.py` + `tests/agent/test_auxiliary_client.py`: 242 passed.

* test+fix(codex): migrate streaming tests, raise on truncated streams

CI surfaced 10 test failures across tests/run_agent/test_streaming.py and tests/run_agent/test_codex_xai_oauth_recovery.py. Both files had their own `responses.stream(...)` mocks I missed in the first sweep.

agent/codex_runtime.py: _consume_codex_event_stream() now raises "Codex Responses stream did not emit a terminal response" when the stream ends without any terminal frame AND no usable content. This preserves the signal callers used to get from the SDK's high-level helper, which they distinguished from "completed with empty body" in error handling.

Tests migrated:
- test_streaming.py: text-delta callback, activity-touch, and remote-protocol-error tests all switch from mocking responses.stream to responses.create returning an iterable of events.
- test_codex_xai_oauth_recovery.py: prelude-error tests are recast as wire-error-event tests (the new path raises _StreamErrorEvent directly when the wire emits type=error, which is strictly better than the old two-phase "SDK RuntimeError → retry → fallback"). The retry-on-transport-error test moves from responses.stream side-effect to responses.create side-effect.

Verified live against chatgpt.com Codex with gpt-5.5: AIAgent.chat() through the full codex_responses path returns correctly, 319/319 targeted tests passing.
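
The consumption strategy from the two codex commits above (assemble output from `response.output_item.done` events, read the terminal frame only for id/status/usage, raise when the stream ends with neither a terminal frame nor content) might look roughly like this. It is a simplified sketch, not `_consume_codex_event_stream()` itself: events are modelled as plain dicts rather than SDK objects, only `response.completed` is treated as terminal, and `StreamError` and `consume_events` are hypothetical names.

```python
from typing import Any, Dict, Iterable, List, Optional


class StreamError(RuntimeError):
    """Hypothetical stand-in for the _StreamErrorEvent named in the commit."""


def consume_events(events: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Build a response from streamed events without reading terminal output."""
    items: List[Any] = []
    terminal: Optional[Dict[str, Any]] = None

    for event in events:
        etype = event.get("type")
        if etype == "error":
            # Surface wire-level error events immediately.
            raise StreamError(str(event.get("message", "stream error")))
        if etype == "response.output_item.done":
            # Content is accumulated per item as it arrives.
            items.append(event["item"])
        elif etype == "response.completed":
            # Kept only for metadata; its `output` may be null.
            terminal = event.get("response") or {}

    if terminal is None and not items:
        raise RuntimeError(
            "Codex Responses stream did not emit a terminal response"
        )

    meta = terminal or {}
    return {
        "id": meta.get("id"),
        "status": meta.get("status"),
        "usage": meta.get("usage"),
        "output": items,  # never taken from the terminal frame
    }
```

Because `output` comes only from the accumulated items, a terminal frame with `"output": None` yields the same result as one with a populated list.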
* remove Vercel AI Gateway and Vercel Sandbox (#33067)

* remove Vercel AI Gateway provider and Vercel Sandbox terminal backend

Both Vercel-hosted integrations are removed end-to-end. Users on the AI Gateway should switch to OpenRouter or one of the other aggregators (Nous Portal, Kilo Code). Users on the Vercel Sandbox backend should switch to Docker, Modal, Daytona, or SSH.

What's removed:
- `plugins/model-providers/ai-gateway/` provider plugin
- `hermes_cli/vercel_auth.py` Vercel-Sandbox auth helper
- `tools/environments/vercel_sandbox.py` terminal backend
- `ai-gateway` provider wiring across auth, doctor, setup, models, config, status, providers, main, web_server, model_normalize, dump
- `vercel_sandbox` backend wiring across terminal_tool, file_tools, code_execution_tool, file_operations, approval, skills_tool, environments/local, credential_files, lazy_deps, prompt_builder, cli, gateway/run
- `AI_GATEWAY_BASE_URL` constant, `_AI_GATEWAY_HEADERS` auxiliary-client header set, run_agent base-URL header/reasoning special-cases
- `[vercel]` pyproject extra and `vercel`/`vercel-workers` from uv.lock
- env vars: `AI_GATEWAY_API_KEY`, `AI_GATEWAY_BASE_URL`, `VERCEL_TOKEN`, `VERCEL_PROJECT_ID`, `VERCEL_TEAM_ID`, `VERCEL_OIDC_TOKEN`, `TERMINAL_VERCEL_RUNTIME`
- Tests: deletes test_ai_gateway_models.py and test_vercel_sandbox_environment.py; scrubs references across 23 surviving test files (no entire tests deleted unless they were dedicated to AI Gateway / Sandbox)
- Docs: provider tables, env-var reference, setup guides, security notes, tool config, terminal-backend tables (English plus zh-Hans i18n parity)
- `hermes-agent` skill: provider table entry and remote-backend list

What stays (intentional):
- `popular-web-designs/templates/vercel.md`: CSS design reference, unrelated to Vercel-the-AI-product
- `x-vercel-id` in `stream_diag.py` headers: generic Vercel CDN response header, useful diag signal on any Vercel-hosted endpoint
- `vercel-labs/agent-browser` URL in browser config: lightpanda browser project, different OSS effort
- `userStories.json` historical contributor entry mentioning Vercel Sandbox: archive, not active docs

Validation:
- 1153 tests in the 22 targeted files pass (`scripts/run_tests.sh`)
- Full repo `py_compile` clean
- Live import of every touched module + invariant check (no `ai-gateway` in `PROVIDER_REGISTRY`, no `_AI_GATEWAY_HEADERS`, no `vercel_sandbox` in `_REMOTE_TERMINAL_BACKENDS`)

* test: convert profile-count check from change-detector to invariant

The hardcoded "== 34" assertion broke when ai-gateway was removed. Per AGENTS.md change-detector-test guidance, assert the relationship (registry count >= number of plugin dirs) instead of a literal count. Counts shift when providers are added/removed; that's expected.

* feat(api-server): add GET /v1/skills and /v1/toolsets (#33016)

Lets external clients enumerate the agent's skills and resolved toolsets deterministically over the OpenAI-compatible API server, without standing up the dashboard web server or sending a chat message and asking the model to list them.

- GET /v1/skills: list installed skills (name, description, category)
- GET /v1/toolsets: list toolsets resolved for the api_server platform, with enabled/configured state and the concrete tool names each expands to
- Both gated by API_SERVER_KEY (same Bearer scheme as every other /v1/* endpoint)
- /v1/capabilities advertises both new endpoints

Closes the gap a community user just hit asking how to list skills over REST when only the OpenAI-compatible server is running.
Test plan
- python -m pytest tests/gateway/test_api_server.py -k "Skills or Toolsets or Capabilities" -o 'addopts=' -q → 9/9 pass
- python -m pytest tests/gateway/test_api_server.py -o 'addopts=' -q → 156/156 pass, no regressions
- E2E: started a real adapter on an isolated HERMES_HOME with a fake skill installed; curl-equivalent calls to /v1/capabilities, /v1/skills, /v1/toolsets returned the expected JSON; with API_SERVER_KEY configured, unauthenticated calls returned 401.

* feat(nix): add #messaging and #full package variants (#33108)

* fix(plugins/discord): correct install_hint extra to [messaging]

The Discord platform registered install_hint pointing at 'hermes-agent[discord]', but pyproject.toml has no [discord] extra; the deps live in [messaging] alongside Telegram and Slack. Users hitting "Platform 'Discord' requirements not met" were directed at a pip command that installs nothing.

* feat(nix): add #messaging and #full package variants

Make Discord/Telegram/Slack work out of the box for `nix profile install` users. Messaging deps were dropped from [all] on 2026-05-12 in favor of lazy-install, but lazy-install can't write to the read-only /nix/store, so users hit "No adapter available for discord" with no actionable guidance.

- #messaging: pre-built with discord.py/telegram/slack (+33 MB venv)
- #full: all 18 platform-portable extras + matrix on Linux only (python-olm lacks Darwin PyPI wheels) (+738 MB venv)

Also adds a `messaging-variant` flake check that verifies `import discord` succeeds in the sealed venv, as a regression guard for the lazy-install migration.

Docs updated: Quick Start callout, extraDependencyGroups rewrite with messaging as primary example + full extras table, troubleshooting row, cheatsheet row.
Closure size deltas (measured x86_64-linux):

  default    1792 MB pkg /  512 MB venv
  messaging  1826 MB pkg /  546 MB venv (+33 MB)
  full       2530 MB pkg / 1250 MB venv (+738 MB)

* chore(nix): trim variant comments + alphabetize full extras

Drop the date-stamped changelog from messaging-variant's comment and the "+33 MB / +704 MB" numbers from the variant defs. Those drift and belong in the PR description, not source. Alphabetize the 18-extra list in #full so future additions produce clean one-line diffs.

No semantic change. messaging-variant check still passes.

* fix(codex): update silent-hang workaround hint

* chore(release): map EvilHumphrey noreply for #33034 salvage

* feat: add API server session controls

* Support media in session chat API

* chore(api-server): mark skills_api capability True now that /v1/skills shipped

#33016 added GET /v1/skills + /v1/toolsets on the API server; the capability flag introduced in this branch was placeholder-False. Flip to True so capability probers see the truth.

* feat(catalog): add qwen3.7-max to alibaba + alibaba-coding-plan model lists

Alibaba's latest flagship Qwen model is released but not yet present in the DashScope (alibaba) or Alibaba Coding Plan curated catalogs. Add it so it shows up in the /model picker and setup wizard for those providers.

OpenCode Go routing for qwen3.7-max already landed via #32780 (commit 2fc77c53f). OpenRouter + Nous catalog entries already landed via #32809 (commit ccd3d04fc). This salvage picks up the remaining alibaba / alibaba-coding-plan entries from #32806. The AI Gateway entry is dropped because Vercel AI Gateway was removed in #33067.
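
For the GET /v1/skills and GET /v1/toolsets endpoints from #33016 above, a client request could be built along these lines. This is only a sketch: the commit states the paths and the Bearer scheme keyed by API_SERVER_KEY, while the base URL, port, key value, and the `build_request` helper are placeholders, and the JSON envelope of the response is not specified in the commit.

```python
import urllib.request


def build_request(base_url: str, path: str, api_key: str) -> urllib.request.Request:
    """Build an authenticated GET request (hypothetical helper)."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}{path}",
        # Same Bearer scheme the commit describes for every /v1/* endpoint.
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


# Placeholder address and key; substitute the values of a running API server.
skills_req = build_request("http://127.0.0.1:8000", "/v1/skills", "example-key")
toolsets_req = build_request("http://127.0.0.1:8000/", "/v1/toolsets", "example-key")

# To send it against a live server:
#   with urllib.request.urlopen(skills_req) as resp:
#       body = resp.read().decode("utf-8")
```

Per the E2E note in the commit, a request without the Authorization header is expected to return 401 when a key is configured.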
* test(codex): cover null output stream terminal events

* chore(release): map superearn-fisher noreply for #33122 salvage

* plugins: add security-guidance -- pattern-matched warnings on dangerous code writes (#33131)

New opt-in plugin that scans the content passed to write_file / patch / skill_manage for 25 known-dangerous code patterns (pickle.load, yaml.load, eval(, os.system, subprocess(shell=True), child_process.exec, dangerouslySetInnerHTML, innerHTML/outerHTML/document.write/insertAdjacentHTML, crypto.createCipher (no IV), AES ECB, TLS verification disabled, XXE-prone xml.etree/minidom parsers, <script src=//...> without SRI, torch.load without weights_only=True, GitHub Actions ${{ github.event.* }} injection) and appends a "Security guidance" warning block to the tool result via the transform_tool_result hook.

Default behaviour is non-blocking: the file is written and the warning rides back to the model in the next turn so it can self-correct or document why the construct is safe. SECURITY_GUIDANCE_BLOCK=1 upgrades to refusing the write entirely; SECURITY_GUIDANCE_DISABLE=1 is the kill switch.

Pattern data (patterns.py) is a verbatim Apache-2.0 fork of Anthropic's claude-plugins-official/plugins/security-guidance/hooks/patterns.py at commit 0bde168 (2026-05-26). LICENSE and NOTICE preserve attribution. The Hermes-side plugin glue (__init__.py, plugin.yaml, README.md, tests) is original work.

Plugin is opt-in like all bundled plugins:

  hermes plugins enable security-guidance

Inspired by https://x.com/ClaudeDevs/status/1927108527247... Anthropic shipped this as their security-guidance plugin for Claude Code on 2026-05-26 with a measured 30-40% reduction in security-related PR comments on internal rollout.

What's NOT ported (deferred):
  * Layer 2 (LLM diff review on turn end): would route through main model by default on Hermes, real money on reasoning models. A follow-up can wire it to a cheap aux model with explicit opt-in.
  * Layer 3 (agentic commit-time review): agent can run this on demand via delegate_task today.
  * .hermes/security-guidance.md project-rules file: only used by layers 2/3 upstream.

* test(dashboard): pin current loopback auth behavior as regression harness

Phase 0, Task 0.1 of the dashboard-oauth plan. Establishes a baseline for the loopback dashboard's auth surface so future phases can prove they didn't regress the existing _SESSION_TOKEN flow when adding the OAuth gate.

* feat(dashboard): add should_require_auth predicate for OAuth gate

Phase 0, Task 0.2. Single source of truth for 'is the auth gate active?'. Reuses the existing _LOOPBACK_HOST_VALUES frozenset so this stays in sync with the DNS-rebinding host-header check. RFC1918/CGNAT/link-local are treated as public, which is the exact threat model the gate exists for.

* feat(dashboard): stash auth_required flag on app.state

Phase 0, Task 0.3. start_server now computes should_require_auth(host, allow_public) and records it on app.state.auth_required BEFORE the existing legacy SystemExit guard fires. This gives middleware, the SPA token-injection path, and WS endpoints a consistent read source for 'is the gate active'. The flag is set but no one reads it yet; Phase 3 registers the gate middleware.

Note: 4 pre-existing test failures in tests/hermes_cli/test_web_server.py (PtyWebSocket) + test_update_hangup_protection.py reproduce on pristine HEAD and are unrelated to this change (starlette TestClient WS regression).

* feat(dashboard-auth): define DashboardAuthProvider ABC + Session dataclass

Phase 1, Task 1.1. New package hermes_cli/dashboard_auth/ contains:

base.py - DashboardAuthProvider ABC with 5 abstract methods (start_login, complete_login, verify_session, refresh_session, revoke_session), Session + LoginStart frozen dataclasses, three exception types (ProviderError / InvalidCodeError / RefreshExpiredError), and assert_protocol_compliance() for plugins to call in their own tests.
registry.py - Module-level register/get/list/clear with a lock.

Nothing reads the registry yet. Phase 2 adds the StubAuthProvider and Phase 3 wires the gate middleware. The plugin hook lands in Task 1.3.

* test(dashboard-auth): cover registry register/get/list/clear semantics

Phase 1, Task 1.2. Verifies registration order is preserved, duplicate names are rejected with ValueError, and non-compliant providers fail at register time (not later when the middleware tries to dispatch).

* feat(plugins): add register_dashboard_auth_provider hook on PluginContext

Phase 1, Task 1.3. Mirrors the existing register_image_gen_provider pattern (plugins.py:531): wrong-type or duplicate-name registrations log at WARNING and silently return rather than raising, so a misbehaving auth plugin cannot crash the host.

Deviation from plan: the plan's draft raised TypeError on non-provider input; switched to silent-warn to match the established image_gen convention. Test updated to match.

* feat(dashboard-auth): json-lines audit log at $HERMES_HOME/logs/dashboard-auth.log

Phase 1, Task 1.4. Records every auth event (login start/success/failure, logout, refresh success/failure, revoke, session verify failure, WS ticket mint) as one JSON object per line. Token-like kwargs (access_token, refresh_token, code, code_verifier, state, ticket, cookie, Authorization) are dropped before serialisation so the log never contains live secrets. Write failures log at WARNING but never raise, because auth flows must not fail when the audit logger breaks.

* test(dashboard-auth): stub auth provider for E2E gate testing

Phase 2, Task 2.1. Self-contained fake IDP: start_login redirects straight back to {redirect_uri}?code=stub_code&state=<s> so tests can walk the OAuth round trip in-process. Tokens are HMAC-signed JSON blobs (not real JWTs), which is enough structure for verify_session to detect tamper and expiry without pulling in pyjwt.

Lives in tests/ only, never registered as a real plugin.
Phase 3's end-to-end tests import StubAuthProvider directly. Convention: exp <= now counts as expired (TTL=0 means born-expired) — matches what Phase 6's silent-refresh test will need. * feat(dashboard-auth): cookie helpers for session_at/session_rt/pkce Phase 3, Task 3.1. Three cookies: - hermes_session_at: OAuth access token (HttpOnly, TTL = token TTL) - hermes_session_rt: OAuth refresh token (HttpOnly, 30d max-age) - hermes_session_pkce: PKCE state + verifier + provider hint (10min) All SameSite=Lax + Path=/. Secure flag is set ONLY when the request scheme is https — uvicorn proxy_headers=True (enabled in gated mode at Phase 3.5) rewrites scheme from X-Forwarded-Proto so Fly's TLS terminator works. * feat(dashboard-auth): auth gate middleware + /auth/* routes + /login HTML Phase 3, Tasks 3.2 + 3.3 + 3.4. These three pieces are mutually dependent so they land together. middleware.py - gated_auth_middleware engages when app.state.auth_required is True. Allowlists /login, /auth/*, /api/auth/providers, and static asset paths; everything else demands a valid session_at cookie. Verifies by trying every registered provider's verify_session in turn (multi- provider stack); attaches verified Session to request.state.session. Returns 401 JSON for /api/* and 302 -> /login for HTML. ProviderError during verify -> 503. routes.py - APIRouter with: GET /login server-rendered HTML GET /auth/login?provider=N 302 to IDP + PKCE cookie GET /auth/callback?code,state completes login, sets session cookies POST /auth/logout clears cookies + best-effort revoke GET /api/auth/providers public bootstrap endpoint (503 if zero) GET /api/auth/me verified session as JSON (auth-required) login_page.py - Inline-CSS HTML template, no React, no JavaScript. web_server.py - Mounted gated_auth_middleware between host_header and auth_middleware (FastAPI runs middlewares in registration order: host check -> cookie auth -> token auth). 
auth_middleware short-circuits when auth_required so cookie auth is authoritative in gated mode. Router is included before mount_spa so the catch-all doesn't swallow /login or /auth/*. 17 new behavioural tests; loopback regression harness still green. * feat(dashboard-auth): fail-closed on no providers; proxy_headers when gated; suppress _SESSION_TOKEN injection Phase 3, Task 3.5. Three changes to web_server.py: 1. start_server replaces the legacy SystemExit-refusing-to-bind guard with: if app.state.auth_required and no providers registered, exit with a clear message; otherwise log the gate-on banner. --insecure keeps its existing behaviour. 2. uvicorn proxy_headers flag is computed from app.state.auth_required. Loopback / --insecure keep it False (so _ws_client_is_allowed sees the real peer for the loopback gate); gated mode flips it True so X-Forwarded-Proto from Fly's TLS terminator is honoured for cookie Secure-flag decisions in detect_https(). 3. _serve_index no longer injects window.__HERMES_SESSION_TOKEN__ when the gate is on — the SPA reads identity from /api/auth/me using cookie auth instead. window.__HERMES_AUTH_REQUIRED__ flag lets the SPA pick between ticket-auth (gated) and token-auth (loopback) for /api/pty + /api/ws (Phase 5 will wire this in the React layer). 4 new behavioural tests; loopback regression harness still green. * docs(dashboard-auth): plan v2 — incorporate Portal OAuth contract (PR #180) Adds a 'Contract Anchor' section at the top of the plan summarizing the 11 material findings from nous-account-service PR #180's published contract. Rewrites Phase 4 (Nous provider) and Phase 6 (re-auth UX) in-place; the v1 drafts are preserved inline marked 'rejected — preserved for archeology' for reviewer context. Phases 0–3 (already shipped) are unaffected — they set up gate engagement and cookie plumbing only. The cookies module's RT cookie becomes dead in Phase 6 task 6.3 and is removed there. 
Key contract-driven reversals: - client_id is per-instance (agent:{id}), env-injected — not static - audience is bare client_id, not 'hermes-cli:' prefixed - scope is 'agent_dashboard:access' only - JWT claims do NOT include email/name — surface user_id instead - no refresh tokens in V1 — 401 → redirect to /login - JWKS-only verification, no userinfo fallback - redirect_uri is exact-match per AgentInstance, not wildcard Phase 7's AuthWidget needs to display user_id (truncated) instead of email; one-line annotation added at the top of that phase. * feat(dashboard-auth): plugins/dashboard_auth/nous — contract-compliant Nous OAuth provider Bundled, kind=backend, auto-loads. Activates ONLY when Portal-injected env vars are present: HERMES_DASHBOARD_OAUTH_CLIENT_ID — agent:{instance_id} HERMES_DASHBOARD_PORTAL_URL — Portal base URL Loopback / --insecure operators leave both unset and never see this plugin register anything. The fail-closed branch in start_server handles the 'public bind + zero providers' case independently. Implementation follows nous-account-service PR #180's published OAuth contract verbatim: - client_id is per-instance (agent:{instance_id}); the suffix is cross-checked against the token's agent_instance_id claim as defense-in-depth (contract C9). - scope is agent_dashboard:access only (contract C3). - aud is the bare client_id, no hermes-cli: prefix (contract C2). - RS256 JWT verification against /.well-known/jwks.json with 5-minute cache (contract C7). - No refresh tokens in V1: refresh_session always raises RefreshExpiredError; revoke_session is a no-op (contract C5). - oauth_contract_version claim: missing → warn + proceed; present and != 1 → refuse (contract C11, OQ-C2 tolerant treatment). - redirect_uri validated client-side as defense before bouncing to Portal; authoritative check is server-side per agent-redirect-uri.ts. 
41 new tests covering construction, plugin-entry env gating, start_login shape, complete_login httpx-mocked happy path + error mapping, verify_session JWT verification (RSA keypair fixture, full claim-check matrix), refresh_session always raising, revoke_session no-op. PyJWT + cryptography are already in the venv (jose was previously suggested; switched to pyjwt[crypto] since the latter is already pulled in transitively). * feat(dashboard-auth): single-use WS tickets + POST /api/auth/ws-ticket Phase 5 task 5.1. Browsers cannot set Authorization on a WebSocket upgrade, so in gated mode the SPA needs an alternative way to bind the upgrade to its authenticated session. hermes_cli/dashboard_auth/ws_tickets.py — in-memory single-use ticket store with 30s TTL. Thread-safe (threading.Lock), token_urlsafe(32) values, ticket value truncated to 8 chars in error messages for log hygiene. Module-level state with _reset_for_tests() helper. hermes_cli/dashboard_auth/routes.py — adds POST /api/auth/ws-ticket. Auth-required (the gate middleware already attaches Session to request.state.session). Returns {ticket, ttl_seconds}; emits WS_TICKET_MINTED audit event with user_id + provider + ip. hermes_cli/dashboard_auth/audit.py — adds WS_TICKET_REJECTED enum value for the consume-side rejection event (wired into the WS endpoints in task 5.2). 11 new tests covering round-trip, single-use, TTL boundary, unknown ticket rejection, secret-hygiene truncation in error messages, and concurrent mint+consume from 20 threads. * feat(dashboard-auth): _ws_auth_ok helper + ticket auth on all 4 WS endpoints Phase 5 task 5.2. Four WebSocket endpoints — /api/pty, /api/ws, /api/pub, /api/events — previously authed with the same constant-time check against `_SESSION_TOKEN`. Replaced with a single helper that branches on `app.state.auth_required`: Loopback / --insecure: legacy ?token=<_SESSION_TOKEN> path (unchanged). Gated: ?ticket=<single-use> consumed against the dashboard-auth ticket store. 
Critical security property: gated mode UNCONDITIONALLY rejects the ?token= path. A leaked _SESSION_TOKEN value from a log line is not replayable for WS access in gated deployments. `_build_sidecar_url` now branches too: loopback uses the legacy token; gated mode mints a server-internal ticket via mint_ticket() with pseudo-user 'pty-sidecar' / provider 'server-internal' so audit logs can distinguish PTY-internal sidecar tickets from browser tickets. PTY children open /api/pub exactly once at startup so single-use suffices. Ticket rejections audit-log as WS_TICKET_REJECTED with truncated reason + client IP + WS path. Operators debugging 'WS keeps closing' issues see which endpoint and why. 17 new tests: - POST /api/auth/ws-ticket: 200 with cookie, 401/302 without, distinct per call, GET-not-allowed. - _ws_auth_ok loopback: token accept/reject, missing-token reject, ticket-param-ignored. - _ws_auth_ok gated: ticket accept, single-use rejection, unknown reject, legacy-token-rejected-in-gated assertion, audit-log emission. - _build_sidecar_url: loopback uses token=, gated uses ticket=, no-bound returns None. * feat(dashboard-auth): SPA WS auth — getWsTicket() + buildWsAuthParam() Phase 5 task 5.3. The dashboard's three WS-using surfaces (ChatPage, gatewayClient, ChatSidebar) previously hardcoded ?token=<session>. In gated mode the server rejects that path; the SPA must mint a single-use ticket via POST /api/auth/ws-ticket and pass ?ticket= on the upgrade. web/src/lib/api.ts: adds getWsTicket() (POST /api/auth/ws-ticket with credentials: 'include') and buildWsAuthParam() — a helper that returns ['ticket', <minted>] in gated mode and ['token', <session>] in loopback. Window.__HERMES_AUTH_REQUIRED__ is read from the server-injected bootstrap script and toggles the path. Documented as the bridge from cookie auth (REST) to WS auth. web/src/pages/ChatPage.tsx: buildWsUrl() now takes an [authName, authValue] pair instead of a bare token. 
The WS construct is wrapped in an IIFE so the outer effect can stay synchronous (the cleanup returns the effect's disposer at top level). onDataDisposable + onResizeDisposable hoisted to `let` bindings the cleanup closes over. web/src/lib/gatewayClient.ts: connect() branches on window.__HERMES_AUTH_REQUIRED__ before opening /api/ws. Explicit token overrides win (test-only path); otherwise gated → fetch ticket, loopback → use injected session token. web/src/components/ChatSidebar.tsx: events-feed WS opens through the same IIFE pattern as ChatPage. The ws local is hoisted so the cleanup's ws?.close() works after the async mint resolves. Server side already injects window.__HERMES_AUTH_REQUIRED__ in _serve_index (Phase 3.5). * feat(dashboard-auth): Phase 6 — 401 re-auth envelope + next= propagation Contract V1 of nous-account-service PR #180 ships no refresh tokens, so the original Phase 6 silent-refresh design is replaced with a thinner '401 → redirect to /login' UX. The dashboard's gated middleware now emits a structured envelope on any auth failure; the SPA's fetch wrapper sees it and full-page-navigates the user through re-auth. hermes_cli/dashboard_auth/cookies.py: set_session_cookies(refresh_token='') SKIPS writing the hermes_session_rt cookie. Forward-compat: a non-empty refresh_token still emits the cookie unchanged, so a future Portal contract that starts issuing RTs flips the persistence on with no other change. clear_session_cookies still emits a Max-Age=0 deletion for the RT cookie so stale cookies from earlier deployments get flushed on logout / session expiry. Deprecation marker + rationale in module docstring per the user's docstring-only deprecation pattern. 
hermes_cli/dashboard_auth/middleware.py: _unauth_response now builds a structured JSON envelope for API 401s: { error: 'session_expired' | 'unauthenticated', detail: 'Unauthorized', reason: <internal>, login_url: '/login?next=<safe-path>' } HTML redirects also carry next= so a user landing on /sessions without a cookie bounces back to /sessions after re-auth. _safe_next_target validates same-origin: drops protocol-relative paths (//evil.com), absolute URLs, and any /login or /auth/* loop. Dead cookies are cleared on the 401 path so the browser stops replaying invalid tokens. hermes_cli/dashboard_auth/routes.py: /auth/callback accepts next= query param and validates via _validate_post_login_target (same rules as the gate's _safe_next_target — defence-in-depth because next= survived a full IDP round trip and attacker-controlled state can re-enter via the callback URL). Open-redirect attempts land at '/' instead. web/src/lib/api.ts: fetchJSON parses the 401 envelope and full-page-navigates to body.login_url ONLY on the known session-expiry error codes. Domain-level 401s (e.g. permission errors) bubble up as regular errors. credentials: 'include' added so cookie auth works for all fetches routed through this wrapper. sessionStorage.lastLocation is preserved for future use by AuthWidget / hermes_status. Test files marked with pytest.mark.xdist_group so the four files that mutate web_server.app.state.auth_required serialize onto the same xdist worker — eliminates 'works locally, fails in CI' app-state bleed. 
20 new tests in test_dashboard_auth_401_reauth.py: - set_session_cookies(refresh_token='') skips RT cookie - clear_session_cookies still emits RT deletion - 401 envelope shape (unauthenticated vs session_expired) - dead cookie cleared on invalid-token 401 - login_url carries next= for deep paths - login loop avoided when path is /login/auth/api-auth - protocol-relative URL rejected - _safe_next_target unit tests (accept same-origin, reject loops/abs) - /auth/callback respects safe next= but rejects open redirects 2 pre-existing tests updated to accept the new /login?next=%2F shape. Full dashboard-auth suite: 168 passed, 1 skipped (Phase 0 pre-existing). * feat(dashboard-auth): Phase 7 — SPA AuthWidget + /api/status auth fields Phase 7 surfaces the OAuth gate state to users. web/src/components/AuthWidget.tsx (new): Sidebar widge…

…HOME (#35027 regression) (#38556)

The stage2 hook gates the recursive chown of the build trees under $INSTALL_DIR (.venv, ui-tui, node_modules) so that a HERMES_UID/PUID remap leaves them writable by the new runtime UID. That is needed for the lazy_deps 'uv pip install' of platform extras (#15012, #21100) and the TUI esbuild rebuild into ui-tui/dist (#28851).

#35027 folded that chown under the $HERMES_HOME ownership check ('stat $HERMES_HOME != hermes_uid'). But 'usermod -u <new> hermes' re-chowns the hermes home dir ($HERMES_HOME == /opt/data) to the new UID as a side effect, so after any remap that stat is already satisfied and needs_chown is false, which silently skips the build-tree chown on the common PUID/NAS path. The venv stays owned by the build-time UID (10000), so lazy installs and TUI rebuilds fail with EACCES.

Probe the build trees directly instead: chown only when /opt/hermes/.venv is not already owned by the runtime hermes UID. This is independent of $HERMES_HOME ownership and idempotent across restarts.

Verified live: built the image, booted with HERMES_UID/HERMES_GID on a fresh named volume, confirmed .venv/ui-tui/node_modules end up owned by the remapped UID and 'uv pip install' into the venv succeeds; confirmed the recursive chown fires once and is skipped on restart.
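The "probe the build trees directly" check described above can be sketched roughly as follows. This is a minimal illustration, not the actual stage2 hook: the function name `needs_build_tree_chown` is hypothetical, and it assumes GNU `stat` (the `-c '%u'` form) is available in the image.

```shell
#!/bin/sh
# Hypothetical helper (not taken from the real stage2 hook).
# Exit status 0: the probed path exists and is owned by a different UID,
#                so a recursive chown is needed.
# Exit status 1: the path is missing or already owned by the wanted UID,
#                so the chown can be skipped (idempotent across restarts).
needs_build_tree_chown() {
    probe_path="$1"
    runtime_uid="$2"
    # Assumption: GNU coreutils stat; '%u' prints the numeric owner UID.
    current_uid="$(stat -c '%u' "$probe_path" 2>/dev/null)" || return 1
    [ "$current_uid" != "$runtime_uid" ]
}
```

An entrypoint could then guard the expensive step with something like `needs_build_tree_chown /opt/hermes/.venv "$(id -u hermes)" && chown -R hermes:hermes /opt/hermes/.venv /opt/hermes/ui-tui`, where the exact directory list and owner spec are assumptions based on the paths named in the commit message.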


The WhatsApp setup wizard (hermes_cli/main.py, setup_whatsapp) runs `npm install` in /opt/hermes/scripts/whatsapp-bridge at first use to build the Node bridge's node_modules. That dir ships owned by the build UID (10000) via `COPY --chown=hermes:hermes . .`, and stage2-hook.sh's build-tree re-chown after a HERMES_UID/PUID remap covers .venv, ui-tui, gateway and node_modules but not scripts/whatsapp-bridge, so the install fails with `EACCES: permission denied, mkdir '/opt/hermes/scripts/whatsapp-bridge/node_modules'`. Add scripts/whatsapp-bridge to the re-chown set so the wizard's npm install can create node_modules as the remapped hermes user. Same class of EACCES already fixed for .venv (NousResearch#15012, NousResearch#21100), ui-tui (NousResearch#28851) and gateway (NousResearch#27221).
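The idea of a "re-chown set" that grows by one directory can be sketched as a loop over a list of build trees. This is an illustrative sketch only: `rechown_build_trees` is a hypothetical name, and the real stage2-hook.sh may structure this differently.

```shell
#!/bin/sh
# Hypothetical helper (not taken from stage2-hook.sh).
# Recursively chowns every listed tree that exists, skips absent ones,
# and prints each path it actually touched.
rechown_build_trees() {
    owner="$1"
    shift
    for tree in "$@"; do
        # A tree may be absent in some image variants; skip it quietly.
        [ -e "$tree" ] || continue
        chown -R "$owner" "$tree" || return 1
        echo "$tree"
    done
}
```

Under this shape, covering the WhatsApp bridge would amount to appending `/opt/hermes/scripts/whatsapp-bridge` to the argument list alongside the .venv, ui-tui, gateway and node_modules paths (the exact locations of those other trees are assumed here).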

fix: tui needs write permission when not running as root

e55e1d1

deas force-pushed the fix-tui-permission branch from 055b0fd to e55e1d1 on May 19, 2026 17:22

alt-glitch added the labels type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), area/docker (Docker image, Compose, packaging), and comp/tui (Terminal UI: ui-tui/ + tui_gateway/) May 19, 2026

deas changed the title ~~fix: tui needs write permission when not running as root~~ fix: tui in container needs write permission when not running as root May 20, 2026

benbarclay mentioned this pull request May 27, 2026

fix(docker): chown ui-tui and node_modules on UID remap so TUI esbuild works (#28851) #33045

Merged

benbarclay closed this in #33045 May 27, 2026

benbarclay mentioned this pull request May 27, 2026

feat(docker): upgrade Node to 22 LTS via multi-stage from node:22-bookworm-slim (#4977) #33060

Merged

benbarclay mentioned this pull request May 27, 2026

fix(docker): mkdir HERMES_HOME as root in stage2 before chown / privilege drop (#18488) #33078

Merged

benbarclay mentioned this pull request Jun 4, 2026

fix(docker): chown build trees on UID remap independently of $HERMES_HOME (#35027 regression) #38556

Merged

juniorbra mentioned this pull request Jun 11, 2026

fix(docker): re-chown scripts/whatsapp-bridge on UID remap so WhatsApp setup's npm install works #44487

Open


fix: tui in container needs write permission when not running as root #28851

deas wants to merge 1 commit into NousResearch:main from deas:fix-tui-permission


benbarclay commented May 27, 2026


3 participants
