convert install.ps1 to new bootstrap protocol by jquesnelle · Pull Request #27224 · NousResearch/hermes-agent

jquesnelle · 2026-05-17T02:18:39Z

Adds an opt-in stage protocol that lets programmatic drivers (the desktop GUI's onboarding wizard, CI, future install.sh parity) drive install.ps1 one step at a time with structured JSON results. Default invocation (irm | iex one-liner) behaves unchanged.

Entry points

    | Flag             | Purpose                                                             |
    |------------------|---------------------------------------------------------------------|
    | install.ps1      | Today's interactive install (unchanged)                             |
    | -ProtocolVersion | Emit protocol version integer                                       |
    | -Manifest        | Emit JSON manifest of available stages                              |
    | -Stage <name>    | Run one stage, emit JSON result                                     |
    | -NonInteractive  | Suppress Read-Host prompts (skips setup wizard + gateway autostart) |
    | -Json            | Machine-readable completion frame                                   |

Manifest exposes 14 stages across prereqs/install/finalize/post-install categories. Two stages (configure, gateway) flag needs_user_input=true so GUI drivers can skip them and handle the equivalent UX themselves.
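
For illustration, a driver could turn the manifest into a run plan roughly like this. It is a minimal Python sketch, not code from the PR: the `stages` and `name` field names and the sample manifest are assumptions (the description only confirms the `needs_user_input` flag), and the command line mirrors the `powershell -File install.ps1 -Stage X` form used for validation.

```python
import json


def plan_stage_commands(manifest_json, script="install.ps1"):
    """Return one powershell argv per stage the driver should run itself."""
    manifest = json.loads(manifest_json)
    commands = []
    for stage in manifest["stages"]:        # "stages" key is an assumption
        if stage.get("needs_user_input"):   # flag named in the PR description
            continue                        # GUI handles the equivalent UX itself
        # "name" key is an assumption about the manifest schema.
        commands.append(["powershell", "-File", script, "-Stage", stage["name"]])
    return commands


# Hypothetical manifest; the real one lists 14 stages.
SAMPLE_MANIFEST = json.dumps({
    "stages": [
        {"name": "uv", "needs_user_input": False},
        {"name": "venv", "needs_user_input": False},
        {"name": "configure", "needs_user_input": True},
        {"name": "gateway", "needs_user_input": True},
    ]
})
```

With the sample above, `plan_stage_commands(SAMPLE_MANIFEST)` yields commands for `uv` and `venv` only; `configure` and `gateway` are left to the driver's own UI.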

Along the way, clean-VM testing on stock Windows 10/11 surfaced a series of latent install.ps1 bugs that were never exercised by developer machines. Fixed in the same commit:

* Encoding: file is now pure ASCII with no BOM. Windows PowerShell 5.1 reads BOM-less files as Windows-1252 and chokes on em-dashes (and other UTF-8 sequences), while iex chokes on a leading U+FEFF. Pure-ASCII satisfies both invocation paths.
* EAP=Stop + native `2>&1` captures: PowerShell wraps stderr lines from native commands as ErrorRecord objects under EAP=Stop and throws even when the command exits 0. Relaxed to EAP=Continue around the astral.sh uv installer, `uv python install`, `npm install`, `npx playwright install`, the venv import probes, and the Node winget fallback. Check $LASTEXITCODE for the real signal.
* Cross-process state: each `-Stage <name>` invocation spawns a fresh powershell child. $script:UvCmd set by Stage-Uv was invisible to Stage-Python; PATH updated by Stage-Git/Stage-Node was invisible to subsequent stages spawned by the driver shell. Added Resolve-UvCmd helper called at the top of every stage that needs uv, and a Sync-EnvPath helper called at the top of Invoke-Stage to refresh PATH from the registry.
* UAC avoidance: `winget install OpenJS.NodeJS.LTS` triggers a UAC prompt that often appears minimized in the taskbar -- looks like a hang. Switched Test-Node to prefer the official portable Node zip dropped into %LOCALAPPDATA%\hermes\node\ (mirrors the PortableGit pattern Install-Git already uses). winget kept as fallback.
* npx hangs on confirmation: `npx playwright install chromium` blocks on stdin waiting for "Need to install playwright@X.Y.Z (y/N)" when playwright isn't in local node_modules. Tee-Object pipelines disconnect stdin from the user's TTY so the install hangs forever. Pass `--yes` to auto-accept.
* Silent long-running installs: `*> $logPath` redirected every stream to disk and left the user staring at a frozen "Installing..." line for the 5-10 minutes Playwright Chromium takes to download. Switched to `2>&1 | ForEach-Object { "$_" } | Tee-Object -FilePath $log` so output streams live to the console AND captures to log for failure diagnostics. ForEach-Object coercion strips PowerShell's red NativeCommandError formatter from stderr items.
* Console encoding: forced [Console]::OutputEncoding to UTF-8 so playwright/git/npm progress bars, box-drawing, and check marks render correctly instead of as IBM437/Windows-1252 mojibake.
* Performance: set $ProgressPreference = "SilentlyContinue" so Invoke-WebRequest doesn't paint its per-chunk progress bar. The PS 5.1 progress UI throttles downloads by 10-100x (a 57MB PortableGit grab takes 5 minutes with the bar on vs ~20 seconds with it off, same network). Affects PortableGit, Node portable zip, and the Hermes repo zip fallback.

Tests: scripts/tests/test-install-ps1-stage-protocol.ps1 provides 19 metadata-only assertions covering -ProtocolVersion, -Manifest schema, and unknown -Stage error frame. No install side effects.

End-to-end validated on a clean Windows 10 VM via:

1. `irm <branch>/scripts/install.ps1 | iex` (canonical CLI path)
2. `powershell -File install.ps1 -Stage X` iterated through every stage (GUI driver path, exercises cross-process fixes)
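
The "stream live, keep a log, trust the exit code" idea from the EAP and Tee-Object items carries over to any driver that spawns native tools. The following is a rough Python analogue, not the PowerShell used in install.ps1; the function name `run_and_tee` and the log path are hypothetical.

```python
import subprocess
import sys


def run_and_tee(argv, log_path):
    """Stream a child's combined stdout/stderr to the console and a log file.

    Success is decided by the exit code only. Text on stderr (for example
    git's routine informational lines) is never treated as a failure.
    """
    with open(log_path, "w", encoding="utf-8") as log:
        proc = subprocess.Popen(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,   # merge stderr, like 2>&1
            text=True,
            encoding="utf-8",
            errors="replace",
        )
        for line in proc.stdout:
            sys.stdout.write(line)      # live output for the user
            sys.stdout.flush()
            log.write(line)             # same line kept for diagnostics
        returncode = proc.wait()
    return returncode == 0
```

A command that writes to stderr but exits 0 is reported as a success here, which is the same signal the commit gets from checking `$LASTEXITCODE`.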

Copilot

Pull request overview

Adds an opt-in "stage protocol" to scripts/install.ps1 so programmatic drivers (the desktop GUI's onboarding wizard, CI, future install.sh) can drive the Windows installer one step at a time and receive structured JSON results, while the default interactive irm | iex flow is unchanged. The PR also replaces non-ASCII glyphs in console output with ASCII equivalents (for PS 5.1 parser/codepage robustness), forces UTF-8 console output for child commands, silences Invoke-WebRequest progress bars, demotes the winget Node install behind the portable-zip path, and adds several $ErrorActionPreference relaxations around native commands whose stderr would otherwise be wrapped as terminating errors.

Changes:

New stage protocol surface in install.ps1: -ProtocolVersion, -Manifest, -Stage <name>, -NonInteractive, -Json, plus a 14-stage $InstallStages table, per-stage workers, Invoke-Stage/Invoke-AllStages, and Resolve-UvCmd/Sync-EnvPath helpers for cross-process driver mode.
Operational hardening: UTF-8 console encoding, $ProgressPreference = SilentlyContinue, EAP relaxations around uv, winget, python -c, npm install, and playwright install, Tee-Object live output for npm/playwright, persisted User-PATH entry for the portable Node install.
New PowerShell smoke test (scripts/tests/test-install-ps1-stage-protocol.ps1) covering -ProtocolVersion, -Manifest shape and required stage names, and unknown-stage error framing.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

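    | File                                              | Description                                                                                                                                                                                       |
    |---------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
    | scripts/install.ps1                               | Adds the stage-protocol parameters, dispatch block, stage table/workers, helper functions, and various EAP/UTF-8/Tee operational changes; replaces non-ASCII banner/log glyphs with ASCII.       |
    | scripts/tests/test-install-ps1-stage-protocol.ps1 | New metadata-only smoke test that runs `-ProtocolVersion`, `-Manifest`, and an unknown `-Stage` and asserts exit codes plus JSON shape.                                                            |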


+    } catch {
+        $result.ok = $false
+        $result.reason = "$_"
+        throw
+    } finally {
+        $result.duration_ms = [int]([DateTime]::UtcNow - $start).TotalMilliseconds
+        if ($Json -or $Stage) {
+            # In stage-driver mode every stage emits a JSON line so the
+            # caller can stream progress.  In default interactive mode we
+            # stay silent here (the worker already wrote human output).
+            $result | ConvertTo-Json -Compress | Write-Output
+        }
+    }


+    if ($Json -or $Stage) {
+        # Stage-driver mode: caller wants JSON they can parse.  Emit a
+        # structured error frame and exit non-zero.
+        $err = @{
+            ok     = $false
+            stage  = if ($Stage) { $Stage } else { $null }
+            reason = "$_"
+        }
+        $err | ConvertTo-Json -Compress | Write-Output
+        exit 1
+    }
+


+        # Restore EAP in case the try block threw before the assignment
+        if ($prevEAP) { $ErrorActionPreference = $prevEAP }
+        Write-Err "Failed to install uv: $_"


alt-glitch

Inline comments on the two bugs found during Windows E2E testing.

Three issues flagged by the Copilot review on this PR:

1. Double JSON emit on stage failure (Copilot #1, #2). When -Stage <name> ran a worker that threw, Invoke-Stage's finally emitted a JSON result frame AND the entry-point catch emitted a second error frame -- producing two concatenated JSON objects on stdout and breaking the one-line-per-invocation contract that drivers parse against. Same issue applied to -Json mode on a full install (every stage's finally plus a final error frame missing duration_ms/skipped). Fix: Invoke-Stage's finally now sets $script:_StageEmittedErrorFrame when it emits a failure frame; the entry-point catch checks the flag and skips its own emit, still exit 1.
2. $prevEAP uninitialized on early try-block throw (Copilot #3). In Install-Uv, Test-Python, Test-Node's winget fallback, _Run-NpmInstall, and the playwright block, '$prevEAP = $ErrorActionPreference' lived as the first statement INSIDE the try. If anything between 'try {' and that line threw (Write-Info on an unusual host, the npx-finding loop, etc.), the catch's 'if ($prevEAP) { ... }' restore was a no-op and EAP could remain relaxed. Fix: hoist '$prevEAP = $ErrorActionPreference' to the line immediately before 'try {' in all five sites. Catch's restore is now always meaningful regardless of where in the try the throw originated.

No change to Invoke-Stage's success path or to the four lint-clean EAP sites (Test-Node was the only winget-related catch). All 19 metadata smoke tests still pass.
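
The $prevEAP hoist follows a general pattern: capture the old value before entering the protected block, so the restore is valid wherever the failure happens. Below is a small Python analogue of that pattern; the `settings` dictionary and the `relaxed_error_action` name are stand-ins for illustration, not anything in install.ps1.

```python
from contextlib import contextmanager


@contextmanager
def relaxed_error_action(settings):
    """Temporarily relax settings["error_action"], always restoring it."""
    # Captured BEFORE the try, mirroring the hoisted
    # '$prevEAP = $ErrorActionPreference' line in the fix.
    prev = settings["error_action"]
    try:
        settings["error_action"] = "Continue"
        yield settings
    finally:
        # Runs on success and on any exception, and 'prev' is
        # guaranteed to be bound here.
        settings["error_action"] = prev
```

Had `prev` been assigned inside the `try`, an exception raised before that assignment would leave the cleanup with nothing to restore, which is the failure mode the review comment describes.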

alt-glitch

🐛 Request Changes: Double JSON on failed stages

Tested on Windows 11 PS 5.1.26100 via SSH. The stage protocol metadata surface is clean (19/19 smoke tests pass, cross-process driver works for uv/python/git/node), but there's a protocol contract bug that will bite programmatic drivers:

When a stage fails, the caller gets two JSON lines instead of one:

{"skipped":false,"ok":false,"reason":"Cannot find path ...","duration_ms":46,"stage":"venv"}
{"ok":false,"reason":"Cannot find path ...","stage":"venv"}

Root cause: Invoke-Stage emits JSON in its finally block (line 1987), then re-throws (line 1980). The outer catch at line 2062 also emits a JSON error frame. Both fire for the same failure.

Suggested fix — suppress the re-throw when running a single stage, since the JSON frame from finally is already the structured result:

# In Invoke-Stage, line 1980:
    } catch {
        $result.ok = $false
        $result.reason = "$_"
        if (-not $Stage) { throw }  # Only re-throw in full-install mode
    } finally {

The full-install path (Main → Invoke-AllStages) still needs the re-throw so the outer catch can surface the error to the user. Single-stage mode (-Stage venv) doesn't — the JSON frame with ok=false is the contract, and the exit code from the finally→catch chain already propagates correctly.
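
To make the contract concrete from the driver's side, here is a hedged Python sketch of a parser that enforces "exactly one JSON frame per -Stage invocation". The function names are hypothetical, and it assumes human-readable lines may be interleaved with the frame on stdout.

```python
import json


def parse_stage_frames(stdout_text):
    """Return every JSON object found on its own line of a stage's stdout."""
    frames = []
    for line in stdout_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue                    # human-readable output, not a frame
        try:
            frame = json.loads(line)
        except json.JSONDecodeError:
            continue                    # looked like JSON but was not
        if isinstance(frame, dict):
            frames.append(frame)
    return frames


def single_result(stdout_text):
    """Return the one result frame, or raise if the contract is broken."""
    frames = parse_stage_frames(stdout_text)
    if len(frames) != 1:
        raise ValueError(f"expected exactly one JSON frame, got {len(frames)}")
    return frames[0]
```

Fed the two-line output shown above, `single_result` raises, which is how a strict driver would notice the duplicate frame.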

Also two cosmetic nits (not blocking):

Completion banner line 1756 is 62 chars wide vs 59-char borders ([OK] is 3 wider than ✓, trailing spaces not adjusted)
Test file line 8 has an em-dash (—) — only non-ASCII byte across both files

alt-glitch

Adversarial review round 1 — two new bugs found (in addition to the double-JSON already flagged).

Consolidating into single review

alt-glitch

Review — PR #27224: install.ps1 stage protocol + Windows clean-VM hardening

Tested on real Windows 11 PS 5.1.26100 via SSH (sidbin@vespyr), plus two parallel adversarial Claude Code Sonnet reviewers with independent focus areas.

Testing performed

    | Test                                                                             | Result |
    |----------------------------------------------------------------------------------|--------|
    | Smoke tests (19 assertions: -ProtocolVersion, -Manifest schema, unknown -Stage)  | Pass   |
    | Cross-process stage driver (uv/python/git/node as separate PS processes)         | Pass   |
    | `-NonInteractive` suppresses Read-Host (configure stage returns immediately)     | Pass   |
    | Encoding: pure ASCII, no BOM (main had 1,409 non-ASCII bytes)                    | Pass   |
    | Existing pytest suite (`test_windows_native_support.py`, 58 tests)               | Pass   |
    | Failed stage JSON framing                                                        | Bug    |
    | Stage-Node success reporting                                                     | Bug    |
    | Empty `-Stage ""` dispatch                                                       | Bug    |

What's good

The stage protocol design is clean — single source of truth, thin stage workers, Resolve-UvCmd + Sync-EnvPath for cross-process state. The hardening fixes (EAP guards, ProgressPreference 10-100x speedup, portable Node over winget UAC, npx --yes, Tee-Object streaming, console encoding) are all correct and well-documented. The commit message is excellent — every fix traced to a specific clean-VM failure.

Bugs — see inline comments

Double JSON on failed stages (HIGH) — line 1980
Stage-Node reports ok=true when Node install fails (MEDIUM) — line 1916
-Stage "" runs full install instead of erroring (MEDIUM) — line 2043
Completion banner misalignment (LOW) — line 1756

Adversarial findings rejected (verified false on PS 5.1)

ConvertTo-Json -Compress multi-line → single-line confirmed
Move-Item fails with spaces → works fine
PATH duplication in Sync-EnvPath → cosmetic, safe
Concurrent stage PATH race → stages are sequential by design

Ready to ship once bugs 1-3 are addressed.

alt-glitch · 2026-05-17T05:37:09Z

+    } catch {
+        $result.ok = $false
+        $result.reason = "$_"
+        throw


Bug 1: Double JSON on failed stages.

Invoke-Stage emits JSON in the finally block (line 1987), then re-throws here. The outer catch at line 2062 also emits a JSON error frame. Drivers get two lines for one failure:

{"skipped":false,"ok":false,"reason":"...","duration_ms":46,"stage":"venv"}
{"ok":false,"reason":"...","stage":"venv"}

Reproduced on PS 5.1.26100 with -Stage venv (no repo at default path).

Fix — suppress re-throw in single-stage mode:

if (-not $Stage) { throw } # Only re-throw in full-install mode

alt-glitch · 2026-05-17T05:37:09Z

+function Stage-Uv               { if (-not (Install-Uv))     { throw "uv installation failed" } }
+function Stage-Python           { Resolve-UvCmd; if (-not (Test-Python))    { throw "Python $PythonVersion not available" } }
+function Stage-Git              { if (-not (Install-Git))    { throw "Git not available and auto-install failed -- install from https://git-scm.com/download/win then re-run" } }
+function Stage-Node             { [void](Test-Node) }


Bug 2: Stage-Node always reports ok=true even when Node install fails.

Test-Node returns $true unconditionally (line 684). Stage-Node does [void](Test-Node) — never throws. In stage-driver mode the JSON says ok=true when Node isn't actually installed.

The default-install path works because $script:HasNode gates downstream behavior, but the stage protocol contract lies to the GUI driver.

Fix — surface the failure:

function Stage-Node { [void](Test-Node); if (-not $script:HasNode) { throw "Node.js not available (optional — browser tools disabled)" } }

Or if the GUI should treat this as a soft skip, set $result.skipped = $true instead of throwing.

alt-glitch · 2026-05-17T05:37:10Z

+        exit 0
+    }
+
+    if ($Stage) {


Bug 3: -Stage "" silently runs a full install.

PS treats "" as falsy, so if ($Stage) is false and dispatch falls through to Main. A GUI driver passing empty string gets an entire interactive install instead of an error.

Verified on PS 5.1.26100: "" = falsy, " " = truthy.

Fix — use $PSBoundParameters instead of truthy test:

    if ($PSBoundParameters.ContainsKey('Stage')) {
        if ([string]::IsNullOrWhiteSpace($Stage)) {
            @{ ok = $false; stage = $Stage; reason = "Stage name cannot be empty." } | ConvertTo-Json -Compress | Write-Output
            exit 2
        }
        # ... existing dispatch
    }
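
The same "was it passed at all?" versus "is it truthy?" distinction exists in most CLI parsers. As a Python analogue only (argparse stands in for PowerShell parameter binding, and the `--stage` flag and return strings are made up for the example):

```python
import argparse


def dispatch_mode(argv):
    """Decide what to run, distinguishing 'not passed' from 'passed empty'."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--stage", default=None)
    args = parser.parse_args(argv)

    if args.stage is None:
        # Flag absent: the default full install is what the user asked for.
        return "full-install"
    if not args.stage.strip():
        # Flag present but empty or whitespace: an error, never a fallback.
        return "error: stage name cannot be empty"
    return f"stage:{args.stage}"
```

A plain `if args.stage:` check would send `--stage ""` down the full-install branch, which is the bug described above.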

alt-glitch · 2026-05-17T05:37:10Z

-    Write-Host "│              ✓ Installation Complete!                   │" -ForegroundColor Green
-    Write-Host "└─────────────────────────────────────────────────────────┘" -ForegroundColor Green
+    Write-Host "+---------------------------------------------------------+" -ForegroundColor Green
+    Write-Host "|              [OK] Installation Complete!                   |" -ForegroundColor Green


Nit: banner misalignment. [OK] is 3 chars wider than ✓ but trailing spaces weren't reduced. This line is 62 chars, borders are 59.

+---------------------------------------------------------+ (59)
|              [OK] Installation Complete!                   | (62)
+---------------------------------------------------------+ (59)
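
One way to keep this class of drift from recurring is to derive the padding from the border width instead of hand-counting spaces. The sketch below shows the arithmetic in Python; `banner_line` is a hypothetical helper, not a function in install.ps1.

```python
BORDER = "+" + "-" * 57 + "+"   # 59 characters, matching the banner borders


def banner_line(text, indent=14, width=len(BORDER)):
    """Return '|<indent><text><padding>|' exactly as wide as the border."""
    inner = width - 2                       # room between the two '|' characters
    body = " " * indent + text
    if len(body) > inner:
        raise ValueError("banner text does not fit inside the border")
    return "|" + body.ljust(inner) + "|"    # pad on the right to fill the box
```

Because the trailing padding is computed, swapping a 1-character check mark for the 4-character `[OK]` changes the padding automatically and the line stays 59 wide.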

Address the two cosmetic items from review:

- Completion banner middle line was 62 chars vs 59-char top/bottom borders (replacing the 1-char checkmark with [OK] added width that wasn't reflected in the trailing whitespace). Drop 3 trailing spaces.
- Smoke test file had a single em-dash in a comment -- the only non-ASCII byte across both files. Replace with -- for consistency with install.ps1's pure-ASCII goal.

Bug #1 (double JSON) was already addressed in this PR via the $script:_StageEmittedErrorFrame guard at lines 2005-2007 and 2091 — reviewer was looking at an older revision. Bug #2 (banner width) and Nit #1 (em-dash in test) just fixed in 9eb9bee.

github-actions · 2026-05-17T05:53:18Z

🔎 Lint report: `jq/install-ps1-stage-protocol` vs `origin/main`

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 8350 on HEAD, 8350 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 4366 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

Two protocol-correctness gaps from review:

1. Stage-Node used [void](Test-Node) which discarded Test-Node's return value, so the JSON frame always reported ok=true even when Node install fully failed. A GUI driver consuming the manifest couldn't tell 'node ready' from 'node missing'. Wire a soft-skip channel ($script:_StageSkippedReason) that workers can populate to surface 'ran, but the thing it was supposed to set up is not available' as skipped=true with a reason in the JSON, without aborting the install (Node is optional -- browser tools degrade gracefully, matches Write-Completion's existing 'Note: Node.js could not be installed' behavior). Reset before each stage so a prior reason can't leak.
2. The -Stage dispatch used 'if ($Stage)' which is falsy for empty string, so 'install.ps1 -Stage ""' fell through to Main and silently kicked off a full destructive install. Switch to PSBoundParameters.ContainsKey('Stage') so an explicit empty value surfaces as unknown-stage exit 2 with a structured JSON frame, the way every other bad stage name does.
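
With the soft-skip channel in place, a driver can sort result frames into three buckets. This is a sketch under assumptions: the field names `ok`, `skipped`, and `reason` come from the frames shown in this PR, but the idea that a soft-skipped stage still reports `ok=true` is a guess, and `classify_frame` is a hypothetical helper.

```python
def classify_frame(frame):
    """Map a stage result frame to (status, reason).

    status is one of "failed", "skipped", or "done".
    """
    if not frame.get("ok", False):
        return "failed", frame.get("reason")
    if frame.get("skipped", False):
        # Assumption: a soft skip keeps ok=true and explains itself in "reason".
        return "skipped", frame.get("reason")
    return "done", None
```

For example, a Node frame with `skipped=true` would let the GUI show "browser tools disabled" without treating the whole install as failed.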


…sh,ps1}

Eliminates 687 lines of duplicated browser bootstrap code by routing all bootstrap paths through dep_ensure.py -> install.{sh,ps1} --ensure.

install.sh:
- New ensure_browser() with agent-browser + camofox install, system browser detection + .env writing, per-distro Playwright deps (apt/arch/fedora/suse)
- macOS app-bundle paths added to find_system_browser()
- configure_browser_env_from_system_browser() creates .env if missing
- postinstall_mode() uses ensure_browser() instead of inline duplication

install.ps1:
- New -Ensure and -PostInstall params (coexists with stage protocol)
- New functions: Resolve-NpmCmd, Resolve-NpxCmd, Find-SystemBrowser, Write-BrowserEnv, Install-AgentBrowser (with -SkipPlaywright)
- Invoke-EnsureMode dispatches node/browser/ripgrep/ffmpeg
- Invoke-PostInstallMode runs full post-pip-install bootstrap
- ErrorActionPreference guards on all native command calls
- ASCII-only convention maintained (no Unicode)
- Mutual exclusion guard: -Ensure + -Stage = error

dep_ensure.py:
- Windows-aware: _IS_WINDOWS, _find_install_script returns (path, shell) tuple
- PowerShell invocation with powershell/pwsh guard + -ExecutionPolicy Bypass
- _has_hermes_agent_browser() checks platform-correct paths
- _has_system_browser() checks Windows browser names (chrome, msedge, chromium)
- env_extra parameter for forwarding install flags

config.py:
- stamp_install_method() writes ~/.hermes/.install_method
- detect_install_method() checks stamp first (before heuristics)

acp_adapter:
- _run_setup_browser() rewritten: ensure_dependency('node') + ensure_dependency('browser')
- acp_adapter/bootstrap/ deleted (399 + 288 lines)

Rebased onto main -- drops #26620 dependency (upstream stage protocol merged via #27224). Closes follow-up from #26593.

The canonical install flow

    irm https://raw.githubusercontent.com/.../scripts/install.ps1 | iex

fails on PowerShell 5.1 with a cascade of 'The assignment expression is not valid' errors at every param() default value:

    [string]$Branch = 'main',
                      ~~~~~~
    The assignment expression is not valid. The input to an assignment operator must be an object that is able to accept assignments...

Root cause: scripts/install.ps1 carries a UTF-8 BOM (0xEF 0xBB 0xBF) as its first three bytes. 'irm' returns the response body as a string; on PS 5.1 the BOM survives into that string as a leading \ufeff character. 'iex' then evaluates the string and PS's parser chokes on the invisible character before param() -- error recovery proceeds into the body but every assignment is reported as broken.

This was the exact failure mode the install.ps1 hardening pass (PR #27224) deliberately fixed by stripping the BOM and ensuring the file body is pure ASCII. Commit 4279da4 ('fix(windows): make PowerShell installer parse in 5.1') re-introduced the BOM later, unintentionally undoing the irm|iex compatibility fix; the merge that brought it into bb/gui carried it forward.

Fix: strip the three BOM bytes. File body is verified pure ASCII (any-byte > 127 returns false), so PS 5.1 with no BOM falls back to Windows-1252 decoding which is identical to ASCII for our content. Both install paths now work:

- 'irm ... | iex' (canonical CLI)
- 'powershell -File install.ps1' (programmatic / desktop bootstrap)
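
A regression check for this could be as small as the following Python sketch, which flags a leading UTF-8 BOM and counts bytes above 127. The helper names are hypothetical, and how the check would be wired into CI is left open.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def check_script_bytes(data):
    """Report whether raw file bytes start with a UTF-8 BOM or contain non-ASCII."""
    return {
        "has_bom": data.startswith(UTF8_BOM),
        "non_ascii_bytes": sum(1 for byte in data if byte > 127),
    }


def strip_bom(data):
    """Return the bytes without a leading UTF-8 BOM, if one is present."""
    return data[len(UTF8_BOM):] if data.startswith(UTF8_BOM) else data
```

Typical use would be `check_script_bytes(open("scripts/install.ps1", "rb").read())` and failing the build unless the result is `{"has_bom": False, "non_ascii_bytes": 0}`. For pure-ASCII content, decoding as Windows-1252 and as ASCII gives the same string, which is why the BOM-less file is safe on PS 5.1.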

… ops

Three install.ps1 improvements pulled from the thin-installer work on bb/gui (PR NousResearch#27822) that benefit the canonical CLI install flow on main:

1. Strip UTF-8 BOM from scripts/install.ps1. The canonical 'irm <raw URL> | iex' install flow has been broken since commit 4279da4 re-introduced a UTF-8 BOM that PR NousResearch#27224 had explicitly stripped. PowerShell 5.1's 'irm' returns the response body as a string with the BOM surviving as a leading \ufeff character; 'iex' then evaluates that string and the parser chokes on the invisible character before param(), surfacing as a cascade of 'The assignment expression is not valid' errors at every param default value. File body is verified pure ASCII (no character above byte 127), so PS 5.1 with no BOM falls back to Windows-1252 decoding which is identical to ASCII for our content. Both install paths work:
   - 'irm ... | iex' (canonical one-liner)
   - 'powershell -File install.ps1' (programmatic / desktop bootstrap)

2. New -Commit and -Tag string params for reproducible pinning. Higher-precedence variants of -Branch. When set, the repository stage clones $Branch (fast partial fetch) and then 'git checkout's the exact ref. Precedence: Commit > Tag > Branch. Honoured by all three code paths:
   - Update path (existing valid checkout): fetch + checkout --detach <commit|tag> instead of checkout + pull.
   - Fresh clone: clone --branch $Branch, then post-clone 'git checkout --detach' to the requested ref.
   - ZIP fallback: pick archive URL for the most-specific ref (commit -> archive/<sha>.zip, tag -> archive/refs/tags/<tag>.zip, else archive/refs/heads/<branch>.zip).

   Used by the Hermes desktop's first-launch bootstrap to pin the .exe to the exact commit it was built against, so the cloned Hermes Agent tree always matches what the .exe was tested with. Also enables release-bundle pinning (e.g. Microsoft Store builds pinning to a release tag) and CI reproducibility.

3. EAP=Continue wrap around the new pin-step git invocations. 'git fetch origin <commit>' writes the routine 'From <url>' info line to stderr. Under the script's global $ErrorActionPreference = 'Stop' that stderr line is wrapped as an ErrorRecord and terminates the script even though fetch+checkout actually succeed. Same EAP=Stop + native-stderr footgun we hit during the install.ps1 hardening pass in Install-Uv, Test-Python, _Run-NpmInstall. Wrap both the update-path fetch/checkout block AND the post-clone pin block in $ErrorActionPreference = 'Continue' (restored in finally). Real failures still caught by $LASTEXITCODE checks.
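
The Commit > Tag > Branch precedence and the ZIP-fallback path selection can be summarized in a few lines. This Python sketch only restates the rules from the commit message; `resolve_ref` and `archive_path` are hypothetical names, and the repository base URL is deliberately left out.

```python
def resolve_ref(branch="main", tag=None, commit=None):
    """Pick the most specific ref: commit beats tag, tag beats branch."""
    if commit:
        return "commit", commit
    if tag:
        return "tag", tag
    return "branch", branch


def archive_path(branch="main", tag=None, commit=None):
    """Return the archive path (relative to the repo URL) for the ZIP fallback."""
    kind, ref = resolve_ref(branch=branch, tag=tag, commit=commit)
    if kind == "commit":
        return f"archive/{ref}.zip"
    if kind == "tag":
        return f"archive/refs/tags/{ref}.zip"
    return f"archive/refs/heads/{ref}.zip"
```

Passing both a tag and a commit resolves to the commit, so a desktop build pinned to an exact SHA is not affected by a tag that moves later.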

* fix(acp): treat polished tool error payloads as failed

* fix(acp): also mark raised-exception tool results as failed

  Extends #26573 to also catch the case the original PR deliberately left out: when a tool raises an exception, the agent's tool executor wraps it in a canonical 'Error executing tool '<name>': ...' string prefix (see agent/tool_executor.py around the try/except). That prefix is unique to the wrapper and cannot legitimately appear in well-behaved tool output, so it is a safe signal that the tool blew up. Without this, the canonical 'tool raised' case still rendered as a green 'completed' row in Zed despite being a runtime failure -- exactly the class of bug #26573 set out to fix. Adds a positive test (raised-exception prefix -> failed) and a negative test (bare 'Error:' word in legit tool output stays completed) so a future contributor doesn't accidentally widen the rule to false-positive on compiler/linter diagnostics.

* fix(acp): refresh session info after auto-title

* fix(acp): use refresh moment as updated_at on session info push

  Follow-up to #26543. The sessions table does not have an updated_at column (see hermes_state.py -- only started_at/ended_at), so row.get('updated_at') always returned None and the str() coercion was dead code. Use datetime.now(UTC).isoformat() instead, which reflects exactly what the field means here: 'the title was refreshed at this moment'. Drop the dead coercion.

* feat(acp): enrich permission request cards

* feat(web): mobile dashboard UX polish (#28127)

  * feat(web): mobile dashboard UX polish

    Bottom sheets for sidebar theme/language pickers on narrow viewports with enter/exit animation and drag-to-close; inline header badges beside titles; bottom padding on the route outlet for scroll clearance; profiles loading uses a unicode braille spinner; align profile/cron card actions to the top; viewport-fit cover and supporting layout tweaks across dashboard pages.

    Co-authored-by: Cursor <cursoragent@cursor.com>

  * Fix Nix web npm hash and mobile sheet accessibility.

    Align fetchNpmDeps in nix/web.nix with web/package-lock.json for CI. Improve BottomPickSheet backdrop labeling, avoid aria-hidden on the dialog during exit animation, and wire theme/language sheets with listbox semantics and localized dismiss labels.

    Co-authored-by: Cursor <cursoragent@cursor.com>

  ---------

  Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(install.ps1): strip BOM, add -Commit/-Tag pin params, harden git ops

* fix: add default base_url_override for ollama-cloud provider

* chore(release): add AUTHOR_MAP entry for falasi

* feat(cli): add /update slash command to CLI and TUI (#23854)

  * feat: add /update slash command to CLI and TUI

  * test(cli): add Python tests for /update slash command

    Co-authored-by: Cursor <cursoragent@cursor.com>

  * fix(cli): address Copilot review for /update slash command

    Route classic CLI /update through prompt_toolkit modal confirmation and defer relaunch to the main-thread cleanup path after app.exit(). Tighten Y/n semantics, add Python wrapper and catalog coverage tests, and assert /update stays visible in the TUI command catalog.

    Co-authored-by: Cursor <cursoragent@cursor.com>

  * fix(cli): address review feedback on /update command

    - Replace raw input() with _prompt_text_input_modal in _handle_update_command to avoid EOF/hang/keystroke-leak races with prompt_toolkit's stdin ownership
    - Fix confirmation logic: only proceed on recognized affirmative aliases (y/yes/1/ok); cancel on everything else including empty string, typos, and unrecognized input -- matches all other [Y/n] prompts in the codebase
    - Route relaunch through main-thread shutdown path: set _pending_relaunch and return False from process_command so process_loop triggers app.exit(); run() then calls relaunch() after prompt_toolkit has restored terminal modes and after cleanup -- safe on both POSIX (execvp) and Windows (subprocess+exit)
    - Fix misleading docstring in test_update_command.py: the Vitest only covers the TypeScript slash handler that emits code 42, not the Python wrapper branch that acts on it
    - Rewrite tests to use SimpleNamespace pattern (like test_destructive_slash_confirm) so _prompt_text_input_modal can be stubbed directly
    - Add Python test for _launch_tui exit-code-42 → relaunch branch in main.py

    Agent-Logs-Url: https://github.com/NousResearch/hermes-agent/sessions/f6da68cf-e7b1-4b7a-aed6-3d4b0f523bdb

    Co-authored-by: austinpickett <260188+austinpickett@users.noreply.github.com>

  * fix(cli): polish test fixtures for /update command

    - Remove unused _prompt_text_input from SimpleNamespace stub
    - Use pytest.fail sentinel in managed-install guard test to catch unexpected modal invocations

    Agent-Logs-Url: https://github.com/NousResearch/hermes-agent/sessions/f6da68cf-e7b1-4b7a-aed6-3d4b0f523bdb

    Co-authored-by: austinpickett <260188+austinpickett@users.noreply.github.com>

  * chore: re-trigger CI after Copilot review fixes

    Co-authored-by: Cursor <cursoragent@cursor.com>

  ---------

  Co-authored-by: Cursor <cursoragent@cursor.com>
  Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
  Co-authored-by:
austinpickett <260188+austinpickett@users.noreply.github.com> * feat(skills): add baoyu-article-illustrator skill * feat(skills): adapt baoyu-article-illustrator for Hermes Adapts the upstream baoyu-article-illustrator skill (verbatim-copied in the previous commit) to Hermes' tool ecosystem, matching the pattern used by baoyu-infographic. - Metadata: openclaw → hermes; add author, license, tags, category - Triggering: slash command + CLI flags → natural language - User config: remove EXTEND.md, first-time-setup, preferences-schema - User prompts: AskUserQuestion (batched) → clarify (one at a time) - Image gen: baoyu-imagine → image_generate (describe refs in prompt text) - Platform: drop Windows/PowerShell; Linux/macOS only - File ops: switch to write_file / read_file - Watermark: opt-in per-article instead of EXTEND.md-driven - Add PORT_NOTES.md describing the adaptation and sync procedure Style, palette, and prompt/system.md reference files are verbatim copies and are the sync points with upstream. * fix(skills): align article-illustrator with real Hermes tool capabilities Addresses review feedback on #13193: 1. Reference-image flow no longer assumes write_file/read_file handle binaries. vision_analyze produces a textual description; the binary is optionally copied via terminal (cp/curl). The description is what gets embedded in prompts. 2. image_generate's URL-only return is now explicit. Step 6 downloads the returned URL to local disk via terminal (curl -sSL -o ...), then verifies non-zero size before proceeding. 3. Removed "Please use nano banana pro..." line from prompts/system.md — the backend is user-configured and not agent-selectable, so routing hints in the prompt are misleading. PORT_NOTES.md updated: prompts/system.md is no longer verbatim, and the file-ops/backend-selection rows now reflect Hermes' actual tool surface (write_file/read_file for text, terminal for binaries and URL downloads, vision_analyze for reading images). 
* chore(skills/baoyu-article-illustrator): tighten description, add platforms, regen docs * chore(release): map Jack Yang contributor email Adds the contributor email mapping for Jack Yang (@0xjackyang) so future release-note generation attributes commits correctly. Salvage of #27964 by @0xjackyang. * chore(release): pre-stage AUTHOR_MAP for May 2026 LHF batch group 7 Pre-stages AUTHOR_MAP entries for 5 new contributors whose PRs are being salvaged in the May 2026 low-hanging-fruit batch (group 7). Lands ahead of the per-PR salvage PRs so they don't get blocked by AUTHOR_MAP CI. Contributors: - 02356abc (#28286 — wecom WSMsgType.CLOSING) - burjorjee (#28201 — inline-shell timeout guard) - oseftg (#28168 — natural response ending: emoji + caret) - rudi193-cmd (#28241 — empty credential pool entries) - sadiksaifi (#27982 — kanban horizontal scroll) Per references/batch-pr-salvage-may14-additions.md. * fix(wecom): handle WSMsgType.CLOSING to prevent CPU spin The WeCom adapter's _read_events() loop only handled CLOSE, CLOSED, and ERROR websocket message types. When the server initiates a graceful shutdown, aiohttp returns WSMsgType.CLOSING before the connection is fully closed. This message type was not handled, causing the receive() call to return immediately in a tight loop while self._ws.closed remained False. The result was 100% CPU usage on the asyncio event loop. Add WSMsgType.CLOSING to the set of terminal message types that raise RuntimeError("WeCom websocket closed"), allowing _listen_loop() to enter its normal reconnect backoff path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(auth): treat empty credential pool entries as unauthenticated Fixes #28140 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: include hermes_plugins in gateway.log component filter gateway.log uses a _ComponentFilter that only passes records from loggers starting with ('gateway',). 
Plugin modules are loaded under the hermes_plugins.* namespace, so all plugin log output is silently dropped from gateway.log. This makes plugin registration — which directly affects gateway hooks (pre_gateway_dispatch, transform_llm_output, etc.) — invisible in the gateway-specific log. Operators debugging gateway behavior check gateway.log and see no plugin activity, even when plugins are working correctly. Add 'hermes_plugins' to the gateway component prefixes tuple so plugin log messages appear in gateway.log. Closes #28138 * fix(gateway): align kanban artifact _IMAGE_EXTS with response dispatch _deliver_kanban_artifacts used a broader _IMAGE_EXTS that included .bmp, .tiff, and .svg. These three extensions are absent from the equivalent set in _deliver_media_from_response (line 10661), which intentionally routes them through send_document rather than send_multiple_images (comment near line 10522 notes that Telegram sendPhoto recompresses and rejects non-raster formats). Routing .svg (XML text), .bmp, or .tiff through the photo API causes send_multiple_images to raise on most platforms; the exception is caught and logged as a warning, silently dropping the artifact. Aligning the two sets ensures kanban deliverables with these extensions follow the same send_document path as regular agent responses. No behaviour change for .png/.jpg/.jpeg/.gif/.webp. * fix(process-registry): detach stdin from background subprocesses to prevent keyboard freeze Background process non-PTY path used stdin=subprocess.PIPE unconditionally, creating an orphan pipe that was never written to and never closed. Child processes that read stdin would block indefinitely, competing with the parent's prompt_toolkit event loop for terminal ownership and causing complete keyboard lockout. Change to stdin=subprocess.DEVNULL so children get immediate EOF on stdin reads instead of blocking forever. 
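As a minimal sketch of the stdin change described above (not the process-registry code itself; the helper name `read_stdin_in_child` is hypothetical), the following spawns a child that reads all of stdin. With `stdin=subprocess.DEVNULL` the read returns at once with an empty string instead of blocking on a pipe that nobody writes to or closes.

```python
import subprocess
import sys


def read_stdin_in_child() -> str:
    """Run a child that reads all of stdin and prints repr() of what it got.

    Hypothetical helper for illustration only. stdin is detached with
    DEVNULL, so the child's read hits EOF immediately.
    """
    child_code = "import sys; data = sys.stdin.read(); print(repr(data))"
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        stdin=subprocess.DEVNULL,  # child gets immediate EOF on stdin
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        timeout=30,  # safety net; the child should exit right away
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(read_stdin_in_child())
```

With `stdin=subprocess.PIPE` and no writer, the same child would wait until the parent closed the pipe, which is the hang the commit describes.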
For interactive stdin, the PTY path (which has its own independent PTY via ptyprocess.PtyProcess.spawn) should be used instead. Fixes #17959 * chore(release): alias stale-ID salvage commit for @LifeJiggy (#28317) * fix(process-registry): detach stdin from background subprocesses to prevent keyboard freeze Background process non-PTY path used stdin=subprocess.PIPE unconditionally, creating an orphan pipe that was never written to and never closed. Child processes that read stdin would block indefinitely, competing with the parent's prompt_toolkit event loop for terminal ownership and causing complete keyboard lockout. Change to stdin=subprocess.DEVNULL so children get immediate EOF on stdin reads instead of blocking forever. For interactive stdin, the PTY path (which has its own independent PTY via ptyprocess.PtyProcess.spawn) should be used instead. Fixes #17959 * chore(release): alias stale-ID salvage commit for LifeJiggy PR #28315 was salvaged with a wrong noreply numeric ID (192385615 vs the correct 141562589). The commit on main is correctly authored to LifeJiggy by username, but the noreply email doesn't match AUTHOR_MAP. Adds an alias so release-notes generation maps both forms to the same contributor. --------- Co-authored-by: LifeJiggy <192385615+LifeJiggy@users.noreply.github.com> * fix: elevate plugin discovery failures from debug to warning Plugin discovery exceptions in gateway startup (gateway/run.py) and CLI startup (hermes_cli/main.py) are caught and logged at DEBUG level, making them invisible at the default INFO log level. If any plugin import fails — syntax error, missing dependency, import cycle — operators get zero indication unless they bump the log level to DEBUG. This makes broken plugins appear enabled but silently non-functional. Change both locations to logger.warning() so failures are visible at production log levels. 
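The effect of the log-level change can be sketched as follows. This is an illustration under assumptions, not the code in gateway/run.py or hermes_cli/main.py: `discover_plugins` and the logger name are hypothetical, and plugins are modelled as plain importable module names.

```python
import importlib
import logging
from typing import Iterable, List

logger = logging.getLogger("plugin_discovery_sketch")  # hypothetical logger name


def discover_plugins(module_names: Iterable[str]) -> List[str]:
    """Import each candidate plugin module and return the names that loaded.

    Failures are logged at WARNING so they are visible at the default
    INFO level, rather than at DEBUG where they would be hidden.
    """
    loaded: List[str] = []
    for name in module_names:
        try:
            importlib.import_module(name)
        except Exception as exc:  # syntax error, missing dependency, import cycle
            logger.warning("Plugin %s failed to load: %s", name, exc)
            continue
        loaded.append(name)
    return loaded
```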
Closes #28137 * fix: treat inline-shell timeout guard as timeout * fix(acp): resolve /tmp symlink before workspace auto-approve check on macOS Path.resolve() follows the /tmp -> /private/tmp symlink on macOS, so str(path).startswith("/tmp/") is always False for temp-dir paths. The "Accept Edits" (workspace_session) mode silently refused to auto-approve every /tmp write on macOS, breaking the documented behaviour and making the existing test fail on this platform. Fix: keep the raw expanded path (pre-resolve) for the /tmp prefix check and continue using the resolved form only for the cwd relative_to() call where symlink resolution is correct behaviour. * fix(kanban): single-row horizontal scroll for board columns Switch .hermes-kanban-columns from auto-fit CSS grid to a flex row with overflow-x: auto and a hidden scrollbar (scrollbar-width / ::-webkit-scrollbar), and pin .hermes-kanban-column to flex: 0 0 280px so columns sit side-by-side at a fixed width instead of wrapping into a 2xN grid. Page vertical scroll is unaffected: each column already caps at max-height: calc(100vh - 220px), so the container never grows tall enough to introduce its own vertical scrollbar. * fix(approval): surface pending-approval state with explicit marker visible to LLM When a tool call requires user approval in the non-blocking gateway path, the LLM previously received a result that was indistinguishable from a failed tool call (exit_code=-1, error=message). The LLM could not tell whether the tool was pending approval, had returned empty results, or had failed silently — causing it to burn context on wrong hypotheses. Fix changes the result format to include: - status: pending_approval (clear state name) - approval_pending: True (explicit boolean for LLMs to detect) - error: cleared to empty string (removes misleading error signal) This lets the LLM reason about approval latency vs actual errors, short-circuiting the previous silent failure mode.
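A rough sketch of the result shape listed above. The three keys `status`, `approval_pending`, and `error` come from the commit message; the `message` key and both function names are assumptions added for illustration.

```python
from typing import Any, Dict


def pending_approval_result(message: str) -> Dict[str, Any]:
    """Build a tool result for a call that is waiting on user approval."""
    return {
        "status": "pending_approval",  # clear state name
        "approval_pending": True,      # explicit boolean for the LLM to detect
        "error": "",                   # cleared so it does not look like a failure
        "message": message,            # hypothetical: human-readable detail
    }


def is_pending_approval(result: Dict[str, Any]) -> bool:
    """Tell a pending approval apart from a genuine tool failure."""
    return result.get("approval_pending") is True
```

A consumer can branch on `is_pending_approval(result)` before treating an empty or error-like result as a failure.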
Fixes #14806 * fix: recognize emoji and caret as natural response endings GLM models via Ollama report finish_reason='stop' even when the response was truncated by max_tokens. The continuation mechanism uses _has_natural_response_ending() as one of the heuristics to detect whether the response was genuinely finished. Currently only ASCII punctuation and CJK punctuation are recognized. This means any response ending with an emoji (e.g. ⚡, 👍) or the caret character ^ (common in French ^^ smiley) is not recognized as naturally ended, triggering a false-positive continuation where the model receives 'Continue where you left off' and produces garbled output. Add: - ^ (caret) to the punctuation set - Unicode emoji range (codepoint >= 0x1F300) as natural ending This only affects GLM/Ollama users but the fix is safe for all backends since _has_natural_response_ending() is only consulted inside the continuation flow. * chore(release): pre-stage AUTHOR_MAP for May 2026 LHF batch group 8 (#28328) Pre-stages AUTHOR_MAP entries for 10 new contributors whose PRs are being salvaged in the May 2026 low-hanging-fruit batch (group 8). Lands ahead of the per-PR salvage PRs so they don't get blocked by AUTHOR_MAP CI. Contributors: - AceWattGit (#28159 — _pool_may_recover_from_rate_limit NameError) - YuanHanzhong (#28032 — x.com/status fallbacks link-like) - colin-chang (#28245, #28249, #28251 — gateway + mattermost fixes) - felix-windsor (#28019 — preserve cron asterisks in strip mode) - houenyang-momo (#28205 — charizard completion menu contrast) - iqdoctor (#28095 — windows installer docs) - joe102084 (#28151 — whitespace-only cron responses) - jvinals (#27936 — Slack U-IDs → DM channel) - maxmilian (#28267 — ModelPickerDialog portal) - samggggflynn (#27952 — dingtalk pre_start) Per references/batch-pr-salvage-may14-additions.md. 
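The natural-ending rule from the emoji/caret fix above can be sketched like this. It is a simplified stand-in for `_has_natural_response_ending()`, not the actual function: the exact ASCII and CJK punctuation sets are not given in the commit message, so the ones below are assumptions. Only the caret and the `>= 0x1F300` threshold come from the text.

```python
# Assumed punctuation sets; the real ones are not listed in the commit message.
ASCII_ENDINGS = set(".!?:;)\"'")
CJK_ENDINGS = set("\u3002\uff01\uff1f")  # ideographic full stop, fullwidth ! and ?
NATURAL_ENDINGS = ASCII_ENDINGS | CJK_ENDINGS | {"^"}  # caret added by the fix
EMOJI_START = 0x1F300  # threshold stated in the commit message


def has_natural_response_ending(text: str) -> bool:
    """Return True if the last visible character looks like a finished reply."""
    stripped = text.rstrip()
    if not stripped:
        return False
    last = stripped[-1]
    return last in NATURAL_ENDINGS or ord(last) >= EMOJI_START
```

Under a literal reading of the stated threshold, a character below 0x1F300 such as U+26A1 (the high-voltage sign) would not count as a natural ending in this sketch, so the real implementation may cover a wider range than shown here.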
* fix: add pre_start() to _IncomingHandler for dingtalk SDK compatibility The dingtalk-stream SDK calls pre_start() on every registered handler before opening the WebSocket connection. Without this method, the SDK raises AttributeError and kills the stream connection, causing DingTalk to be unable to connect via Stream Mode. * fix(windows): handle redirected stdout in _cprint fallback Wraps _pt_print in try/except with a print() fallback. When a kanban worker's stdout is piped to a log file, prompt_toolkit raises NoConsoleScreenBufferError (Windows) or OSError (other) because there is no real console buffer. The fallback keeps worker output flowing instead of crashing. * chore(release): alias stale-ID salvage commit for @Grogger (#28334) PR #28330 was salvaged with a wrong noreply numeric ID (18091625 vs the correct 7065068). The commit on main is correctly authored to Grogger by username, but neither noreply form was in AUTHOR_MAP. Adds both so release-notes generation maps them to @Grogger. * fix(aux): remove stale session_search model menu entry * fix(tui): keep x status citation fallbacks link-like * fix(xai-oauth): quarantine dead tokens on terminal refresh failure resolve_xai_oauth_runtime_credentials() called _refresh_xai_oauth_tokens() with no try/except. A terminal refresh failure (HTTP 400/401/403 — invalid_grant, token revoked) propagated without clearing the dead access_token / refresh_token from auth.json, causing every subsequent session to retry the same doomed network request. Add a try/except around the refresh call that mirrors the existing credential_pool.py quarantine: when _is_terminal_xai_oauth_refresh_error identifies a non-retryable failure, clear the dead token fields from auth.json and write a last_auth_error diagnostic marker so future calls fail fast with a clear relogin_required error instead of hitting the network. 
active_provider is preserved (set_active=False) so multi-provider users whose chosen provider is not xai-oauth are unaffected. Tests: two new cases in test_auth_xai_oauth_provider.py cover terminal quarantine and transient pass-through. * feat(bg-review): add bundled/pinned skill protection rules to review prompts (#27644) The background review prompts (_SKILL_REVIEW_PROMPT and _COMBINED_REVIEW_PROMPT) now include explicit protection rules for bundled, hub-installed, and pinned skills — aligning with the curator's existing policy at curator.py L345/350. Before this change, bg-review could freely rewrite bundled skills like 'hermes-agent' or pinned skills, while the 7-day curator explicitly skips them. The review agent now sees: • Bundled skills (shipped with Hermes) • Hub-installed skills (installed via hermes skills install) • Pinned skills (marked via hermes curator pin) If only protected skills need updating, the review says 'Nothing to save.' and stops. Fixes #27644 * fix(web): portal Change Model modal so it renders above the app sidebar The dashboard's main column is `relative z-2` (App.tsx), which creates a stacking context that traps fixed descendants below the app sidebar (`z-50`). `ModelPickerDialog` renders `fixed inset-0 z-[100]` inline, so its z-100 is scoped to z-2 and the sidebar covers its left edge. The bug is visible across all themes but only obvious in the Large theme variants (Hermes Teal (Large), etc.) where the larger root font widens the dialog into the sidebar's column. Toast.tsx already documents the same trap and uses the same `createPortal(..., document.body)` escape. This commit ports the picker; the same pattern affects other inline z-[100] modals in the dashboard (OAuthLoginModal, Cron / Models / Profiles page modals) and is left for a follow-up — keeping this PR scoped to the reporter's specific case. 
Fixes #28103 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(gateway): exit code 75 on service restart so launchd relaunches When the gateway receives SIGUSR1 (graceful restart via launchd_restart), the SIGUSR1 handler calls request_restart(via_service=True) and the gateway shuts down cleanly with exit code 0. However, the generated launchd plist uses KeepAlive → SuccessfulExit → false, meaning launchd only relaunches on *non-zero* exit codes. A clean exit(0) is treated as "successful, don't restart", so the gateway stays down after /restart, /update, or SIGUSR1. The systemd unit template already uses RestartForceExitStatus=75 for the same scenario. Mirror that convention: when _restart_via_service is True, raise SystemExit(75) so launchd's SuccessfulExit=false policy triggers a relaunch. Closes #28135 * fix: guard json.loads() against invalid TTS and skill_view responses Two code paths call json.loads() on output from external tools without catching JSONDecodeError. If the tool returns a non-JSON string (error message, empty string, or None), the entire call path crashes. 1. gateway/run.py — text_to_speech_tool() result in voice reply path. A TTS failure that returns an error string instead of JSON crashes the voice reply handler, killing the message response entirely. 2. cron/scheduler.py — skill_view() result when loading skills for cron jobs. A corrupted or missing skill file that returns an error string instead of JSON crashes the cron tick, preventing all jobs from executing that cycle. Both fixes catch (json.JSONDecodeError, TypeError), log a warning, and gracefully skip the failed operation instead of crashing. * fix(gateway): bridge gateway_restart_notification from YAML platform sections Two related bugs in gateway/config.py prevented per-platform gateway_restart_notification from working through config.yaml: 1. 
The shared-key bridging loop (load_gateway_config) omitted 'gateway_restart_notification', so the key never landed in platform_data['extra'] even when set under e.g. 'discord:' or 'mattermost:' sections. 2. PlatformConfig.from_dict() only read gateway_restart_notification from the top-level data dict, ignoring the 'extra' sub-dict where bridged keys are stored. Fix: add the key to the bridging loop, and add an 'extra' fallback in from_dict() so that round-tripped values (YAML → bridged → extra → from_dict) resolve correctly. Impact: users can now set gateway_restart_notification: false per platform in config.yaml instead of relying on env vars or the global platforms: block. * feat(kanban): add auto_promote_children config toggle When the kanban auto-decomposer fans a triage task into child tasks, recompute_ready() immediately promotes parent-free children to 'ready' so the dispatcher picks them up. Some users want a manual workflow where children stay in 'todo' for review before dispatch. Add 'kanban.auto_promote_children' config key (default: true): - false: children stay in 'todo' after decomposition - true: existing behavior (auto-promote to 'ready') Changes: - kanban_db.py: decompose_triage_task() gains auto_promote param - kanban_decompose.py: reads auto_promote_children from config - kanban dashboard API: exposes the new setting in GET/PUT /orchestration Closes #28016 * fix: wrap _pool_may_recover_from_rate_limit call through run_agent namespace The conversation_loop.py references _pool_may_recover_from_rate_limit which was defined in run_agent.py. After the conversation-loop extraction refactor, the helper was no longer in the same module scope. Wrap the call as _ra()._pool_may_recover_from_rate_limit() to route through the run_agent monkeypatch namespace where the helper is available. Adds regression test in test_gemini_fast_fallback.py. Fixes: MAILROOM Email Triage NameError, OPS Execution Monitor NameError. 
* fix(tui): improve charizard completion menu contrast * docs(windows): avoid piping installer directly into iex * fix(agent): add qwen and deepseek to TOOL_USE_ENFORCEMENT_MODELS Qwen3.x and DeepSeek-V3.x default to chatty/hallucinatory tool use without enforcement steering — agents narrate "calling tool X" without actually emitting a tool call, or run partial loops. Both model families fit the same failure pattern TOOL_USE_ENFORCEMENT_GUIDANCE was already injected for (gpt, codex, gemini, gemma, grok, glm). Co-authored-by: briandevans <252620095+briandevans@users.noreply.github.com> Squashed salvage of: - 403e567ce fix(agent): add qwen and deepseek to TOOL_USE_ENFORCEMENT_MODELS - 9433eabe7 test(agent): use realistic qwen-plus identifier in enforcement test Fixes #28079. * fix(send_message): resolve Slack user IDs to DM channel IDs The _SLACK_TARGET_RE regex only matched IDs starting with C (channel), G (group), or D (direct message). Slack user IDs start with U, causing 'Could not resolve' errors when trying to send DMs to specific users. Changes: - Expand _SLACK_TARGET_RE to accept U-prefixed IDs (user IDs) - Add conversations.open fallback to resolve user IDs to DM channel IDs before sending, since chat.postMessage requires a conversation ID Fixes #ISSUE_NUMBER * fix(gateway): tighten MEDIA extraction regex + silent skip on file-not-found Three related fixes for the MEDIA:<path> extraction pipeline that caused 'file not found' noise in platform channels: 1. run.py — tighten tool-result MEDIA regex from \S+ (any non-whitespace) to require a path pattern with known extensions. Prevents LLM-generated placeholder paths like 'MEDIA:/path/to/example.mp4' from being captured as real media. 2. base.py — remove the |\S+ fallback in extract_media() that catches anything non-whitespace as a potential MEDIA path. This was the primary cause of false positives — strings like '' in tool output were captured as MEDIA: paths. 3.
mattermost.py — replace the file-not-found error message sent to the channel with a silent logger.warning() skip. When a path extracted by MEDIA doesn't exist on disk, the channel no longer gets a noisy '(file not found: ...)' message. Impact: eliminates the persistent 'file not found' spam in Mattermost channels caused by over-broad MEDIA regex patterns matching non-path text in tool output. * fix(xai-oauth): split 403 (tier/entitlement) from 400/401 in token endpoint xAI's token endpoint returns HTTP 403 to the OAuth grant when the account isn't on the allowlist for API access (e.g. standard SuperGrok subscribers — see #26847). Treating it like a stale-token 400/401 made ``format_auth_error`` append "Run ``hermes model`` to re-authenticate", which is misleading because re-login can't change xAI's tier decision. Split 403 off in both ``refresh_xai_oauth_pure`` and the loopback login token exchange: * New error code ``xai_oauth_tier_denied`` with ``relogin_required=False`` * Message explains the entitlement gate and points at the ``XAI_API_KEY`` + ``provider: xai`` fallback * 400/401 still set ``relogin_required=True`` as before * 5xx still set ``relogin_required=False`` as before * fix(run-agent): treat any 403 on xai-oauth as entitlement to stop refresh-loop The existing ``_is_entitlement_failure`` heuristic only fires when the response body contains specific substrings ("do not have an active Grok subscription", etc.). xAI has been seen to 403 standard SuperGrok subscribers with a terser body that doesn't match those keywords (#26847), and the recovery path would then mint a fresh token, get a fresh 403, and loop until Ctrl+C. Add a defense-in-depth check at the recovery call site: any 403 on ``provider == "xai-oauth"`` short-circuits ``try_refresh_current`` so the error surfaces immediately with the friendly hint from ``_summarize_api_error``. Keeps the existing keyword path for all other providers untouched. 
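The recovery-site check described in the two xai-oauth commits above might look roughly like this. It is a sketch under assumptions: `should_skip_refresh` is a hypothetical name, and the keyword list holds only the one substring quoted in the commit message, while the real `_is_entitlement_failure` heuristic checks more.

```python
# Only the substring quoted in the commit message; the real list is longer.
ENTITLEMENT_KEYWORDS = ("do not have an active grok subscription",)


def is_entitlement_failure(body: str) -> bool:
    """Keyword heuristic on the response body (simplified stand-in)."""
    lowered = body.lower()
    return any(keyword in lowered for keyword in ENTITLEMENT_KEYWORDS)


def should_skip_refresh(provider: str, status_code: int, body: str) -> bool:
    """Decide whether a token refresh would be pointless for this error.

    Any 403 from xai-oauth is treated as an entitlement decision, even if
    the body is terse. Other providers keep the keyword path only.
    """
    if provider == "xai-oauth" and status_code == 403:
        return True
    return is_entitlement_failure(body)
```

The point of the extra branch is that a fresh token cannot change a tier decision, so refreshing on a bare 403 would only repeat the same failure.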
* test(xai-oauth): pin tier-denied 403 behavior + docs warning for #26847 Tests: * ``test_refresh_xai_oauth_pure_403_marked_tier_denied_not_relogin`` — refresh-403 raises ``xai_oauth_tier_denied`` with ``relogin_required=False`` and the API-key fallback hint in body. * ``test_format_auth_error_tier_denied_does_not_suggest_relogin`` — the renderer does not append "Run ``hermes model``" for the new code. * ``test_recover_with_credential_pool_skips_refresh_on_bare_403_for_xai_oauth`` — bare ``{"reason":"forbidden","message":"Forbidden"}`` body (which does not match the existing keyword heuristic) still short-circuits ``try_refresh_current`` on xai-oauth. Docs: * Drop the "(any active tier)" claim from the xai-grok-oauth guide, add a top-of-page warning callout, and a Troubleshooting section for the 403-after-login case pointing at ``XAI_API_KEY`` + ``provider: xai`` as the documented fallback. * fix: handle whitespace-only cron responses * fix(cli): preserve cron asterisks in strip mode * fix(mattermost): resolve thread root_id and route progress to threads Two Mattermost thread-related bugs: 1. _resolve_root_id() — Mattermost CRT requires root_id to be the thread root post. Using any reply's own ID as root_id causes '400 Invalid RootId'. Add _resolve_root_id() that walks up the post chain via API to find the actual root, and apply it in send(), _send_url_as_file(), and _send_local_file(). 2. _progress_reply_to — The condition in run.py only checked Platform.FEISHU, missing Mattermost entirely. This caused tool progress messages to always land in the main channel instead of the thread. Add Platform.MATTERMOST to the condition so progress messages are routed to threads when reply_mode=thread. Impact: Tool progress messages now appear in Mattermost threads instead of flooding the main channel; thread replies no longer fail with Invalid RootId when the reply target is itself a reply. 
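The root lookup in the first Mattermost fix can be sketched as a short walk up the post chain. This is not the adapter's `_resolve_root_id()`: the `fetch_post` callable is a hypothetical stand-in for the API lookup, and the `max_hops` guard is an added assumption to keep the loop bounded.

```python
from typing import Callable, Dict


def resolve_root_id(
    post_id: str,
    fetch_post: Callable[[str], Dict[str, str]],
    max_hops: int = 10,
) -> str:
    """Follow root_id links until reaching a post that has none.

    fetch_post(post_id) is assumed to return a dict with a "root_id" key
    that is empty for thread roots.
    """
    current = post_id
    for _ in range(max_hops):
        post = fetch_post(current)
        parent = post.get("root_id") or ""
        if not parent:
            return current  # this post is the thread root
        current = parent
    return current  # give up after max_hops and use the last post seen


if __name__ == "__main__":
    posts = {
        "root": {"id": "root", "root_id": ""},
        "reply": {"id": "reply", "root_id": "root"},
    }
    print(resolve_root_id("reply", posts.__getitem__))
```

Passing the lookup in as a callable keeps the sketch testable without a live server.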
* feat(kanban): archive --rm to hard-delete archived tasks Salvages #19964 by @Beandon13. Adds `hermes kanban archive --rm` to permanently remove already-archived tasks with cascading cleanup of links, comments, events, runs, and notify-subs. Safety guard: only archived tasks can be deleted; active/blocked/done must be archived first. Cherry-picked from #19964 onto current main (severe stale base, applied manually to preserve substance only). * feat(proxy): add xai upstream adapter for Grok via OAuth * chore(release): map @yannsunn for PR #28064 xai proxy adapter salvage * docs(skill): align kanban dispatcher failure_limit text with current default * fix(oauth): add manual-paste fallback for browser-only remote consoles xAI Grok OAuth (and Spotify) use a loopback redirect to ``http://127.0.0.1:<port>/callback`` to capture the authorization code. That works when the browser and Hermes run on the same machine, and the SSH tunnel recipe handles the regular remote case. It breaks completely on **browser-only remote consoles** (GCP Cloud Shell, GitHub Codespaces, AWS EC2 Instance Connect, Gitpod, Replit, …) where the user has a browser but no real SSH client to forward a port — the redirect to 127.0.0.1 on the remote VM simply isn't reachable from the laptop, and there's nothing the existing flow can do about it (#26923). This commit adds the foundation for a manual-paste fallback: * ``_is_remote_session`` now also recognises Cloud Shell, Codespaces, Gitpod, Replit, StackBlitz (in addition to SSH), so the existing tunnel hint at least fires in those environments. * ``_parse_pasted_callback`` accepts any of: a full ``http(s)://...?code=...&state=...`` URL, a bare ``?code=...`` query string, a bare ``code=...&state=...`` fragment, or a bare opaque code value. Returns the same dict shape the HTTP callback handler produces, so the caller's state / error validation works unchanged (no CSRF bypass). 
* ``_prompt_manual_callback_paste`` reads stdin with a clear multi-line explanation of what's happening and what to paste. * ``_xai_oauth_loopback_login`` gains a ``manual_paste`` kwarg that skips the HTTP listener entirely. The redirect_uri, PKCE verifier, state, and nonce are byte-identical to the loopback path so xAI's token endpoint can't tell the difference at the protocol level. * ``_print_loopback_ssh_hint`` now also mentions ``--manual-paste`` so users without a real SSH client see a path forward instead of a dead-end tunnel recipe. * ``_login_xai_oauth`` threads ``args.manual_paste`` into the loopback helper. * feat(cli): wire --manual-paste into ``hermes auth add`` and ``hermes model`` Register the new ``--manual-paste`` flag on both entry points and thread it through to the xAI loopback login: * ``hermes auth add xai-oauth --manual-paste`` — pool-add path, forwarded inside ``auth_commands.handle_auth_add``. * ``hermes model --manual-paste`` — model-picker path, forwarded by ``_model_flow_xai_oauth`` into the synthetic ``argparse.Namespace`` it passes to ``_login_xai_oauth``. The picker also now forwards ``--no-browser`` and ``--timeout`` for consistency (previously hardcoded to defaults regardless of CLI flags). Help text on both flags points at #26923 and names the browser-only remote consoles (Cloud Shell, Codespaces, EC2 Instance Connect) so users searching ``hermes --help`` can find the workaround. * test+docs(oauth): pin manual-paste semantics and document browser-only path (#26923) Tests (``tests/hermes_cli/test_auth_manual_paste.py``): * 9 parametrised + scalar cases for ``_is_remote_session`` covering the new Cloud Shell / Codespaces / Gitpod / Replit / StackBlitz env vars (plus the existing SSH ones). * 9 cases for ``_parse_pasted_callback`` covering every paste form (full URL, https URL with extra params, bare ``?code=...``, bare ``code=...`` fragment, bare opaque value, error+description, empty, whitespace-only, malformed URL). 
* 3 cases for ``_prompt_manual_callback_paste`` (happy path, EOF, Ctrl-C). * 3 end-to-end ``_xai_oauth_loopback_login(manual_paste=True)`` cases: the HTTP server MUST NOT be started (asserted via a callable that raises if invoked), wrong state still rejected with ``xai_state_mismatch`` (no CSRF bypass), and empty paste surfaces ``xai_code_missing``. * SSH-hint mention test ensures the ``--manual-paste`` instruction is printed in the remote-session hint. Docs: * ``oauth-over-ssh.md`` — new "Browser-only remote (Cloud Shell / Codespaces / EC2 Instance Connect)" section with the ``--manual-paste`` recipe, plus a TL;DR note for the new flag. * ``xai-grok-oauth.md`` — short subsection pointing at the same recipe and the OAuth-over-SSH guide anchor. * docs(kanban): document max-retries task override * docs(kanban): document inline create shortcuts * test(kanban): cover default board dashboard pin * docs: ignore box diagrams in ascii guard Wrap existing box-drawing diagrams with ascii-guard markers so docs-site checks pass when website docs are touched. Co-authored-by: Cursor <cursoragent@cursor.com> * feat: per-task model override for kanban workers - Add model_override field to Task class and tasks schema - Add migration for existing databases - Spawn worker with -m model when model_override is set * test(kanban-dashboard): cover _task_dict task_age fallback The fix in 061a1830 added an outer try/except in plugin_api._task_dict so that a future failure mode in kanban_db.task_age (anything _safe_int doesn't already absorb) cannot 500 the GET /board response. The _safe_int / task_age corruption paths got regression coverage in tests/hermes_cli/test_kanban_db.py, but the OUTER fallback contract remained untested -- meaning a refactor that drops the try/except would not be caught by CI. 
Pin that contract from both consumers of _task_dict: - GET /board returns 200 with the literal fallback age dict for the affected card (other cards continue to render via the same path) - GET /tasks/:id (drawer view) returns 200 with the same fallback, so a single corrupt task can't block its own drawer Both tests force task_age to raise RuntimeError rather than ValueError on '%s', because ValueError is absorbed by _safe_int and never reaches the outer try/except -- testing that path would only re-cover what test_kanban_db.py already pins. Manually verified the regression discipline: git checkout 061a1830^ -- plugins/kanban/dashboard/plugin_api.py pytest -k task_age_exception # both FAIL with 500 git checkout HEAD -- plugins/kanban/dashboard/plugin_api.py pytest -k task_age_exception # both PASS * fix(kanban): clear _INITIALIZED_PATHS in remove_board so recycled DBs re-init schema Archiving or deleting a board via remove_board() leaves the path's "schema already initialized" entry in the module-level cache. A concurrent connect(board=<slug>) call (e.g. the dashboard event-stream poll loop) then: 1. resolves the same kanban.db path, 2. recreates the directory + an empty sqlite file because connect() does mkdir(parents=True, exist_ok=True), 3. skips the CREATE TABLE pass because the cache entry says the schema is already in place, 4. errors on the next read with `no such table: task_events`. Drop the cache entry before mutating the filesystem so the fresh file gets a proper schema init on next connect(). Applies to both archive=True (rename) and archive=False (rmtree) branches. Fixes #23833. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(web): add Cache-Control: no-store to plugin static file serving Prevents browser caching of stale dashboard plugin JS files that may contain bugs already fixed upstream (e.g. COLUMN_LABEL undefined). * fix(kanban): seed bundled skills (e.g. 
kanban-worker) on kanban init Closes #23725 * fix(kanban): ignore stale HERMES_KANBAN_BOARD for removed boards * fix(kanban): keep board-management commands independent from board override * fix(kanban): preserve notifier_profile for dashboard home subscriptions * fix(kanban): promote dependents when a parent is archived * fix(cli): make kanban specify max_tokens configurable * fix(kanban): sync slash subcommands with live parser * fix(kanban): promote blocked tasks when parent dependencies complete recompute_ready only scanned 'todo' tasks for promotion, ignoring 'blocked' tasks entirely. When a task was blocked (e.g. by the circuit breaker) and its parent dependencies later completed, the task stayed stuck in 'blocked' forever unless manually unblocked. Now recompute_ready also scans 'blocked' tasks. When all parents are done/archived, the blocked task is promoted to 'ready' with failure counters reset — equivalent to an automatic unblock. Includes a regression test for the blocked-parent-done promotion path. * fix(kanban): use 'is not None' check for max_runtime_seconds in create_task max_runtime_seconds=0 was being silently coerced to None due to a falsy check (if max_runtime_seconds). Zero is a valid value that causes the dispatcher to immediately time out a task. The adjacent max_retries parameter already used the correct 'is not None' pattern. Fixes the inconsistency by aligning max_runtime_seconds with max_retries. * fix(kanban): reset failure counters on unblock_task When a task is manually unblocked (blocked → ready/todo), the consecutive_failures counter and last_failure_error were left intact. The next failure would immediately re-trip the circuit breaker because the counter was still at or above the failure limit. Reset both fields on unblock so the task gets a fresh retry budget. Includes a regression test that verifies counters are zeroed. 
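The unblock fix above is easier to see with a toy model of the circuit breaker. The following is a minimal sketch, not the code in kanban_db: the `Task` dataclass, `record_failure`, the `reset_counters` switch, and the `FAILURE_LIMIT` value are assumptions made for illustration. Only `consecutive_failures`, `last_failure_error`, and `unblock_task` are names taken from the commit message.

```python
from dataclasses import dataclass
from typing import Optional

FAILURE_LIMIT = 2  # assumed limit, for illustration only


@dataclass
class Task:
    status: str = "ready"
    consecutive_failures: int = 0
    last_failure_error: Optional[str] = None


def record_failure(task: Task, error: str, failure_limit: int = FAILURE_LIMIT) -> None:
    """Count a failure and trip the breaker once the limit is reached."""
    task.consecutive_failures += 1
    task.last_failure_error = error
    if task.consecutive_failures >= failure_limit:
        task.status = "blocked"


def unblock_task(task: Task, reset_counters: bool = True) -> None:
    """Move a blocked task back to ready.

    reset_counters=False models the old behaviour, where stale counters
    stayed behind and the very next failure re-tripped the breaker.
    """
    task.status = "ready"
    if reset_counters:
        task.consecutive_failures = 0
        task.last_failure_error = None
```

With `reset_counters=False` a single new failure re-blocks the task immediately, whereas with the reset the task gets its full retry budget back.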
* fix(kanban): fingerprint crash errors to prevent fleet-wide retry exhaustion

  When a systemic failure (provider outage, auth expiry, OOM) crashes multiple workers simultaneously, detect_crashed_workers increments each task failure counter independently. The circuit breaker only trips after N × failure_limit retries across the fleet.

  Fingerprint crash errors by normalizing host-specific details (PIDs, timestamps). When 3+ tasks crash with the same fingerprint in a single detection cycle, immediately trip the circuit breaker (failure_limit=1) instead of waiting for repeated failures. Isolated crashes (unique fingerprints) retain their normal retry budget. Protocol violations continue to trip immediately.

  Includes regression tests for systemic and isolated crash paths.

* fix(kanban): align board_exists with board discovery rules

* fix(kanban): demote ready children when a parent is reopened

* fix(kanban): serialize DB initialization

* fix(kanban): task_age() tolerates ISO-8601 timestamps

  Prevents ValueError crash in dashboard get_board() when a task has an ISO timestamp (e.g. "2026-05-10T15:00:00Z") instead of a unix epoch int. Adds _to_epoch() helper that normalises both formats.

* Fix Kanban dashboard initial board selection

* fix(kanban): persist worker session metadata on completion

  Salvages #25579 by @wesleysimplicio. Stamps task_runs.metadata.worker_session_id from HERMES_SESSION_ID on kanban_complete. Cherry-picked the substantive commit (not the AUTHOR_MAP fixup tip) onto current main.
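The task_age() fix above describes a helper that accepts either a unix epoch or an ISO-8601 string. A minimal sketch of that idea follows. It is not the actual `_to_epoch()` from the PR; the function name `to_epoch` and the choice to treat naive timestamps as UTC are assumptions.

```python
from datetime import datetime, timezone


def to_epoch(value) -> int:
    """Normalise an epoch number, numeric string, or ISO-8601 string to unix seconds."""
    if isinstance(value, (int, float)):
        return int(value)
    text = str(value).strip()
    try:
        return int(float(text))  # numeric strings such as "1700000000"
    except (ValueError, OverflowError):
        pass
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return int(parsed.timestamp())
```

Anything that is neither numeric nor valid ISO-8601 still raises ValueError here, so a caller would need its own fallback for truly corrupt values.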
* fix(kanban): make claim ttl configurable

  Co-Authored-By: Paperclip <noreply@paperclip.ing>

* fix(kanban): pass accept-hooks to worker chat subprocess

* feat(kanban): add board-level default workdir (#25430)

* docs(kanban-worker): document notification routing configuration

* fix(kanban): preserve worker tools with restricted toolsets

* fix(kanban): make legacy task migration idempotent

  (cherry picked from commit 293f1c3a7241b0117669e049d9aa746c9645ac90)

* fix: harden Kanban worker Hermes command resolution

* feat(kanban): allow trimmed task comments

  SS-1647 live SHIP validation: real code + tests for kanban comment --max-len.

* fix: show scheduled kanban tasks in dashboard

* fix: assign single-task kanban decompositions

* fix(kanban-dashboard): make Orchestration mode checkbox label static

  The checkbox label echoed its state ("Auto (default)" / "Manual") instead of describing the action, so a checked box reading "Auto" parsed as a status indicator rather than a control. The accompanying sub-description was also static and started with "When on, ...", which read awkwardly when the box was unchecked.

  Replace the dynamic label with a static action label ("Auto-decompose triage tasks") and flip the sub-description between the two modes so it stays accurate either way. The top-of-page Orchestration pill is unchanged; that one is intentionally a status badge / toggle.

  Fixes #28178

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(env): add HERMES_KANBAN_DISPATCH_IN_GATEWAY override (#21956)

  Salvages the env-vars docs portion of #21956 by @Bartok9. The ascii-guard-ignore tags from the original PR already landed on main.

* fix(kanban): close sqlite connection on init failure to prevent fd leak

  Salvages #28301 by @Ade5954. If WAL setup, PRAGMA application, or schema init raises after sqlite3.connect() succeeds, the new connection was leaking. Wrap the body in try/except so the connection is closed before the exception propagates.
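The fd-leak fix above follows a common pattern: once sqlite3.connect() has succeeded, any later failure must close the connection before re-raising. The sketch below illustrates that pattern under assumptions. `connect` here is a stand-alone toy function and `init_schema` is a hypothetical callback, not the signatures used in kanban_db.

```python
import sqlite3
from typing import Callable


def connect(path: str, init_schema: Callable[[sqlite3.Connection], None]) -> sqlite3.Connection:
    """Open a database and initialise it, closing the handle if setup fails."""
    conn = sqlite3.connect(path)
    try:
        conn.execute("PRAGMA journal_mode=WAL")
        init_schema(conn)
    except Exception:
        # Without this close, the file descriptor would stay open until the
        # connection object happened to be garbage-collected.
        conn.close()
        raise
    return conn
```

The original exception still propagates unchanged, so callers see the same error as before; the only difference is that the connection no longer outlives the failure.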
* fix(kanban): don't crash dispatched workers when kanban-worker skill is absent Salvages #27372 by @oemtalks. The dispatcher unconditionally injected `--skills kanban-worker` into every worker spawn, but worker profiles sometimes don't have that bundled skill in their skills dir, which is fatal at CLI startup (`ValueError: Unknown skill(s): kanban-worker`). Adds `_kanban_worker_skill_available(hermes_home)` and only injects the flag when the skill resolves. The MANDATORY lifecycle still ships via KANBAN_GUIDANCE in the system prompt, so omitting the flag is safe. * fix(packaging): ship dashboard plugin assets in wheel Salvages #23737 by @LeonSGP43. Adds plugins/* manifest.json and dist/ glob entries to setuptools package-data so wheel installs ship the bundled dashboard plugin assets (kanban, achievements, etc.). Without these, /api/dashboard/plugins can't discover plugin assets outside a source checkout. * docs(kanban): document worker protocol auto-blocks Salvages #21585 by @helix4u. Documents the protocol_violation event (worker exits successfully while task is still running), adds --max-retries to the create flag list and --failure-limit to dispatch. * fix(oneshot): pass fallback_providers from profile config to AIAgent Salvages #23368 by @uzunkuyruk. Oneshot workers (e.g. kanban workers spawned via 'hermes -p <profile> chat -q ...') were not honouring the profile's fallback_providers / fallback_model chain because oneshot.py never read the config and never passed fallback_model= to AIAgent. Reads cfg.get('fallback_providers') (new list format) or cfg.get('fallback_model') (legacy single-dict) with the same normalization cli.py applies, then forwards as fallback_model=_fb. * fix(kanban): reject direct running transitions in dashboard bulk updates Salvages #24050 by @kronexoi. The single-task PATCH already rejects direct status='running' since it bypasses the dispatcher/claim invariant, but the bulk-update endpoint still accepted it. 
Aligns bulk with single by emitting an error result row for any 'running' entry. * feat(kanban): add initial-status for human-ops cards Salvages #27526 by @shunsuke-hikiyama. Adds an --initial-status flag (running|blocked, default running) to 'kanban create', threaded through kanban_db.create_task() and the kanban_create tool schema. 'blocked' parks the task directly in the blocked column for R3 human-ops review, skipping the brief running-to-blocked transition. Dropped the unrelated 'add' alias, WIFEXITED Windows compat, and slash-handler error formatting changes that were bundled in the original PR — those should ship as their own focused changes if still wanted. * fix(kanban): release scratch workspace and tmux session on task completion Salvages #27369 by @LeonJS. complete_task() now calls _cleanup_workspace() and _cleanup_worker_tmux() after marking a task complete. Scratch workspaces (used by swarm agents) accumulate on disk — hundreds of MB per task, never released. Stale tmux sessions from completed agents also persist indefinitely. Both gates are safe: - workspace_kind == 'scratch' gate preserves user worktree/dir workspaces - tmux #{pane_dead} == 1 gate only kills sessions where the worker has already exited - best-effort: cleanup failures never block task completion * fix(kanban): honor severity thresholds in diagnostics Salvages #26431 by @LeonSGP43. Dashboard plugin_api list_diagnostics was using exact-match (severity == filter), so '--severity warning' hid 'error' and 'critical' diagnostics. Adds severity_at_or_above() helper to kanban_diagnostics and uses it in the dashboard endpoint (CLI already used SEVERITY_ORDER comparison correctly). * test: isolate Kanban env pins in hermetic fixture Salvages the substantive part of #22295 by @steezkelly. 
Adds the missing HERMES_KANBAN_HOME, HERMES_KANBAN_RUN_ID, HERMES_KANBAN_CLAIM_LOCK, HERMES_KANBAN_DISPATCH_IN_GATEWAY entries to _HERMES_BEHAVIORAL_VARS so ambient developer-shell pins on those vars don't bleed into pytest runs. The frozenset extraction + standalone regression test from the original PR were dropped to keep the change minimal — main already maintains the list inline. * feat(kanban): add max_in_progress config to cap concurrent running tasks Salvages #22981 by @SimbaKingjoe. Adds 'kanban.max_in_progress' config that caps simultaneously running tasks. When the board already has N running, dispatcher skips spawning so slow workers (local LLMs, resource-constrained hosts) don't pile up and time out. Threads through dispatch_once(max_in_progress=) and gateway dispatcher config parsing with validation (warns on invalid/below-1 values). * fix(packaging): ship bundled skills in wheel Salvages #23738 by @LeonSGP43. Wheel installs were missing skills/ and optional-skills/ because pyproject's [tool.setuptools.packages.find] only includes Python packages — the skills directories don't have __init__.py so they were silently dropped from the wheel. Adds setup.py with data_files spec emitting skills/* and optional-skills/* under hermes_agent-<v>.data/data/, and a get_bundled_skills_dir() helper in hermes_constants that discovers the wheel-installed location via sysconfig before falling back to a source-checkout path. tools/skills_sync uses the helper so 'hermes update' works for pip-installed users. * fix: 4 small surgical bugs Salvages #23302 by @Bartok9. Four independent one-area fixes: 1. kanban boards delete alias now hard-deletes (not archives) — the alias didn't carry --delete, so getattr(args, 'delete', False) returned False. Detect boards_action=='delete' explicitly. 2. Gateway auto-title failures no longer leak as user-visible warnings — debug-log only since they're not actionable. 3. 
Background process completion notification snaps truncation to the next newline boundary, prepends a marker when content is dropped. 4. _cprint() schedules the run_in_terminal coroutine via asyncio.ensure_future so output isn't silently dropped from background threads (fixes #23185 Bug A). Skips the double-print fallback that would fire for mock paths. * perf(prompt): cache kanban worker guidance at session init Salvages #24402 by @RyanRana. The KANBAN_GUIDANCE block (~835 tokens) is session-static — the dispatcher decides at spawn time whether the process is a kanban worker via the kanban_show tool's check_fn (gated on HERMES_KANBAN_TASK env var). Re-checking 'kanban_show' in valid_tool_names and re-loading the reference on every system-prompt rebuild (init + each context compression) is wasted work. Caches the resolved string on agent._kanban_worker_guidance once in agent_init and consumes it in system_prompt.build_system_prompt(), with a getattr fallback for code paths that bypass agent_init. * feat(kanban): add --sort option to 'hermes kanban list' Salvages #25745 by @LizerAIDev. Adds --sort {created,created-desc, priority,priority-desc,status,assignee,title,updated} to 'hermes kanban list'. Validated against VALID_SORT_ORDERS map; invalid values raise ValueError. Default behaviour (priority DESC, created ASC) is unchanged when --sort is omitted. * docs: add kanban codex lane skill * feat(kanban): worker visibility endpoints (workers/active, runs/{id}, inspect) Adds three read-only endpoints to the kanban dashboard plugin so the SwitchUI workspace (and any other dashboard consumer) can track workers across tasks without N+1 round-trips through /tasks/{task_id}. - GET /workers/active Single SQL JOIN of task_runs + tasks where ended_at IS NULL, worker_pid IS NOT NULL, status='running'. Returns {workers: [...], count, checked_at}. - GET /runs/{run_id} Direct lookup of any task_run row by id. Reuses existing kanban_db.get_run() helper and _run_dict() serialiser. 
404 when not found. Mirrors GET /tasks/{task_id} 404 shape. - GET /runs/{run_id}/inspect Live PID stats via psutil.Process.as_dict() — cpu_percent, memory_rss_bytes, memory_vms_bytes, num_threads, num_fds, status, create_time, cmdline. Short-circuits with alive:false when run has ended, has no worker_pid, the pid is gone, or psutil is unavailable. AccessDenied surfaces as alive:true with error rather than a 500. 11 new tests in tests/plugins/test_kanban_worker_runs.py cover the empty-board case, running-task case, ended-run filtering, missing-pid filtering, 404 paths, already-ended inspect, no-pid inspect, dead-pid inspect, and live-pid inspect (psutil mocked). All pass. Companion termination endpoint (POST /runs/{run_id}/terminate) is intentionally out of scope here — opening a separate issue first since the RBAC and dispatcher-mediated soft-cancel design needs maintainer input before code. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(release): map contributor email for attribution check * test(kanban-dashboard): pin enriched 409 detail and inline error wiring (#26744) - Existing ``test_patch_drag_drop_move_todo_to_ready`` now asserts the enriched 409 detail names the blocking parent (id, quoted title, and current status), so the dashboard always has something actionable to render. - New bundle-assertion test ``test_dashboard_surfaces_ready_blocked_error_inline`` pins the frontend wiring: the ``parseApiErrorMessage`` helper exists, the drag/drop banner runs through it, and the drawer maintains a visible ``patchErr`` state that's cleared between PATCHes and tasks. 
* docs(codex_app_server): document multi-root Kanban writable_roots (#27941) Update the Codex app-server runtime guide's Kanban section to reflect the new behaviour: * The sandbox override now adds the board DB directory plus every Kanban path the dispatcher pinned (HERMES_KANBAN_WORKSPACES_ROOT, HERMES_KANBAN_WORKSPACE, legacy HERMES_KANBAN_ROOT) -- deduplicated, DB-dir first. * The motivation note now includes the cross-mount artifact-write scenario (e.g. ``/media/.../kanban-workspaces/...`` on a separate drive) and links to issue #27941 so readers can find the original bug report. * fix(gateway): quiet corrupt kanban dispatcher boards Salvages substantive part of #26490 by @aqilaziz. Detects corrupt board DBs ("file is not a database" / "database disk image is malformed") and disables them by fingerprint until they're repaired, instead of flooding the gateway log with repeated logger.exception tracebacks every tick. Cherry-picked the substantive commit (ea5b4ec2a); the tip commit was an unrelated _is_dir OSError fix for service-path lookup. Dropped a small test reformat that was bundled in the same commit. * docs: align kanban readiness docs and smoke tests Salvages #28199 by @bensargotest-sys. Aligns Kanban docs with current tool registration: dispatcher-spawned task workers get task tools, profiles that explicitly enable the kanban toolset get orchestrator routing tools (kanban_list, kanban_unblock). Corrects failure-limit text to current default of 2. Hardens the e2e subprocess script to resolve repo root and use the spawnable default assignee. Updates the diagnostics severity fixture to assert error below the critical threshold. * feat(kanban): surface per-task model_override in show + tool output Salvages #26897 by @loicnico96. The per-task model_override DB column already exists on main, but it wasn't exposed in user-facing surfaces. 
This adds: - 'kanban show' prints 'model: <name>' when model_override is set - kanban_show / kanban_list tool responses include the model_override field Original branch was stale (PR was authored against an older field name 'model'); applied the substantive surface exposure manually using the current 'model_override' field name. * feat(cli): add kanban swarm topology helper Salvages #26791 by @Niraven. Adds 'hermes kanban swarm' to create a durable Kanban Swarm v1 graph: a completed root/blackboard card, parallel worker cards, a verifier gated on all workers, and a synthesizer gated on the verifier. Stores shared swarm blackboard updates as structured JSON comments on the root card. Self-contained: new hermes_cli/kanban_swarm.py module + CLI wiring + unit tests. * feat(kanban): add optional board parameter to all MCP tools Salvages #27598 by @nnnet. Adds optional 'board' parameter to all 9 kanban_* MCP tools via shared _connect helper. Backwards compatible — omitting board keeps current pinned-board behavior. Useful for orchestrator profiles that route across multiple boards. Two-file scope: tools/kanban_tools.py + tests. * feat(kanban): stamp originating ACP session_id on tasks Salvages #23208 by @awizemann. Tracks which chat session created a kanban task so clients can render a per-session board without falling back to tenant + time-window heuristics. - Schema: tasks gains nullable session_id TEXT column with index (additive migration in _migrate_add_optional_columns). - ACP: server.py exposes the originating session id via HERMES_SESSION_ID with save/restore around the agent loop. - Tool: kanban_create reads HERMES_SESSION_ID (with explicit override). - CLI: 'hermes kanban list --session <id>' filter; JSON output exposes session_id. * feat(kanban): wire dispatcher to dispatch review agents from review column Salvages #23772 by @thewillhuang. 
Adds 'review' as a valid kanban task status and extends dispatch_once to monitor the review column as a second dispatch source (in addition to the existing ready column). - Adds 'review' to VALID_STATUSES - Adds claim_review_task() — atomically transitions review → running - Adds has_spawnable_review() — health telemetry mirror - Extends dispatch_once with a review column dispatch loop - Review agents get 'sdlc-review' skill auto-loaded Resolved 2 conflicts (VALID_STATUSES merge with main's 'scheduled' state, test file additions). Adapted claim_review_task to main's ttl_seconds: Optional[int] = None convention (matches claim_task). * feat(kanban): stale detection for running tasks in dispatcher Salvages #23790 by @thewillhuang. Adds detect_stale_running() to the dispatcher cycle. Running tasks that have been started for longer than dispatch_stale_timeout_seconds (default 14400 = 4h) without a heartbeat in the last hour are auto-reclaimed to ready. - New config kanban.dispatch_stale_timeout_seconds (default 14400, 0 disables) - New 'stale' field on DispatchResult - detect_stale_running() in kanban_db.py with heartbeat freshness check - Records outcome='stale' on run close + 'stale' event; ticks failure counter - Wires config through gateway embedded dispatcher - Updates _cmd_dispatch verbose/JSON output and daemon logging Resolved test-file end-of-file conflict by appending both halves. * feat(kanban): filter tasks by workflow fields and runs by status/outcome Salvages #26745 by @nehaaprasaad. Exposes filtering for the existing workflow_template_id and current_step_key columns: - list_tasks() accepts workflow_template_id and current_step_key kwargs - 'hermes kanban list' adds matching CLI flags - dashboard plugin_api also exposes the filters Resolved a small conflict in list_tasks signature alongside main's session_id and order_by additions; combined all three into the single filter list. 
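The stale-detection rule described above (running longer than the timeout, with no heartbeat in the last hour, and a timeout of 0 disabling the check) can be sketched as a pure predicate. This is an illustration only: `is_stale`, its parameter names, and the treatment of a missing heartbeat as "not fresh" are assumptions, not the actual detect_stale_running() implementation.

```python
from typing import Optional

DEFAULT_STALE_TIMEOUT_SECONDS = 14400  # 4h default from the commit message
HEARTBEAT_FRESH_SECONDS = 3600         # "in the last hour"


def is_stale(
    started_at: int,
    last_heartbeat_at: Optional[int],
    now: int,
    timeout_seconds: int = DEFAULT_STALE_TIMEOUT_SECONDS,
) -> bool:
    """Return True when a running task should be reclaimed to ready."""
    if timeout_seconds <= 0:
        return False  # 0 disables stale detection
    if now - started_at <= timeout_seconds:
        return False  # has not been running long enough
    if last_heartbeat_at is not None and now - last_heartbeat_at < HEARTBEAT_FRESH_SECONDS:
        return False  # a recent heartbeat shows the worker is still alive
    return True
```

Keeping the rule as a pure function of timestamps makes it straightforward to unit-test without a database or a running dispatcher.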
* feat(kanban): add respawn guard to block repeat worker storms

  Salvages #27484 by @fardoche6. Adds a respawn guard that skips worker spawn for tasks where:
  - a recent run already succeeded (recent_success, within guard window)
  - the previous run hit a quota/auth error (blocker_auth, also auto-blocks)
  - a recent task comment includes a GitHub PR URL (active_pr)

  The guard prevents repeat worker storms on the same bug/task. Includes the contributor's review-findings fixup (regex hardening, observability, auth coverage).

  Resolved a small DispatchResult conflict alongside main's 'stale' field; kept both. Authorship preserved via rebase merge.

* feat(kanban): show dashboard cron jobs across profiles

  Salvages #27568 by @SerenityTn. Dashboard cron page now lists cron jobs from all profiles, with profile-aware filter UI and storage routing. Includes test coverage for cross-profile listing, mutation, deletion, and validation.

  Also fixes orphan conflict markers in config.py left by an earlier salvage merge (kanban.dispatch_stale_timeout_seconds was double-nested in HEAD/PR markers from #28452 salvage of #23790).

* fix(kanban): remove orphan conflict markers from config.py (#28458)

  PR #28452 (salvage of #23790, stale detection) merged with leftover git conflict markers in hermes_cli/config.py around the `dispatch_stale_timeout_seconds` config block, breaking config import and any code path that loads it.

  Cleans up the markers and keeps both config blocks (worker log rotation/orchestrator + stale detection). Resolves a self-introduced regression.
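Leftover conflict markers like the ones described above are cheap to detect mechanically before a merge lands. The following sketch shows one possible check; `find_conflict_markers` is a hypothetical helper, not something this PR adds, and a bare `=======` line can also be a legitimate heading underline in some text formats, so hits outside source files may be false positives.

```python
import re
from typing import List

# Git writes conflict markers as exactly seven '<', '=' or '>' characters at
# the start of a line; the '<' and '>' forms are followed by a space and a label.
_MARKER = re.compile(r"^(?:<{7}|>{7})(?: |$)|^={7}$")


def find_conflict_markers(text: str) -> List[int]:
    """Return the 1-based line numbers that look like git conflict markers."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if _MARKER.match(line)
    ]
```

Run over a file's contents, a non-empty result is a reasonable signal to fail a pre-merge check.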
* fix(kanban): remove orphan conflict markers from kanban.py (#28459) PR #28454 (salvage of #26745, workflow filter) merged with leftover git conflict markers in hermes_cli/kanban.py at three sites: - _task_to_dict() (session_id alongside workflow_template_id/current_step_key) - p_list parser (--sort alongside --workflow-template-id/--step-key) - _cmd_list (order_by alongside the new filter kwargs) Cleans up the markers and keeps both halves at each site. Resolves a self-introduced regression. * feat(kanban): configure worktree paths and branches Salvages #26496 by @aqilaziz. Adds branch_name column + CLI flag so tasks with workspace_kind='worktree' can pin a target branch on create. Schema migration added to _migrate_add_optional_columns. - Task.branch_name field + DB column + migration - create_task accepts branch_name kwarg - hermes kanban create --branch <name> flag - kanban show output includes 'Branch: <name>' when set Cherry-picked the substantive commit (a7558cf27); the PR's tip was an unrelated service-path-dirs commit. Resolved 2 INSERT-column-list and show-output conflicts alongside main's session_id and max_runtime_seconds additions; kept all three. * feat(skills): add skill bundles — alias /<name> loads multiple skills (#28373) Skill bundles are tiny YAML files in ~/.hermes/skill-bundles/ that group several skills under one slash command. Invoking /<bundle-name> from any surface (CLI, TUI, dashboard, any gateway platform) loads every referenced skill into a single combined user message. Use cases: - /backend-dev → loads github-code-review + test-driven-development + github-pr-workflow as one bundle. - /research → loads several research skills together. - Team task profiles shared via dotfiles. Behavior: - Bundles take precedence over individual skills when slugs collide. - Missing skills are skipped with a note, not fatal. - No system-prompt mutation — bundles generate a fresh user message at invocation time, the same way /<skill> does. 
Prompt cache stays intact. - Works in CLI dispatch, gateway dispatch, autocomplete (CLI + TUI), /help display. Schema (~/.hermes/skill-bundles/<slug>.yaml): name: backend-dev description: Backend feature work. skills: - github-code-review - test-driven-developme…

… ops

Three install.ps1 improvements pulled from the thin-installer work on bb/gui (PR NousResearch#27822) that benefit the canonical CLI install flow on main:

1. Strip UTF-8 BOM from scripts/install.ps1.

   The canonical 'irm <raw URL> | iex' install flow has been broken since commit 4279da4 re-introduced a UTF-8 BOM that PR NousResearch#27224 had explicitly stripped. PowerShell 5.1's 'irm' returns the response body as a string with the BOM surviving as a leading \ufeff character; 'iex' then evaluates that string and the parser chokes on the invisible character before param(), surfacing as a cascade of 'The assignment expression is not valid' errors at every param default value.

   File body is verified pure ASCII (no character above byte 127), so PS 5.1 with no BOM falls back to Windows-1252 decoding which is identical to ASCII for our content. Both install paths work:
   - 'irm ... | iex' (canonical one-liner)
   - 'powershell -File install.ps1' (programmatic / desktop bootstrap)

2. New -Commit and -Tag string params for reproducible pinning.

   Higher-precedence variants of -Branch. When set, the repository stage clones $Branch (fast partial fetch) and then 'git checkout's the exact ref. Precedence: Commit > Tag > Branch. Honoured by all three code paths:
   - Update path (existing valid checkout): fetch + checkout --detach <commit|tag> instead of checkout + pull.
   - Fresh clone: clone --branch $Branch, then post-clone 'git checkout --detach' to the requested ref.
   - ZIP fallback: pick archive URL for the most-specific ref (commit -> archive/<sha>.zip, tag -> archive/refs/tags/<tag>.zip, else archive/refs/heads/<branch>.zip).

   Used by the Hermes desktop's first-launch bootstrap to pin the .exe to the exact commit it was built against, so the cloned Hermes Agent tree always matches what the .exe was tested with. Also enables release-bundle pinning (e.g. Microsoft Store builds pinning to a release tag) and CI reproducibility.

3. EAP=Continue wrap around the new pin-step git invocations.

   'git fetch origin <commit>' writes the routine 'From <url>' info line to stderr. Under the script's global $ErrorActionPreference = 'Stop' that stderr line is wrapped as an ErrorRecord and terminates the script even though fetch+checkout actually succeed. Same EAP=Stop + native-stderr footgun we hit during the install.ps1 hardening pass in Install-Uv, Test-Python, _Run-NpmInstall.

   Wrap both the update-path fetch/checkout block AND the post-clone pin block in $ErrorActionPreference = 'Continue' (restored in finally). Real failures still caught by $LASTEXITCODE checks.
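The Commit > Tag > Branch precedence and the ZIP-fallback URL selection described above can be sketched in a few lines. This is a Python illustration of the rule, not the PowerShell in install.ps1; the function names `resolve_ref` and `archive_url` are hypothetical, and only the three archive path shapes come from the commit message.

```python
from typing import Optional, Tuple


def resolve_ref(branch: str, tag: Optional[str] = None, commit: Optional[str] = None) -> Tuple[str, str]:
    """Pick the most specific ref: commit beats tag, tag beats branch."""
    if commit:
        return ("commit", commit)
    if tag:
        return ("tag", tag)
    return ("branch", branch)


def archive_url(repo_url: str, branch: str, tag: Optional[str] = None, commit: Optional[str] = None) -> str:
    """Build the ZIP archive URL for the resolved ref."""
    kind, ref = resolve_ref(branch, tag, commit)
    if kind == "commit":
        path = f"archive/{ref}.zip"
    elif kind == "tag":
        path = f"archive/refs/tags/{ref}.zip"
    else:
        path = f"archive/refs/heads/{ref}.zip"
    return f"{repo_url.rstrip('/')}/{path}"
```

For example, passing both a tag and a commit yields the commit archive, which matches the stated precedence; passing neither falls back to the branch head.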

Three issues flagged by the Copilot review on this PR:

1. Double JSON emit on stage failure (Copilot NousResearch#1, NousResearch#2).

   When -Stage <name> ran a worker that threw, Invoke-Stage's finally emitted a JSON result frame AND the entry-point catch emitted a second error frame -- producing two concatenated JSON objects on stdout and breaking the one-line-per-invocation contract that drivers parse against. Same issue applied to -Json mode on a full install (every stage's finally plus a final error frame missing duration_ms/skipped).

   Fix: Invoke-Stage's finally now sets $script:_StageEmittedErrorFrame when it emits a failure frame; the entry-point catch checks the flag and skips its own emit, still exit 1.

2. $prevEAP uninitialized on early try-block throw (Copilot NousResearch#3).

   In Install-Uv, Test-Python, Test-Node's winget fallback, _Run-NpmInstall, and the playwright block, '$prevEAP = $ErrorActionPreference' lived as the first statement INSIDE the try. If anything between 'try {' and that line threw (Write-Info on an unusual host, the npx-finding loop, etc.), the catch's 'if ($prevEAP) { ... }' restore was a no-op and EAP could remain relaxed.

   Fix: hoist '$prevEAP = $ErrorActionPreference' to the line immediately before 'try {' in all five sites. Catch's restore is now always meaningful regardless of where in the try the throw originated.

No change to Invoke-Stage's success path or to the four lint-clean EAP sites (Test-Node was the only winget-related catch). All 19 metadata smoke tests still pass.
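To see why the double emit mattered, consider how a driver might consume a stage result. The sketch below is a hypothetical driver-side parser, not code from this PR. The name `parse_stage_frame` and the `stage` and `ok` fields in the usage note are assumptions; `duration_ms` and `skipped` are the only field names mentioned in the commit message.

```python
import json
from typing import Any, Dict


def parse_stage_frame(stdout: str) -> Dict[str, Any]:
    """Parse the single JSON frame a stage invocation is expected to print."""
    lines = [line for line in stdout.splitlines() if line.strip()]
    if len(lines) != 1:
        raise ValueError(f"expected exactly one JSON frame, got {len(lines)} lines")
    # json.loads raises JSONDecodeError (a ValueError subclass) with
    # "Extra data" when two objects are concatenated on the same line.
    return json.loads(lines[0])
```

A well-formed frame such as `{"stage": "uv", "ok": false, "duration_ms": 12, "skipped": false}` parses cleanly, while two frames (on one line or on two) raise ValueError. A strict driver written this way would have rejected the pre-fix output on every stage failure.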

… ops

Three install.ps1 improvements pulled from the thin-installer work on bb/gui (PR NousResearch#27822) that benefit the canonical CLI install flow on main:

1. Strip UTF-8 BOM from scripts/install.ps1.
   The canonical 'irm <raw URL> | iex' install flow has been broken since commit 4279da4 re-introduced a UTF-8 BOM that PR NousResearch#27224 had explicitly stripped. PowerShell 5.1's 'irm' returns the response body as a string with the BOM surviving as a leading \ufeff character; 'iex' then evaluates that string and the parser chokes on the invisible character before param(), surfacing as a cascade of 'The assignment expression is not valid' errors at every param default value.
   File body is verified pure ASCII (no character above byte 127), so PS 5.1 with no BOM falls back to Windows-1252 decoding, which is identical to ASCII for our content. Both install paths work:
   - 'irm ... | iex' (canonical one-liner)
   - 'powershell -File install.ps1' (programmatic / desktop bootstrap)

2. New -Commit and -Tag string params for reproducible pinning.
   Higher-precedence variants of -Branch. When set, the repository stage clones $Branch (fast partial fetch) and then 'git checkout's the exact ref. Precedence: Commit > Tag > Branch. Honoured by all three code paths:
   - Update path (existing valid checkout): fetch + checkout --detach <commit|tag> instead of checkout + pull.
   - Fresh clone: clone --branch $Branch, then post-clone 'git checkout --detach' to the requested ref.
   - ZIP fallback: pick archive URL for the most-specific ref (commit -> archive/<sha>.zip, tag -> archive/refs/tags/<tag>.zip, else archive/refs/heads/<branch>.zip).
   Used by the Hermes desktop's first-launch bootstrap to pin the .exe to the exact commit it was built against, so the cloned Hermes Agent tree always matches what the .exe was tested with. Also enables release-bundle pinning (e.g. Microsoft Store builds pinning to a release tag) and CI reproducibility.

3. EAP=Continue wrap around the new pin-step git invocations.
   'git fetch origin <commit>' writes the routine 'From <url>' info line to stderr. Under the script's global $ErrorActionPreference = 'Stop', that stderr line is wrapped as an ErrorRecord and terminates the script even though fetch+checkout actually succeed. This is the same EAP=Stop + native-stderr footgun we hit during the install.ps1 hardening pass in Install-Uv, Test-Python, and _Run-NpmInstall.
   Wrap both the update-path fetch/checkout block AND the post-clone pin block in $ErrorActionPreference = 'Continue' (restored in finally). Real failures are still caught by $LASTEXITCODE checks.
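
The Commit > Tag > Branch precedence and the ZIP-fallback archive paths listed above can be sketched as a small resolver. Resolve-PinnedRef and its returned fields are hypothetical names for illustration; the archive paths are the relative forms quoted in the commit message and would be appended to the repository URL.

```powershell
# Sketch only: Resolve-PinnedRef and the Kind/Ref/ArchivePath fields are
# hypothetical, not taken from install.ps1.
function Resolve-PinnedRef {
    param(
        [string]$Branch = 'main',
        [string]$Tag = '',
        [string]$Commit = ''
    )
    # Most specific ref wins: Commit > Tag > Branch.
    if ($Commit) {
        return @{ Kind = 'commit'; Ref = $Commit; ArchivePath = "archive/${Commit}.zip" }
    }
    if ($Tag) {
        return @{ Kind = 'tag'; Ref = $Tag; ArchivePath = "archive/refs/tags/${Tag}.zip" }
    }
    return @{ Kind = 'branch'; Ref = $Branch; ArchivePath = "archive/refs/heads/${Branch}.zip" }
}

# Usage: a commit pin overrides both the tag and the branch.
$pin = Resolve-PinnedRef -Branch 'main' -Tag 'v1.0' -Commit 'abc123'
```

Keeping the precedence in one place would let the update path, the fresh-clone path, and the ZIP fallback agree on which ref to use.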

alt-glitch added the labels type/refactor (Code restructuring, no behavior change), P2 (Medium: degraded but workaround exists), and comp/cli (CLI entry point, hermes_cli/, setup wizard) May 17, 2026

github-actions Bot mentioned this pull request May 17, 2026

🦞 OpenClaw Ecosystem Daily 2026-05-17 ivanweng2077/big_model_radar#55

Open

jquesnelle force-pushed the jq/install-ps1-stage-protocol branch from 60c881e to 5165819 Compare May 17, 2026 04:47

jquesnelle marked this pull request as ready for review May 17, 2026 05:05

OutThisLife requested a review from Copilot May 17, 2026 05:09

Copilot started reviewing on behalf of OutThisLife May 17, 2026 05:09

Copilot AI reviewed May 17, 2026

This comment was marked as resolved.

alt-glitch reviewed May 17, 2026

alt-glitch previously requested changes May 17, 2026

alt-glitch reviewed May 17, 2026

NousResearch deleted a comment from github-actions Bot May 17, 2026

NousResearch deleted a comment from cardtest15-coder May 17, 2026

NousResearch deleted a comment from OutThisLife May 17, 2026

alt-glitch previously requested changes May 17, 2026

teknium1 merged commit fb138d9 into main May 17, 2026
10 checks passed

teknium1 deleted the jq/install-ps1-stage-protocol branch May 17, 2026 05:55

alt-glitch mentioned this pull request May 18, 2026

refactor(bootstrap): consolidate ACP browser bootstrap into install.{sh,ps1} #26668

Closed

alt-glitch mentioned this pull request May 18, 2026

Bootstrap consolidation: pip/uvx browser setup broken + Windows dep_ensure missing #27826

Closed

4 tasks

jquesnelle mentioned this pull request May 18, 2026

feat(install.ps1): strip BOM, add -Commit/-Tag pin params, harden git ops #28169

Merged

convert install.ps1 to new bootstrap protocol #27224
teknium1 merged 4 commits into main from jq/install-ps1-stage-protocol

4 participants

Conversation

jquesnelle commented May 17, 2026 (edited)

Copilot AI left a comment

Pull request overview

Reviewed changes

This comment was marked as resolved.

alt-glitch left a comment

alt-glitch left a comment

🐛 Request Changes: Double JSON on failed stages

alt-glitch left a comment

alt-glitch left a comment

Review — PR #27224: install.ps1 stage protocol + Windows clean-VM hardening

Testing performed

What's good

Bugs — see inline comments

Adversarial findings rejected (verified false on PS 5.1)

github-actions Bot commented May 17, 2026 (edited)

🔎 Lint report: jq/install-ps1-stage-protocol vs origin/main

ruff

ty (type checker)
