Skip to content

fix: handle whitespace-only cron responses#28151

Closed
joe102084 wants to merge 1 commit into
NousResearch:mainfrom
joe102084:fix/cron-whitespace-empty-response
Closed

fix: handle whitespace-only cron responses#28151
joe102084 wants to merge 1 commit into
NousResearch:mainfrom
joe102084:fix/cron-whitespace-empty-response

Conversation

@joe102084

Copy link
Copy Markdown
Contributor

Summary

  • Treat whitespace-only cron final responses the same as empty responses.
  • Avoid delivering blank cron messages to users.
  • Mark those runs as soft failures so broken/model-empty jobs remain visible in cron status.

Test Plan

  • scripts/run_tests.sh tests/cron/test_scheduler.py::TestSilentDelivery -v --tb=short
  • scripts/run_tests.sh tests/cron/test_scheduler.py -q

Note: local working tree had a pre-existing uncommitted ui-tui/package-lock.json diff; this PR only includes cron/scheduler.py and tests/cron/test_scheduler.py.

@BoardJames-Bot BoardJames-Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed by Hermes Agent. Whitespace-only cron final responses now suppress blank delivery and are marked as soft failures; existing silent-delivery behavior is preserved. Focused local test passed (tests/cron/test_scheduler.py::TestSilentDelivery: 7 passed). No blockers found.

@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/cron Cron scheduler and job management labels May 18, 2026
teknium1 added a commit that referenced this pull request May 19, 2026
…28328)

Pre-stages AUTHOR_MAP entries for 10 new contributors whose PRs are being
salvaged in the May 2026 low-hanging-fruit batch (group 8). Lands ahead
of the per-PR salvage PRs so they don't get blocked by AUTHOR_MAP CI.

Contributors:
- AceWattGit (#28159 — _pool_may_recover_from_rate_limit NameError)
- YuanHanzhong (#28032 — x.com/status fallbacks link-like)
- colin-chang (#28245, #28249, #28251 — gateway + mattermost fixes)
- felix-windsor (#28019 — preserve cron asterisks in strip mode)
- houenyang-momo (#28205 — charizard completion menu contrast)
- iqdoctor (#28095 — windows installer docs)
- joe102084 (#28151 — whitespace-only cron responses)
- jvinals (#27936 — Slack U-IDs → DM channel)
- maxmilian (#28267 — ModelPickerDialog portal)
- samggggflynn (#27952 — dingtalk pre_start)

Per references/batch-pr-salvage-may14-additions.md.
@teknium1

Copy link
Copy Markdown
Contributor

Merged via PR #28352 (cherry-picked onto current main with your authorship preserved via rebase-merge — commit 6143013). Thanks for the contribution!

@teknium1 teknium1 closed this May 19, 2026
bot-ted added a commit to bot-ted/hermes-agent that referenced this pull request May 19, 2026
* fix(wecom): handle WSMsgType.CLOSING to prevent CPU spin

The WeCom adapter's _read_events() loop only handled CLOSE, CLOSED,
and ERROR websocket message types. When the server initiates a graceful
shutdown, aiohttp returns WSMsgType.CLOSING before the connection is
fully closed. This message type was not handled, causing the receive()
call to return immediately in a tight loop while self._ws.closed
remained False. The result was 100% CPU usage on the asyncio event loop.

Add WSMsgType.CLOSING to the set of terminal message types that raise
RuntimeError("WeCom websocket closed"), allowing _listen_loop() to
enter its normal reconnect backoff path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(auth): treat empty credential pool entries as unauthenticated

Fixes #28140

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: include hermes_plugins in gateway.log component filter

gateway.log uses a _ComponentFilter that only passes records from
loggers starting with ('gateway',). Plugin modules are loaded under
the hermes_plugins.* namespace, so all plugin log output is silently
dropped from gateway.log.

This makes plugin registration — which directly affects gateway hooks
(pre_gateway_dispatch, transform_llm_output, etc.) — invisible in
the gateway-specific log. Operators debugging gateway behavior check
gateway.log and see no plugin activity, even when plugins are working
correctly.

Add 'hermes_plugins' to the gateway component prefixes tuple so
plugin log messages appear in gateway.log.

Closes #28138

* fix(gateway): align kanban artifact _IMAGE_EXTS with response dispatch

_deliver_kanban_artifacts used a broader _IMAGE_EXTS that included
.bmp, .tiff, and .svg. These three extensions are absent from the
equivalent set in _deliver_media_from_response (line 10661), which
intentionally routes them through send_document rather than
send_multiple_images (comment near line 10522 notes that Telegram
sendPhoto recompresses and rejects non-raster formats).

Routing .svg (XML text), .bmp, or .tiff through the photo API causes
send_multiple_images to raise on most platforms; the exception is caught
and logged as a warning, silently dropping the artifact. Aligning the
two sets ensures kanban deliverables with these extensions follow the
same send_document path as regular agent responses.

No behaviour change for .png/.jpg/.jpeg/.gif/.webp.

* fix(process-registry): detach stdin from background subprocesses to prevent keyboard freeze

Background process non-PTY path used stdin=subprocess.PIPE unconditionally,
creating an orphan pipe that was never written to and never closed. Child
processes that read stdin would block indefinitely, competing with the
parent's prompt_toolkit event loop for terminal ownership and causing
complete keyboard lockout.

Change to stdin=subprocess.DEVNULL so children get immediate EOF on stdin
reads instead of blocking forever. For interactive stdin, the PTY path
(which has its own independent PTY via ptyprocess.PtyProcess.spawn) should
be used instead.

Fixes #17959

* chore(release): alias stale-ID salvage commit for @LifeJiggy (#28317)

* fix(process-registry): detach stdin from background subprocesses to prevent keyboard freeze

Background process non-PTY path used stdin=subprocess.PIPE unconditionally,
creating an orphan pipe that was never written to and never closed. Child
processes that read stdin would block indefinitely, competing with the
parent's prompt_toolkit event loop for terminal ownership and causing
complete keyboard lockout.

Change to stdin=subprocess.DEVNULL so children get immediate EOF on stdin
reads instead of blocking forever. For interactive stdin, the PTY path
(which has its own independent PTY via ptyprocess.PtyProcess.spawn) should
be used instead.

Fixes #17959

* chore(release): alias stale-ID salvage commit for LifeJiggy

PR #28315 was salvaged with a wrong noreply numeric ID (192385615 vs
the correct 141562589). The commit on main is correctly authored to
LifeJiggy by username, but the noreply email doesn't match AUTHOR_MAP.
Adds an alias so release-notes generation maps both forms to the same
contributor.

---------

Co-authored-by: LifeJiggy <192385615+LifeJiggy@users.noreply.github.com>

* fix: elevate plugin discovery failures from debug to warning

Plugin discovery exceptions in gateway startup (gateway/run.py) and
CLI startup (hermes_cli/main.py) are caught and logged at DEBUG
level, making them invisible at the default INFO log level.

If any plugin import fails — syntax error, missing dependency, import
cycle — operators get zero indication unless they bump the log level
to DEBUG. This makes broken plugins appear enabled but silently
non-functional.

Change both locations to logger.warning() so failures are visible at
production log levels.

Closes #28137

* fix: treat inline-shell timeout guard as timeout

* fix(acp): resolve /tmp symlink before workspace auto-approve check on macOS

Path.resolve() follows the /tmp -> /private/tmp symlink on macOS, so
str(path).startswith("/tmp/") is always False for temp-dir paths.
The "Accept Edits" (workspace_session) mode silently refused to
auto-approve every /tmp write on macOS, breaking the documented
behaviour and making the existing test fail on this platform.

Fix: keep the raw expanded path (pre-resolve) for the /tmp prefix
check and continue using the resolved form only for the cwd
relative_to() call where symlink resolution is correct behaviour.

* fix(kanban): single-row horizontal scroll for board columns

Switch .hermes-kanban-columns from auto-fit CSS grid to a flex row with
overflow-x: auto and a hidden scrollbar (scrollbar-width / ::-webkit-
scrollbar), and pin .hermes-kanban-column to flex: 0 0 280px so columns
sit side-by-side at a fixed width instead of wrapping into a 2xN grid.

Page vertical scroll is unaffected: each column already caps at
max-height: calc(100vh - 220px), so the container never grows tall
enough to introduce its own vertical scrollbar.

* fix(approval): surface pending-approval state with explicit marker visible to LLM

When a tool call requires user approval in the non-blocking gateway path,
the LLM previously received a result that was indistinguishable from a
failed tool call (exit_code=-1, error=message). The LLM could not tell
whether the tool was pending approval, had returned empty results, or had
failed silently — causing it to burn context on wrong hypotheses.

Fix changes the result format to include:
- status: pending_approval (clear state name)
- approval_pending: True (explicit boolean for LLMs to detect)
- error: cleared to empty string (removes misleading error signal)

This lets the LLM reason about approval latency vs actual errors,
short-circuiting the previous silent failure mode.

Fixes #14806

* fix: recognize emoji and caret as natural response endings

GLM models via Ollama report finish_reason='stop' even when the
response was truncated by max_tokens. The continuation mechanism
uses _has_natural_response_ending() as one of the heuristics to
detect whether the response was genuinely finished.

Currently only ASCII punctuation and CJK punctuation are recognized.
This means any response ending with an emoji (e.g. ⚡, 👍) or the
caret character ^ (common in French ^^ smiley) is not recognized as
naturally ended, triggering a false-positive continuation where the
model receives 'Continue where you left off' and produces garbled
output.

Add:
- ^ (caret) to the punctuation set
- Unicode emoji range (codepoint >= 0x1F300) as natural ending

This only affects GLM/Ollama users but the fix is safe for all
backends since _has_natural_response_ending() is only consulted
inside the continuation flow.

* chore(release): pre-stage AUTHOR_MAP for May 2026 LHF batch group 8 (#28328)

Pre-stages AUTHOR_MAP entries for 10 new contributors whose PRs are being
salvaged in the May 2026 low-hanging-fruit batch (group 8). Lands ahead
of the per-PR salvage PRs so they don't get blocked by AUTHOR_MAP CI.

Contributors:
- AceWattGit (#28159 — _pool_may_recover_from_rate_limit NameError)
- YuanHanzhong (#28032 — x.com/status fallbacks link-like)
- colin-chang (#28245, #28249, #28251 — gateway + mattermost fixes)
- felix-windsor (#28019 — preserve cron asterisks in strip mode)
- houenyang-momo (#28205 — charizard completion menu contrast)
- iqdoctor (#28095 — windows installer docs)
- joe102084 (#28151 — whitespace-only cron responses)
- jvinals (#27936 — Slack U-IDs → DM channel)
- maxmilian (#28267 — ModelPickerDialog portal)
- samggggflynn (#27952 — dingtalk pre_start)

Per references/batch-pr-salvage-may14-additions.md.

* fix: add pre_start() to _IncomingHandler for dingtalk SDK compatibility

The dingtalk-stream SDK calls pre_start() on every registered handler
before opening the WebSocket connection. Without this method, the SDK
raises AttributeError and kills the stream connection, causing DingTalk
to be unable to connect via Stream Mode.

* fix(windows): handle redirected stdout in _cprint fallback

Wraps _pt_print in try/except with a print() fallback. When a
kanban worker's stdout is piped to a log file, prompt_toolkit
raises NoConsoleScreenBufferError (Windows) or OSError (other)
because there is no real console buffer. The fallback keeps
worker output flowing instead of crashing.

* chore(release): alias stale-ID salvage commit for @Grogger (#28334)

PR #28330 was salvaged with a wrong noreply numeric ID (18091625 vs
the correct 7065068). The commit on main is correctly authored to
Grogger by username, but neither noreply form was in AUTHOR_MAP.
Adds both so release-notes generation maps them to @Grogger.

* fix(aux): remove stale session_search model menu entry

* fix(tui): keep x status citation fallbacks link-like

* fix(xai-oauth): quarantine dead tokens on terminal refresh failure

resolve_xai_oauth_runtime_credentials() called _refresh_xai_oauth_tokens()
with no try/except. A terminal refresh failure (HTTP 400/401/403 —
invalid_grant, token revoked) propagated without clearing the dead
access_token / refresh_token from auth.json, causing every subsequent
session to retry the same doomed network request.

Add a try/except around the refresh call that mirrors the existing
credential_pool.py quarantine: when _is_terminal_xai_oauth_refresh_error
identifies a non-retryable failure, clear the dead token fields from
auth.json and write a last_auth_error diagnostic marker so future calls
fail fast with a clear relogin_required error instead of hitting the
network.

active_provider is preserved (set_active=False) so multi-provider users
whose chosen provider is not xai-oauth are unaffected.

Tests: two new cases in test_auth_xai_oauth_provider.py cover terminal
quarantine and transient pass-through.

* feat(bg-review): add bundled/pinned skill protection rules to review prompts (#27644)

The background review prompts (_SKILL_REVIEW_PROMPT and
_COMBINED_REVIEW_PROMPT) now include explicit protection rules
for bundled, hub-installed, and pinned skills — aligning with
the curator's existing policy at curator.py L345/350.

Before this change, bg-review could freely rewrite bundled skills
like 'hermes-agent' or pinned skills, while the 7-day curator
explicitly skips them.

The review agent now sees:
  • Bundled skills (shipped with Hermes)
  • Hub-installed skills (installed via hermes skills install)
  • Pinned skills (marked via hermes curator pin)
If only protected skills need updating, the review says
'Nothing to save.' and stops.

Fixes #27644

* fix(web): portal Change Model modal so it renders above the app sidebar

The dashboard's main column is `relative z-2` (App.tsx), which creates a
stacking context that traps fixed descendants below the app sidebar
(`z-50`). `ModelPickerDialog` renders `fixed inset-0 z-[100]` inline,
so its z-100 is scoped to z-2 and the sidebar covers its left edge.

The bug is visible across all themes but only obvious in the Large theme
variants (Hermes Teal (Large), etc.) where the larger root font widens
the dialog into the sidebar's column. Toast.tsx already documents the
same trap and uses the same `createPortal(..., document.body)` escape.

This commit ports the picker; the same pattern affects other inline
z-[100] modals in the dashboard (OAuthLoginModal, Cron / Models /
Profiles page modals) and is left for a follow-up — keeping this PR
scoped to the reporter's specific case.

Fixes #28103

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(gateway): exit code 75 on service restart so launchd relaunches

When the gateway receives SIGUSR1 (graceful restart via launchd_restart),
the SIGUSR1 handler calls request_restart(via_service=True) and the
gateway shuts down cleanly with exit code 0.

However, the generated launchd plist uses KeepAlive → SuccessfulExit →
false, meaning launchd only relaunches on *non-zero* exit codes.  A
clean exit(0) is treated as "successful, don't restart", so the
gateway stays down after /restart, /update, or SIGUSR1.

The systemd unit template already uses RestartForceExitStatus=75 for the
same scenario.  Mirror that convention: when _restart_via_service is
True, raise SystemExit(75) so launchd's SuccessfulExit=false policy
triggers a relaunch.

Closes #28135

* fix: guard json.loads() against invalid TTS and skill_view responses

Two code paths call json.loads() on output from external tools without
catching JSONDecodeError. If the tool returns a non-JSON string (error
message, empty string, or None), the entire call path crashes.

1. gateway/run.py — text_to_speech_tool() result in voice reply path.
   A TTS failure that returns an error string instead of JSON crashes
   the voice reply handler, killing the message response entirely.

2. cron/scheduler.py — skill_view() result when loading skills for
   cron jobs. A corrupted or missing skill file that returns an error
   string instead of JSON crashes the cron tick, preventing all jobs
   from executing that cycle.

Both fixes catch (json.JSONDecodeError, TypeError), log a warning,
and gracefully skip the failed operation instead of crashing.

* fix(gateway): bridge gateway_restart_notification from YAML platform sections

Two related bugs in gateway/config.py prevented per-platform
gateway_restart_notification from working through config.yaml:

1. The shared-key bridging loop (load_gateway_config) omitted
   'gateway_restart_notification', so the key never landed in
   platform_data['extra'] even when set under e.g. 'discord:' or
   'mattermost:' sections.

2. PlatformConfig.from_dict() only read gateway_restart_notification
   from the top-level data dict, ignoring the 'extra' sub-dict where
   bridged keys are stored.

Fix: add the key to the bridging loop, and add an 'extra' fallback
in from_dict() so that round-tripped values (YAML → bridged → extra
→ from_dict) resolve correctly.

Impact: users can now set gateway_restart_notification: false per
platform in config.yaml instead of relying on env vars or the
global platforms: block.

* feat(kanban): add auto_promote_children config toggle

When the kanban auto-decomposer fans a triage task into child tasks,
recompute_ready() immediately promotes parent-free children to 'ready'
so the dispatcher picks them up. Some users want a manual workflow
where children stay in 'todo' for review before dispatch.

Add 'kanban.auto_promote_children' config key (default: true):
- false: children stay in 'todo' after decomposition
- true: existing behavior (auto-promote to 'ready')

Changes:
- kanban_db.py: decompose_triage_task() gains auto_promote param
- kanban_decompose.py: reads auto_promote_children from config
- kanban dashboard API: exposes the new setting in GET/PUT /orchestration

Closes #28016

* fix: wrap _pool_may_recover_from_rate_limit call through run_agent namespace

The conversation_loop.py references _pool_may_recover_from_rate_limit which
was defined in run_agent.py. After the conversation-loop extraction refactor,
the helper was no longer in the same module scope. Wrap the call as
_ra()._pool_may_recover_from_rate_limit() to route through the run_agent
monkeypatch namespace where the helper is available.

Adds regression test in test_gemini_fast_fallback.py.

Fixes: MAILROOM Email Triage NameError, OPS Execution Monitor NameError.

* fix(tui): improve charizard completion menu contrast

* docs(windows): avoid piping installer directly into iex

* fix(agent): add qwen and deepseek to TOOL_USE_ENFORCEMENT_MODELS

Qwen3.x and DeepSeek-V3.x default to chatty/hallucinatory tool use without
enforcement steering — agents narrate "calling tool X" without actually
emitting a tool call, or run partial loops. Both model families fit the
same failure pattern TOOL_USE_ENFORCEMENT_GUIDANCE was already injected
for (gpt, codex, gemini, gemma, grok, glm).

Co-authored-by: briandevans <252620095+briandevans@users.noreply.github.com>

Squashed salvage of:
- 403e567ce fix(agent): add qwen and deepseek to TOOL_USE_ENFORCEMENT_MODELS
- 9433eabe7 test(agent): use realistic qwen-plus identifier in enforcement test

Fixes #28079.

* fix(send_message): resolve Slack user IDs to DM channel IDs

The _SLACK_TARGET_RE regex only matched IDs starting with C (channel),
G (group), or D (direct message). Slack user IDs start with U, causing
'Could not resolve' errors when trying to send DMs to specific users.

Changes:
- Expand _SLACK_TARGET_RE to accept U-prefixed IDs (user IDs)
- Add conversations.open fallback to resolve user IDs to DM channel
  IDs before sending, since chat.postMessage requires a conversation ID

Fixes #ISSUE_NUMBER

* fix(gateway): tighten MEDIA extraction regex + silent skip on file-not-found

Three related fixes for the MEDIA:<path> extraction pipeline that
caused 'file not found' noise in platform channels:

1. run.py — tighten tool-result MEDIA regex from \S+ (any non-
   whitespace) to require a path pattern with known extensions.
   Prevents LLM-generated placeholder paths like
   'MEDIA:/path/to/example.mp4' from being captured as real media.

2. base.py — remove the |\S+ fallback in extract_media() that
   catches anything non-whitespace as a potential MEDIA path.
   This was the primary cause of false positives — strings like
   '' in tool output were captured as MEDIA: paths.

3. mattermost.py — replace the file-not-found error message sent
   to the channel with a silent logger.warning() skip. When a
   path extracted by MEDIA doesn't exist on disk, the channel
   no longer gets a noisy '(file not found: ...)' message.

Impact: eliminates the persistent 'file not found' spam in
Mattermost channels caused by over-broad MEDIA regex patterns
matching non-path text in tool output.

* fix(xai-oauth): split 403 (tier/entitlement) from 400/401 in token endpoint

xAI's token endpoint returns HTTP 403 to the OAuth grant when the
account isn't on the allowlist for API access (e.g. standard
SuperGrok subscribers — see #26847). Treating it like a stale-token
400/401 made ``format_auth_error`` append "Run ``hermes model`` to
re-authenticate", which is misleading because re-login can't change
xAI's tier decision.

Split 403 off in both ``refresh_xai_oauth_pure`` and the loopback
login token exchange:

* New error code ``xai_oauth_tier_denied`` with ``relogin_required=False``
* Message explains the entitlement gate and points at the
  ``XAI_API_KEY`` + ``provider: xai`` fallback
* 400/401 still set ``relogin_required=True`` as before
* 5xx still set ``relogin_required=False`` as before

* fix(run-agent): treat any 403 on xai-oauth as entitlement to stop refresh-loop

The existing ``_is_entitlement_failure`` heuristic only fires when
the response body contains specific substrings ("do not have an
active Grok subscription", etc.). xAI has been seen to 403 standard
SuperGrok subscribers with a terser body that doesn't match those
keywords (#26847), and the recovery path would then mint a fresh
token, get a fresh 403, and loop until Ctrl+C.

Add a defense-in-depth check at the recovery call site: any 403 on
``provider == "xai-oauth"`` short-circuits ``try_refresh_current``
so the error surfaces immediately with the friendly hint from
``_summarize_api_error``. Keeps the existing keyword path for all
other providers untouched.

* test(xai-oauth): pin tier-denied 403 behavior + docs warning for #26847

Tests:

* ``test_refresh_xai_oauth_pure_403_marked_tier_denied_not_relogin`` —
  refresh-403 raises ``xai_oauth_tier_denied`` with
  ``relogin_required=False`` and the API-key fallback hint in body.
* ``test_format_auth_error_tier_denied_does_not_suggest_relogin`` —
  the renderer does not append "Run ``hermes model``" for the new
  code.
* ``test_recover_with_credential_pool_skips_refresh_on_bare_403_for_xai_oauth`` —
  bare ``{"reason":"forbidden","message":"Forbidden"}`` body (which
  does not match the existing keyword heuristic) still short-circuits
  ``try_refresh_current`` on xai-oauth.

Docs:

* Drop the "(any active tier)" claim from the xai-grok-oauth guide,
  add a top-of-page warning callout, and a Troubleshooting section
  for the 403-after-login case pointing at ``XAI_API_KEY`` +
  ``provider: xai`` as the documented fallback.

* fix: handle whitespace-only cron responses

* fix(cli): preserve cron asterisks in strip mode

* fix(mattermost): resolve thread root_id and route progress to threads

Two Mattermost thread-related bugs:

1. _resolve_root_id() — Mattermost CRT requires root_id to be the
   thread root post. Using any reply's own ID as root_id causes
   '400 Invalid RootId'. Add _resolve_root_id() that walks up the
   post chain via API to find the actual root, and apply it in
   send(), _send_url_as_file(), and _send_local_file().

2. _progress_reply_to — The condition in run.py only checked
   Platform.FEISHU, missing Mattermost entirely. This caused tool
   progress messages to always land in the main channel instead of
   the thread. Add Platform.MATTERMOST to the condition so
   progress messages are routed to threads when reply_mode=thread.

Impact: Tool progress messages now appear in Mattermost threads
instead of flooding the main channel; thread replies no longer
fail with Invalid RootId when the reply target is itself a reply.

* feat(kanban): archive --rm to hard-delete archived tasks

Salvages #19964 by @Beandon13. Adds `hermes kanban archive --rm` to
permanently remove already-archived tasks with cascading cleanup of
links, comments, events, runs, and notify-subs. Safety guard: only
archived tasks can be deleted; active/blocked/done must be archived
first.

Cherry-picked from #19964 onto current main (severe stale base, applied
manually to preserve substance only).

* feat(proxy): add xai upstream adapter for Grok via OAuth

* chore(release): map @yannsunn for PR #28064 xai proxy adapter salvage

* docs(skill): align kanban dispatcher failure_limit text with current default

* fix(oauth): add manual-paste fallback for browser-only remote consoles

xAI Grok OAuth (and Spotify) use a loopback redirect to
``http://127.0.0.1:<port>/callback`` to capture the authorization
code. That works when the browser and Hermes run on the same
machine, and the SSH tunnel recipe handles the regular remote
case. It breaks completely on **browser-only remote consoles**
(GCP Cloud Shell, GitHub Codespaces, AWS EC2 Instance Connect,
Gitpod, Replit, …) where the user has a browser but no real SSH
client to forward a port — the redirect to 127.0.0.1 on the
remote VM simply isn't reachable from the laptop, and there's
nothing the existing flow can do about it (#26923).

This commit adds the foundation for a manual-paste fallback:

* ``_is_remote_session`` now also recognises Cloud Shell,
  Codespaces, Gitpod, Replit, StackBlitz (in addition to SSH),
  so the existing tunnel hint at least fires in those
  environments.
* ``_parse_pasted_callback`` accepts any of: a full
  ``http(s)://...?code=...&state=...`` URL, a bare ``?code=...``
  query string, a bare ``code=...&state=...`` fragment, or a
  bare opaque code value.  Returns the same dict shape the HTTP
  callback handler produces, so the caller's state / error
  validation works unchanged (no CSRF bypass).
* ``_prompt_manual_callback_paste`` reads stdin with a clear
  multi-line explanation of what's happening and what to paste.
* ``_xai_oauth_loopback_login`` gains a ``manual_paste`` kwarg
  that skips the HTTP listener entirely.  The redirect_uri,
  PKCE verifier, state, and nonce are byte-identical to the
  loopback path so xAI's token endpoint can't tell the
  difference at the protocol level.
* ``_print_loopback_ssh_hint`` now also mentions
  ``--manual-paste`` so users without a real SSH client see a
  path forward instead of a dead-end tunnel recipe.
* ``_login_xai_oauth`` threads ``args.manual_paste`` into the
  loopback helper.

* feat(cli): wire --manual-paste into ``hermes auth add`` and ``hermes model``

Register the new ``--manual-paste`` flag on both entry points and
thread it through to the xAI loopback login:

* ``hermes auth add xai-oauth --manual-paste`` — pool-add path,
  forwarded inside ``auth_commands.handle_auth_add``.
* ``hermes model --manual-paste`` — model-picker path, forwarded
  by ``_model_flow_xai_oauth`` into the synthetic ``argparse.Namespace``
  it passes to ``_login_xai_oauth``.  The picker also now forwards
  ``--no-browser`` and ``--timeout`` for consistency (previously
  hardcoded to defaults regardless of CLI flags).

Help text on both flags points at #26923 and names the
browser-only remote consoles (Cloud Shell, Codespaces, EC2
Instance Connect) so users searching ``hermes --help`` can find
the workaround.

* test+docs(oauth): pin manual-paste semantics and document browser-only path (#26923)

Tests (``tests/hermes_cli/test_auth_manual_paste.py``):

* 9 parametrised + scalar cases for ``_is_remote_session`` covering
  the new Cloud Shell / Codespaces / Gitpod / Replit / StackBlitz
  env vars (plus the existing SSH ones).
* 9 cases for ``_parse_pasted_callback`` covering every paste form
  (full URL, https URL with extra params, bare ``?code=...``, bare
  ``code=...`` fragment, bare opaque value, error+description,
  empty, whitespace-only, malformed URL).
* 3 cases for ``_prompt_manual_callback_paste`` (happy path, EOF,
  Ctrl-C).
* 3 end-to-end ``_xai_oauth_loopback_login(manual_paste=True)``
  cases: the HTTP server MUST NOT be started (asserted via a
  callable that raises if invoked), wrong state still rejected
  with ``xai_state_mismatch`` (no CSRF bypass), and empty paste
  surfaces ``xai_code_missing``.
* SSH-hint mention test ensures the ``--manual-paste`` instruction
  is printed in the remote-session hint.

Docs:

* ``oauth-over-ssh.md`` — new "Browser-only remote (Cloud Shell /
  Codespaces / EC2 Instance Connect)" section with the
  ``--manual-paste`` recipe, plus a TL;DR note for the new flag.
* ``xai-grok-oauth.md`` — short subsection pointing at the same
  recipe and the OAuth-over-SSH guide anchor.

* docs(kanban): document max-retries task override

* docs(kanban): document inline create shortcuts

* test(kanban): cover default board dashboard pin

* docs: ignore box diagrams in ascii guard

Wrap existing box-drawing diagrams with ascii-guard markers so docs-site checks pass when website docs are touched.

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat: per-task model override for kanban workers

- Add model_override field to Task class and tasks schema
- Add migration for existing databases
- Spawn worker with -m model when model_override is set

* test(kanban-dashboard): cover _task_dict task_age fallback

The fix in 061a1830 added an outer try/except in plugin_api._task_dict
so that a future failure mode in kanban_db.task_age (anything _safe_int
doesn't already absorb) cannot 500 the GET /board response. The
_safe_int / task_age corruption paths got regression coverage in
tests/hermes_cli/test_kanban_db.py, but the OUTER fallback contract
remained untested -- meaning a refactor that drops the try/except would
not be caught by CI.

Pin that contract from both consumers of _task_dict:
- GET /board returns 200 with the literal fallback age dict for the
  affected card (other cards continue to render via the same path)
- GET /tasks/:id (drawer view) returns 200 with the same fallback,
  so a single corrupt task can't block its own drawer

Both tests force task_age to raise RuntimeError rather than ValueError
on '%s', because ValueError is absorbed by _safe_int and never reaches
the outer try/except -- testing that path would only re-cover what
test_kanban_db.py already pins.

Manually verified the regression discipline:
  git checkout 061a1830^ -- plugins/kanban/dashboard/plugin_api.py
  pytest -k task_age_exception        # both FAIL with 500
  git checkout HEAD -- plugins/kanban/dashboard/plugin_api.py
  pytest -k task_age_exception        # both PASS

* fix(kanban): clear _INITIALIZED_PATHS in remove_board so recycled DBs re-init schema

Archiving or deleting a board via remove_board() leaves the path's
"schema already initialized" entry in the module-level cache. A
concurrent connect(board=<slug>) call (e.g. the dashboard event-stream
poll loop) then:

  1. resolves the same kanban.db path,
  2. recreates the directory + an empty sqlite file because
     connect() does mkdir(parents=True, exist_ok=True),
  3. skips the CREATE TABLE pass because the cache entry says the
     schema is already in place,
  4. errors on the next read with `no such table: task_events`.

Drop the cache entry before mutating the filesystem so the fresh file
gets a proper schema init on next connect(). Applies to both
archive=True (rename) and archive=False (rmtree) branches.

Fixes #23833.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(web): add Cache-Control: no-store to plugin static file serving

Prevents browser caching of stale dashboard plugin JS files that may
contain bugs already fixed upstream (e.g. COLUMN_LABEL undefined).

* fix(kanban): seed bundled skills (e.g. kanban-worker) on kanban init

Closes #23725

* fix(kanban): ignore stale HERMES_KANBAN_BOARD for removed boards

* fix(kanban): keep board-management commands independent from board override

* fix(kanban): preserve notifier_profile for dashboard home subscriptions

* fix(kanban): promote dependents when a parent is archived

* fix(cli): make kanban specify max_tokens configurable

* fix(kanban): sync slash subcommands with live parser

* fix(kanban): promote blocked tasks when parent dependencies complete

recompute_ready only scanned 'todo' tasks for promotion, ignoring
'blocked' tasks entirely. When a task was blocked (e.g. by the circuit
breaker) and its parent dependencies later completed, the task stayed
stuck in 'blocked' forever unless manually unblocked.

Now recompute_ready also scans 'blocked' tasks. When all parents are
done/archived, the blocked task is promoted to 'ready' with failure
counters reset — equivalent to an automatic unblock.

Includes a regression test for the blocked-parent-done promotion path.

* fix(kanban): use 'is not None' check for max_runtime_seconds in create_task

max_runtime_seconds=0 was being silently coerced to None due to a falsy
check (if max_runtime_seconds). Zero is a valid value that causes the
dispatcher to immediately time out a task. The adjacent max_retries
parameter already used the correct 'is not None' pattern.

Fixes the inconsistency by aligning max_runtime_seconds with max_retries.

* fix(kanban): reset failure counters on unblock_task

When a task is manually unblocked (blocked → ready/todo), the
consecutive_failures counter and last_failure_error were left intact.
The next failure would immediately re-trip the circuit breaker because
the counter was still at or above the failure limit.

Reset both fields on unblock so the task gets a fresh retry budget.

Includes a regression test that verifies counters are zeroed.

* fix(kanban): fingerprint crash errors to prevent fleet-wide retry exhaustion

When a systemic failure (provider outage, auth expiry, OOM) crashes
multiple workers simultaneously, detect_crashed_workers increments
each task failure counter independently. The circuit breaker only
trips after N × failure_limit retries across the fleet.

Fingerprint crash errors by normalizing host-specific details (PIDs,
timestamps). When 3+ tasks crash with the same fingerprint in a
single detection cycle, immediately trip the circuit breaker
(failure_limit=1) instead of waiting for repeated failures.

Isolated crashes (unique fingerprints) retain their normal retry
budget. Protocol violations continue to trip immediately.

Includes regression tests for systemic and isolated crash paths.

* fix(kanban): align board_exists with board discovery rules

* fix(kanban): demote ready children when a parent is reopened

* fix(kanban): serialize DB initialization

* fix(kanban): task_age() tolerates ISO-8601 timestamps

Prevents ValueError crash in dashboard get_board() when a task has
an ISO timestamp (e.g. "2026-05-10T15:00:00Z") instead of a unix epoch
int. Adds _to_epoch() helper that normalises both formats.

* Fix Kanban dashboard initial board selection

* fix(kanban): persist worker session metadata on completion

Salvages #25579 by @wesleysimplicio. Stamps task_runs.metadata.worker_session_id
from HERMES_SESSION_ID on kanban_complete. Cherry-picked the substantive
commit (not the AUTHOR_MAP fixup tip) onto current main.

* fix(kanban): make claim ttl configurable

Co-Authored-By: Paperclip <noreply@paperclip.ing>

* fix(kanban): pass accept-hooks to worker chat subprocess

* feat(kanban): add board-level default workdir (#25430)

* docs(kanban-worker): document notification routing configuration

* fix(kanban): preserve worker tools with restricted toolsets

* fix(kanban): make legacy task migration idempotent

(cherry picked from commit 293f1c3a7241b0117669e049d9aa746c9645ac90)

* fix: harden Kanban worker Hermes command resolution

* feat(kanban): allow trimmed task comments

SS-1647 live SHIP validation: real code + tests for kanban comment --max-len.

* fix: show scheduled kanban tasks in dashboard

* fix: assign single-task kanban decompositions

* fix(kanban-dashboard): make Orchestration mode checkbox label static

The checkbox label echoed its state ("Auto (default)" / "Manual") instead
of describing the action, so a checked box reading "Auto" parsed as a
status indicator rather than a control. The accompanying sub-description
was also static and started with "When on, ...", which read awkwardly
when the box was unchecked.

Replace the dynamic label with a static action label
("Auto-decompose triage tasks") and flip the sub-description between the
two modes so it stays accurate either way. The top-of-page Orchestration
pill is unchanged — that one is intentionally a status badge / toggle.

Fixes #28178

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(env): add HERMES_KANBAN_DISPATCH_IN_GATEWAY override (#21956)

Salvages the env-vars docs portion of #21956 by @Bartok9.
The ascii-guard-ignore tags from the original PR already landed on main.

* fix(kanban): close sqlite connection on init failure to prevent fd leak

Salvages #28301 by @Ade5954. If WAL setup, PRAGMA application, or schema
init raises after sqlite3.connect() succeeds, the new connection was
leaking. Wrap the body in try/except so the connection is closed before
the exception propagates.

* fix(kanban): don't crash dispatched workers when kanban-worker skill is absent

Salvages #27372 by @oemtalks. The dispatcher unconditionally injected
`--skills kanban-worker` into every worker spawn, but worker profiles
sometimes don't have that bundled skill in their skills dir, which is
fatal at CLI startup (`ValueError: Unknown skill(s): kanban-worker`).

Adds `_kanban_worker_skill_available(hermes_home)` and only injects the
flag when the skill resolves. The MANDATORY lifecycle still ships via
KANBAN_GUIDANCE in the system prompt, so omitting the flag is safe.

* fix(packaging): ship dashboard plugin assets in wheel

Salvages #23737 by @LeonSGP43. Adds plugins/* manifest.json and dist/
glob entries to setuptools package-data so wheel installs ship the
bundled dashboard plugin assets (kanban, achievements, etc.). Without
these, /api/dashboard/plugins can't discover plugin assets outside a
source checkout.

* docs(kanban): document worker protocol auto-blocks

Salvages #21585 by @helix4u. Documents the protocol_violation event
(worker exits successfully while task is still running), adds
--max-retries to the create flag list and --failure-limit to dispatch.

* fix(oneshot): pass fallback_providers from profile config to AIAgent

Salvages #23368 by @uzunkuyruk. Oneshot workers (e.g. kanban workers
spawned via 'hermes -p <profile> chat -q ...') were not honouring the
profile's fallback_providers / fallback_model chain because oneshot.py
never read the config and never passed fallback_model= to AIAgent.

Reads cfg.get('fallback_providers') (new list format) or
cfg.get('fallback_model') (legacy single-dict) with the same
normalization cli.py applies, then forwards as fallback_model=_fb.

* fix(kanban): reject direct running transitions in dashboard bulk updates

Salvages #24050 by @kronexoi. The single-task PATCH already rejects
direct status='running' since it bypasses the dispatcher/claim invariant,
but the bulk-update endpoint still accepted it. Aligns bulk with single
by emitting an error result row for any 'running' entry.

* feat(kanban): add initial-status for human-ops cards

Salvages #27526 by @shunsuke-hikiyama. Adds an --initial-status flag
(running|blocked, default running) to 'kanban create', threaded through
kanban_db.create_task() and the kanban_create tool schema. 'blocked'
parks the task directly in the blocked column for R3 human-ops review,
skipping the brief running-to-blocked transition.

Dropped the unrelated 'add' alias, WIFEXITED Windows compat, and
slash-handler error formatting changes that were bundled in the
original PR — those should ship as their own focused changes if still
wanted.

* fix(kanban): release scratch workspace and tmux session on task completion

Salvages #27369 by @LeonJS. complete_task() now calls _cleanup_workspace()
and _cleanup_worker_tmux() after marking a task complete.

Scratch workspaces (used by swarm agents) accumulate on disk — hundreds
of MB per task, never released. Stale tmux sessions from completed
agents also persist indefinitely.

Both gates are safe:
- workspace_kind == 'scratch' gate preserves user worktree/dir workspaces
- tmux #{pane_dead} == 1 gate only kills sessions where the worker has
  already exited
- best-effort: cleanup failures never block task completion

* fix(kanban): honor severity thresholds in diagnostics

Salvages #26431 by @LeonSGP43. Dashboard plugin_api list_diagnostics
was using exact-match (severity == filter), so '--severity warning'
hid 'error' and 'critical' diagnostics. Adds severity_at_or_above()
helper to kanban_diagnostics and uses it in the dashboard endpoint
(CLI already used SEVERITY_ORDER comparison correctly).

* test: isolate Kanban env pins in hermetic fixture

Salvages the substantive part of #22295 by @steezkelly. Adds the
missing HERMES_KANBAN_HOME, HERMES_KANBAN_RUN_ID, HERMES_KANBAN_CLAIM_LOCK,
HERMES_KANBAN_DISPATCH_IN_GATEWAY entries to _HERMES_BEHAVIORAL_VARS so
ambient developer-shell pins on those vars don't bleed into pytest runs.

The frozenset extraction + standalone regression test from the original
PR were dropped to keep the change minimal — main already maintains the
list inline.

* feat(kanban): add max_in_progress config to cap concurrent running tasks

Salvages #22981 by @SimbaKingjoe. Adds 'kanban.max_in_progress' config
that caps simultaneously running tasks. When the board already has N
running, dispatcher skips spawning so slow workers (local LLMs,
resource-constrained hosts) don't pile up and time out.

Threads through dispatch_once(max_in_progress=) and gateway dispatcher
config parsing with validation (warns on invalid/below-1 values).

* fix(packaging): ship bundled skills in wheel

Salvages #23738 by @LeonSGP43. Wheel installs were missing skills/ and
optional-skills/ because pyproject's [tool.setuptools.packages.find]
only includes Python packages — the skills directories don't have
__init__.py so they were silently dropped from the wheel.

Adds setup.py with data_files spec emitting skills/* and optional-skills/*
under hermes_agent-<v>.data/data/, and a get_bundled_skills_dir() helper
in hermes_constants that discovers the wheel-installed location via
sysconfig before falling back to a source-checkout path. tools/skills_sync
uses the helper so 'hermes update' works for pip-installed users.

* fix: 4 small surgical bugs

Salvages #23302 by @Bartok9. Four independent one-area fixes:

1. kanban boards delete alias now hard-deletes (not archives) — the
   alias didn't carry --delete, so getattr(args, 'delete', False)
   returned False. Detect boards_action=='delete' explicitly.
2. Gateway auto-title failures no longer leak as user-visible
   warnings — debug-log only since they're not actionable.
3. Background process completion notification snaps truncation to
   the next newline boundary, prepends a marker when content is
   dropped.
4. _cprint() schedules the run_in_terminal coroutine via
   asyncio.ensure_future so output isn't silently dropped from
   background threads (fixes #23185 Bug A). Skips the
   double-print fallback that would fire for mock paths.

* perf(prompt): cache kanban worker guidance at session init

Salvages #24402 by @RyanRana. The KANBAN_GUIDANCE block (~835 tokens)
is session-static — the dispatcher decides at spawn time whether the
process is a kanban worker via the kanban_show tool's check_fn (gated
on HERMES_KANBAN_TASK env var). Re-checking 'kanban_show' in
valid_tool_names and re-loading the reference on every system-prompt
rebuild (init + each context compression) is wasted work.

Caches the resolved string on agent._kanban_worker_guidance once in
agent_init and consumes it in system_prompt.build_system_prompt(),
with a getattr fallback for code paths that bypass agent_init.

* feat(kanban): add --sort option to 'hermes kanban list'

Salvages #25745 by @LizerAIDev. Adds --sort {created,created-desc,
priority,priority-desc,status,assignee,title,updated} to 'hermes kanban
list'. Validated against VALID_SORT_ORDERS map; invalid values raise
ValueError. Default behaviour (priority DESC, created ASC) is unchanged
when --sort is omitted.

* docs: add kanban codex lane skill

* feat(kanban): worker visibility endpoints (workers/active, runs/{id}, inspect)

Adds three read-only endpoints to the kanban dashboard plugin so the
SwitchUI workspace (and any other dashboard consumer) can track
workers across tasks without N+1 round-trips through /tasks/{task_id}.

- GET /workers/active
  Single SQL JOIN of task_runs + tasks where ended_at IS NULL,
  worker_pid IS NOT NULL, status='running'. Returns
  {workers: [...], count, checked_at}.

- GET /runs/{run_id}
  Direct lookup of any task_run row by id. Reuses existing
  kanban_db.get_run() helper and _run_dict() serialiser. 404 when
  not found. Mirrors GET /tasks/{task_id} 404 shape.

- GET /runs/{run_id}/inspect
  Live PID stats via psutil.Process.as_dict() — cpu_percent,
  memory_rss_bytes, memory_vms_bytes, num_threads, num_fds, status,
  create_time, cmdline. Short-circuits with alive:false when run
  has ended, has no worker_pid, the pid is gone, or psutil is
  unavailable. AccessDenied surfaces as alive:true with error
  rather than a 500.

11 new tests in tests/plugins/test_kanban_worker_runs.py cover the
empty-board case, running-task case, ended-run filtering,
missing-pid filtering, 404 paths, already-ended inspect, no-pid
inspect, dead-pid inspect, and live-pid inspect (psutil mocked).
All pass.

Companion termination endpoint (POST /runs/{run_id}/terminate) is
intentionally out of scope here — opening a separate issue first
since the RBAC and dispatcher-mediated soft-cancel design needs
maintainer input before code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(release): map contributor email for attribution check

* test(kanban-dashboard): pin enriched 409 detail and inline error wiring (#26744)

- Existing ``test_patch_drag_drop_move_todo_to_ready`` now asserts the
  enriched 409 detail names the blocking parent (id, quoted title, and
  current status), so the dashboard always has something actionable to
  render.
- New bundle-assertion test ``test_dashboard_surfaces_ready_blocked_error_inline``
  pins the frontend wiring: the ``parseApiErrorMessage`` helper exists,
  the drag/drop banner runs through it, and the drawer maintains a
  visible ``patchErr`` state that's cleared between PATCHes and tasks.

* docs(codex_app_server): document multi-root Kanban writable_roots (#27941)

Update the Codex app-server runtime guide's Kanban section to reflect
the new behaviour:

  * The sandbox override now adds the board DB directory plus every
    Kanban path the dispatcher pinned (HERMES_KANBAN_WORKSPACES_ROOT,
    HERMES_KANBAN_WORKSPACE, legacy HERMES_KANBAN_ROOT) -- deduplicated,
    DB-dir first.
  * The motivation note now includes the cross-mount artifact-write
    scenario (e.g. ``/media/.../kanban-workspaces/...`` on a separate
    drive) and links to issue #27941 so readers can find the original
    bug report.

* fix(gateway): quiet corrupt kanban dispatcher boards

Salvages substantive part of #26490 by @aqilaziz. Detects corrupt board
DBs ("file is not a database" / "database disk image is malformed")
and disables them by fingerprint until they're repaired, instead of
flooding the gateway log with repeated logger.exception tracebacks every
tick.

Cherry-picked the substantive commit (ea5b4ec2a); the tip commit was
an unrelated _is_dir OSError fix for service-path lookup. Dropped a
small test reformat that was bundled in the same commit.

* docs: align kanban readiness docs and smoke tests

Salvages #28199 by @bensargotest-sys. Aligns Kanban docs with current
tool registration: dispatcher-spawned task workers get task tools,
profiles that explicitly enable the kanban toolset get orchestrator
routing tools (kanban_list, kanban_unblock). Corrects failure-limit
text to current default of 2. Hardens the e2e subprocess script to
resolve repo root and use the spawnable default assignee. Updates the
diagnostics severity fixture to assert error below the critical
threshold.

* feat(kanban): surface per-task model_override in show + tool output

Salvages #26897 by @loicnico96. The per-task model_override DB column
already exists on main, but it wasn't exposed in user-facing surfaces.
This adds:
- 'kanban show' prints 'model: <name>' when model_override is set
- kanban_show / kanban_list tool responses include the model_override field

Original branch was stale (PR was authored against an older field name
'model'); applied the substantive surface exposure manually using the
current 'model_override' field name.

* feat(cli): add kanban swarm topology helper

Salvages #26791 by @Niraven. Adds 'hermes kanban swarm' to create a
durable Kanban Swarm v1 graph: a completed root/blackboard card,
parallel worker cards, a verifier gated on all workers, and a
synthesizer gated on the verifier. Stores shared swarm blackboard
updates as structured JSON comments on the root card.

Self-contained: new hermes_cli/kanban_swarm.py module + CLI wiring +
unit tests.

* feat(kanban): add optional board parameter to all MCP tools

Salvages #27598 by @nnnet. Adds optional 'board' parameter to all 9
kanban_* MCP tools via shared _connect helper. Backwards compatible —
omitting board keeps current pinned-board behavior. Useful for
orchestrator profiles that route across multiple boards.

Two-file scope: tools/kanban_tools.py + tests.

* feat(kanban): stamp originating ACP session_id on tasks

Salvages #23208 by @awizemann. Tracks which chat session created a
kanban task so clients can render a per-session board without falling
back to tenant + time-window heuristics.

- Schema: tasks gains nullable session_id TEXT column with index
  (additive migration in _migrate_add_optional_columns).
- ACP: server.py exposes the originating session id via HERMES_SESSION_ID
  with save/restore around the agent loop.
- Tool: kanban_create reads HERMES_SESSION_ID (with explicit override).
- CLI: 'hermes kanban list --session <id>' filter; JSON output exposes
  session_id.

* feat(kanban): wire dispatcher to dispatch review agents from review column

Salvages #23772 by @thewillhuang. Adds 'review' as a valid kanban task
status and extends dispatch_once to monitor the review column as a
second dispatch source (in addition to the existing ready column).

- Adds 'review' to VALID_STATUSES
- Adds claim_review_task() — atomically transitions review → running
- Adds has_spawnable_review() — health telemetry mirror
- Extends dispatch_once with a review column dispatch loop
- Review agents get 'sdlc-review' skill auto-loaded

Resolved 2 conflicts (VALID_STATUSES merge with main's 'scheduled' state,
test file additions). Adapted claim_review_task to main's
ttl_seconds: Optional[int] = None convention (matches claim_task).

* feat(kanban): stale detection for running tasks in dispatcher

Salvages #23790 by @thewillhuang. Adds detect_stale_running() to
the dispatcher cycle. Running tasks that have been started for longer
than dispatch_stale_timeout_seconds (default 14400 = 4h) without a
heartbeat in the last hour are auto-reclaimed to ready.

- New config kanban.dispatch_stale_timeout_seconds (default 14400, 0 disables)
- New 'stale' field on DispatchResult
- detect_stale_running() in kanban_db.py with heartbeat freshness check
- Records outcome='stale' on run close + 'stale' event; ticks failure counter
- Wires config through gateway embedded dispatcher
- Updates _cmd_dispatch verbose/JSON output and daemon logging

Resolved test-file end-of-file conflict by appending both halves.

* feat(kanban): filter tasks by workflow fields and runs by status/outcome

Salvages #26745 by @nehaaprasaad. Exposes filtering for the existing
workflow_template_id and current_step_key columns:

- list_tasks() accepts workflow_template_id and current_step_key kwargs
- 'hermes kanban list' adds matching CLI flags
- dashboard plugin_api also exposes the filters

Resolved a small conflict in list_tasks signature alongside main's
session_id and order_by additions; combined all three into the single
filter list.

* feat(kanban): add respawn guard to block repeat worker storms

Salvages #27484 by @fardoche6. Adds a respawn guard that skips worker
spawn for tasks where:
- a recent run already succeeded (recent_success — within guard window)
- the previous run hit a quota/auth error (blocker_auth, also auto-blocks)
- a recent task comment includes a GitHub PR URL (active_pr)

The guard prevents repeat worker storms on the same bug/task. Includes
the contributor's review-findings fixup (regex hardening, observability,
auth coverage).

Resolved a small DispatchResult conflict alongside main's 'stale' field;
kept both. Authorship preserved via rebase merge.

* feat(kanban): show dashboard cron jobs across profiles

Salvages #27568 by @SerenityTn. Dashboard cron page now lists cron
jobs from all profiles, with profile-aware filter UI and storage
routing. Includes test coverage for cross-profile listing, mutation,
deletion, and validation.

Also fixes orphan conflict markers in config.py left by an earlier
salvage merge (kanban.dispatch_stale_timeout_seconds was double-nested
in HEAD/PR markers from #28452 salvage of #23790).

* fix(kanban): remove orphan conflict markers from config.py (#28458)

PR #28452 (salvage of #23790, stale detection) merged with leftover
git conflict markers in hermes_cli/config.py around the
`dispatch_stale_timeout_seconds` config block, breaking config import
and any code path that loads it. Cleans up the markers and keeps both
config blocks (worker log rotation/orchestrator + stale detection).

Resolves a self-introduced regression.

* fix(kanban): remove orphan conflict markers from kanban.py (#28459)

PR #28454 (salvage of #26745, workflow filter) merged with leftover
git conflict markers in hermes_cli/kanban.py at three sites:
- _task_to_dict() (session_id alongside workflow_template_id/current_step_key)
- p_list parser (--sort alongside --workflow-template-id/--step-key)
- _cmd_list (order_by alongside the new filter kwargs)

Cleans up the markers and keeps both halves at each site.

Resolves a self-introduced regression.

* feat(kanban): configure worktree paths and branches

Salvages #26496 by @aqilaziz. Adds branch_name column + CLI flag so
tasks with workspace_kind='worktree' can pin a target branch on
create. Schema migration added to _migrate_add_optional_columns.

- Task.branch_name field + DB column + migration
- create_task accepts branch_name kwarg
- hermes kanban create --branch <name> flag
- kanban show output includes 'Branch: <name>' when set

Cherry-picked the substantive commit (a7558cf27); the PR's tip was
an unrelated service-path-dirs commit. Resolved 2 INSERT-column-list
and show-output conflicts alongside main's session_id and
max_runtime_seconds additions; kept all three.

* feat(skills): add skill bundles — alias /<name> loads multiple skills (#28373)

Skill bundles are tiny YAML files in ~/.hermes/skill-bundles/ that
group several skills under one slash command. Invoking /<bundle-name>
from any surface (CLI, TUI, dashboard, any gateway platform) loads
every referenced skill into a single combined user message.

Use cases:
- /backend-dev → loads github-code-review + test-driven-development
  + github-pr-workflow as one bundle.
- /research → loads several research skills together.
- Team task profiles shared via dotfiles.

Behavior:
- Bundles take precedence over individual skills when slugs collide.
- Missing skills are skipped with a note, not fatal.
- No system-prompt mutation — bundles generate a fresh user message
  at invocation time, the same way /<skill> does. Prompt cache stays
  intact.
- Works in CLI dispatch, gateway dispatch, autocomplete (CLI + TUI),
  /help display.

Schema (~/.hermes/skill-bundles/<slug>.yaml):
    name: backend-dev
    description: Backend feature work.
    skills:
      - github-code-review
      - test-driven-development
    instruction: |
      Optional extra guidance prepended to the loaded skills.

New module: agent/skill_bundles.py — load, scan, resolve, build
invocation message, save, delete. yaml.safe_load only; broken
bundles log a warning and are skipped, never raise.

New CLI subcommand: hermes bundles {list,show,create,delete,reload}.
Implementation in hermes_cli/bundles.py; wired in hermes_cli/main.py.
'bundles' added to _BUILTIN_SUBCOMMANDS so plugin discovery skips it.

New in-session slash command: /bundles lists installed bundles in
both CLI and gateway. /<bundle-name> dispatch added to CLI (cli.py)
and gateway (gateway/run.py) before the existing /<skill-name> path.

Autocomplete: SlashCommandCompleter gained an optional
skill_bundles_provider parameter that defaults to None — the prompt
shows '▣ <description> (N skills)' for bundles vs '⚡' for skills.

Tests:
- tests/agent/test_skill_bundles.py — 33 tests covering slugify,
  scan/cache freshness, resolve (including underscore→hyphen
  Telegram alias), build_bundle_invocation_message (loading, missing
  skills, user/bundle instruction injection, dedup), save/delete,
  reload diff, list sort.
- tests/hermes_cli/test_bundles.py — 8 tests for the CLI
  subcommand (create/list/show/delete/reload, --force, missing
  bundle errors).
- tests/gateway/test_bundles_command.py — 4 tests for the gateway
  handler and bundle resolution priority.

Live E2E: verified subprocess invocations of hermes bundles
{list,create,show,reload,delete} round-trip correctly against an
isolated HERMES_HOME.

Docs:
- website/docs/user-guide/features/skills.md — new 'Skill Bundles'
  section with quick example, YAML schema, management commands,
  behavior notes.
- website/docs/reference/cli-commands.md — 'hermes bundles' added to
  the top-level command table and given its own subcommand section.

* feat(kanban): add scheduled status for delayed follow-ups

Salvages #24533 by @roycepersonalassistant. Adds a first-class
'scheduled' Kanban status for time-delay follow-ups that aren't
waiting on human input.

- hermes kanban schedule <task_id> [reason] CLI command
- Dashboard/API transitions to/from Scheduled
- unblock_task() now releases both 'blocked' AND 'scheduled' tasks
  (re-checking parent dependencies before moving to ready/todo)
- i18n + docs updates

Resolved conflicts: kept HEAD's failure-counter reset on unblock
alongside the PR's scheduled state, kept HEAD's 'running' direct-set
rejection, combined both bulk-status branches. Dropped the dist/
bundle changes (months-stale; would need rebuild from source).

* feat(kanban): drag-to-delete trash zone + bulk delete for task cards

Salvages #28125 by @Jpalmer95. Adds:
- Drag-to-delete trash zone in the kanban dashboard
- Bulk delete endpoint with cascading delete_task cleanup
- Frontend updates (drag visual + drop handler)
- Confirmation prompt before delete

Resolved end-of-file test conflict by appending both halves.

* docs: add Korean Kanban documentation

Salvages #21823 by @pochi-gio. Adds Korean (ko) Docusaurus locale and
translates Kanban documentation (kanban.md, kanban-tutorial.md) and the
two related skills (devops-kanban-orchestrator, devops-kanban-worker).

Purely additive — adds ko to the locales list in docusaurus.config.ts
and creates the website/i18n/ko/ tree.

* fix(tests): catch up six stale tests after compression/aux/kanban changes (#28465)

- aux_config: drop session_search from _AUX_TASKS and remove stale test
  (PR #27590 removed auxiliary.session_search from DEFAULT_CONFIG)
- compression_boundary_hook: set compressor._last_compress_aborted=False
  on MagicMock so the post-compress abort branch (PR #28117) doesn't
  short-circuit before the session-id rotation under test
- kanban_dashboard_plugin: use consecutive_failures=3 so severity stays
  'error' (failure_threshold default dropped from 3 to 2 in d9fef0c8a,
  so failures=5 now crosses the critical floor of 2*2=4)
- cli_manual_compress: accept force kwarg on DummyAgent._compress_context
  (cli._manual_compress now passes force=True)

* fix(telegram): render full clarify choice text in message body, use short button labels

When Telegram clarify prompts offer long choices, mobile clients
truncate the inline button labels, making options unreadable.
Previously only the question was shown in the message body with
truncated choice text in button labels.

Fix: append the full numbered option list to the message body
so users can read complete choice text on any client.  Buttons
now use short numeric labels (1, 2, ...) to avoid Telegram
truncation.  The 'Other (type answer)' button is unchanged.

Long choice labels are now rendered in full (not truncated to
57 chars + '...') since they appear in the body instead of
button labels.

Closes: #27497

* chore(release): map @asdlem for PR #27852 salvage

* fix(telegram): default streaming transport to edit

* fix(telegram): respect reply_to_mode for DM topic reply fallback

The DM topic reply fallback code in send() hardcoded should_thread=True
when telegram_dm_topic_reply_fallback metadata was present, bypassing
_should_thread_reply() and ignoring reply_to_mode config. This caused
quote bubbles on every response even with reply_to_mode: 'off'.

Fix:
- Add reply_to_mode param to _reply_to_message_id_for_send() and
  _thread_kwargs_for_send() classmethods
- In send(), check self._reply_to_mode != 'off' for DM topic fallback
- Suppress reply anchor and reply_to_message_id when mode is 'off'
  while preserving message_thread_id for correct topic routing
- Thread reply_to_mode through all 29 call sites

Regression coverage: 10 new tests in test_telegram_reply_mode.py
covering classmethod behavior, send() integration, and backward
compatibility.

Fixes reply_to_mode: 'off' ignored by Telegram DM topic reply fallback code #23994

* fix(gateway): route Telegram audio file attachments away from STT pipeline (#24870)

Telegram distinguishes three kinds of audio payloads:
  - message.voice  → Opus/OGG voice messages  → STT pipeline  ✓
  - message.audio  → audio file attachments   → bypasses STT  ← was broken
  - message.document (audio mime) → generic file route

**Root cause** — the inbound message routing block in gateway/run.py
matched both MessageType.VOICE *and* MessageType.AUDIO into audio_paths,
which were then fed unconditionally to _enrich_message_with_transcription.
Audio file attachments (.mp3, .m4a, etc.) were therefore auto-transcribed
instead of being treated as files, making the transcribe skill unusable
from Telegram because the path it needed was never surfaced.

**Fix**
- Introduce a new audio_file_paths list populated exclusively by
  MessageType.AUDIO events.
- Narrow the audio_paths selector to MessageType.VOICE (and bare
  audio/ mime-type events that are not explicitly AUDIO or DOCUMENT).
- After the STT block, inject a document-style context note for each
  audio_file_path, giving the agent the file path and asking what to do
  with it (consistent with how plain documents are handled).

**Tests** — 5 new tests in test_telegram_audio_vs_voice.py:
  - voice message still transcribed (regression guard)
  - audio attachment skips STT (core fix)
  - audio attachment context note format
  - STT disabled still produces file note (not STT-disabled notice)
  - MessageType.AUDIO != MessageType.VOICE sanity check

Fixes #24870

* chore(release): map bartok9 noreply for PR #24879 salvage

* fix(send_message): route standalone Telegram sends through TELEGRAM_PROXY

When the send_message tool runs outside the gateway process (agent loop,
TUI, cron, etc.), _gateway_runner_ref() returns None and the standalone
path in _send_telegram constructs Bot(token=token) directly, bypassing
any configured proxy. In regions where api.telegram.org is blocked, the
send times out after ~5s with 'Telegram send failed: Timed out' and
nothing ever shows up in gateway.log because the request never reaches
the gateway.

Resolve TELEGRAM_PROXY (via gateway.platforms.base.resolve_proxy_url,
which also honours HTTPS_PROXY/HTTP_PROXY/ALL_PROXY and NO_PROXY) just
before constructing the Bot. When a proxy is found, attach an
HTTPXRequest(proxy=...) for both 'request' and 'get_updates_request',
matching what gateway/platforms/telegram.py already does for in-gateway
sends and what the Discord standalone sender already does. Any
exception attaching the proxy falls back cleanly to a direct connection,
preserving prior behaviour for users without a proxy configured.

Adds tests/tools/test_send_message_telegram_proxy.py covering both the
proxy-configured and no-proxy cases.

* chore(release): map @pepelax for PR #25419 salvage

* fix(kanban-dashboard): restore implementations dropped during salvages (#28481)

Four kanban dashboard test failures, all from PR salvages that picked up
the test additions but dropped the corresponding implementations.

- BOARD_COLUMNS: add 'review' (status added by PR f55d94a1e but the
  board API never grew the column → test_board_empty failed because
  VALID_STATUSES - {archived} mismatched the rendered columns).
- update_task: enrich the 'ready' 409 detail with the blocking parent
  list (id, title, status) and add _parents_blocking_ready helper.
  Implementation lost in the #26744 salvage (commit e215558ba) which
  pinned the test but not the server-side code.
- dist/index.js: add parseApiErrorMessage helper, wire it through the
  drag/drop banner, add patchErr state to the TaskDrawer and surface
  it inline by the action row. Lost in the same #26744 salvage.
- test_diagnostics_endpoint_severity_filter: update to at-or-above
  semantics (PR a94ddd807 changed the filter from exact-match so the
  warning filter now correctly includes error+critical too).

* fix(gateway): roll over Telegram tool progress bubbles

* fix(gateway): scope audio_file_paths outside media_urls guard

The audio-file-paths handling block at line 7334 references the variable
unconditionally, but #24879 initialized it inside the 'if event.media_urls'
block — so events without media_urls hit UnboundLocalError.

Found via test_run_agent_queued_message_does_not_treat_commentary_as_final
after PR #28478 landed.

* fix(gateway): keep tool-progress edits alive after Telegram flood control

When a progress-message edit hits Telegram flood control (RetryAfter),
can_edit was unconditionally set to False, permanently disabling coalescing
for the rest of the run. Subsequent tool updates were posted as separate
new messages instead of updating the existing progress bubble.

Fix: only set can_edit=False for non-recoverable edit errors. On flood
control, back off by resetting _last_edit_ts so the throttle interval is
respected before the next edit attempt.

Fixes #25188

* chore(release): map @erhnysr for PR #25198 salvage

* fix(telegram): preser…
thcuba added a commit to thcuba/HAOS-hermes-agent that referenced this pull request May 19, 2026
* fix(approval): surface pending-approval state with explicit marker visible to LLM

When a tool call requires user approval in the non-blocking gateway path,
the LLM previously received a result that was indistinguishable from a
failed tool call (exit_code=-1, error=message). The LLM could not tell
whether the tool was pending approval, had returned empty results, or had
failed silently — causing it to burn context on wrong hypotheses.

Fix changes the result format to include:
- status: pending_approval (clear state name)
- approval_pending: True (explicit boolean for LLMs to detect)
- error: cleared to empty string (removes misleading error signal)

This lets the LLM reason about approval latency vs actual errors,
short-circuiting the previous silent failure mode.

Fixes #14806

* fix: recognize emoji and caret as natural response endings

GLM models via Ollama report finish_reason='stop' even when the
response was truncated by max_tokens. The continuation mechanism
uses _has_natural_response_ending() as one of the heuristics to
detect whether the response was genuinely finished.

Currently only ASCII punctuation and CJK punctuation are recognized.
This means any response ending with an emoji (e.g. ⚡, 👍) or the
caret character ^ (common in French ^^ smiley) is not recognized as
naturally ended, triggering a false-positive continuation where the
model receives 'Continue where you left off' and produces garbled
output.

Add:
- ^ (caret) to the punctuation set
- Unicode emoji range (codepoint >= 0x1F300) as natural ending

This only affects GLM/Ollama users but the fix is safe for all
backends since _has_natural_response_ending() is only consulted
inside the continuation flow.

* chore(release): pre-stage AUTHOR_MAP for May 2026 LHF batch group 8 (#28328)

Pre-stages AUTHOR_MAP entries for 10 new contributors whose PRs are being
salvaged in the May 2026 low-hanging-fruit batch (group 8). Lands ahead
of the per-PR salvage PRs so they don't get blocked by AUTHOR_MAP CI.

Contributors:
- AceWattGit (#28159 — _pool_may_recover_from_rate_limit NameError)
- YuanHanzhong (#28032 — x.com/status fallbacks link-like)
- colin-chang (#28245, #28249, #28251 — gateway + mattermost fixes)
- felix-windsor (#28019 — preserve cron asterisks in strip mode)
- houenyang-momo (#28205 — charizard completion menu contrast)
- iqdoctor (#28095 — windows installer docs)
- joe102084 (#28151 — whitespace-only cron responses)
- jvinals (#27936 — Slack U-IDs → DM channel)
- maxmilian (#28267 — ModelPickerDialog portal)
- samggggflynn (#27952 — dingtalk pre_start)

Per references/batch-pr-salvage-may14-additions.md.

* fix: add pre_start() to _IncomingHandler for dingtalk SDK compatibility

The dingtalk-stream SDK calls pre_start() on every registered handler
before opening the WebSocket connection. Without this method, the SDK
raises AttributeError and kills the stream connection, causing DingTalk
to be unable to connect via Stream Mode.

* fix(windows): handle redirected stdout in _cprint fallback

Wraps _pt_print in try/except with a print() fallback. When a
kanban worker's stdout is piped to a log file, prompt_toolkit
raises NoConsoleScreenBufferError (Windows) or OSError (other)
because there is no real console buffer. The fallback keeps
worker output flowing instead of crashing.

* chore(release): alias stale-ID salvage commit for @Grogger (#28334)

PR #28330 was salvaged with a wrong noreply numeric ID (18091625 vs
the correct 7065068). The commit on main is correctly authored to
Grogger by username, but neither noreply form was in AUTHOR_MAP.
Adds both so release-notes generation maps them to @Grogger.

* fix(aux): remove stale session_search model menu entry

* fix(tui): keep x status citation fallbacks link-like

* fix(xai-oauth): quarantine dead tokens on terminal refresh failure

resolve_xai_oauth_runtime_credentials() called _refresh_xai_oauth_tokens()
with no try/except. A terminal refresh failure (HTTP 400/401/403 —
invalid_grant, token revoked) propagated without clearing the dead
access_token / refresh_token from auth.json, causing every subsequent
session to retry the same doomed network request.

Add a try/except around the refresh call that mirrors the existing
credential_pool.py quarantine: when _is_terminal_xai_oauth_refresh_error
identifies a non-retryable failure, clear the dead token fields from
auth.json and write a last_auth_error diagnostic marker so future calls
fail fast with a clear relogin_required error instead of hitting the
network.

active_provider is preserved (set_active=False) so multi-provider users
whose chosen provider is not xai-oauth are unaffected.

Tests: two new cases in test_auth_xai_oauth_provider.py cover terminal
quarantine and transient pass-through.

* feat(bg-review): add bundled/pinned skill protection rules to review prompts (#27644)

The background review prompts (_SKILL_REVIEW_PROMPT and
_COMBINED_REVIEW_PROMPT) now include explicit protection rules
for bundled, hub-installed, and pinned skills — aligning with
the curator's existing policy at curator.py L345/350.

Before this change, bg-review could freely rewrite bundled skills
like 'hermes-agent' or pinned skills, while the 7-day curator
explicitly skips them.

The review agent now sees:
  • Bundled skills (shipped with Hermes)
  • Hub-installed skills (installed via hermes skills install)
  • Pinned skills (marked via hermes curator pin)
If only protected skills need updating, the review says
'Nothing to save.' and stops.

Fixes #27644

* fix(web): portal Change Model modal so it renders above the app sidebar

The dashboard's main column is `relative z-2` (App.tsx), which creates a
stacking context that traps fixed descendants below the app sidebar
(`z-50`). `ModelPickerDialog` renders `fixed inset-0 z-[100]` inline,
so its z-100 is scoped to z-2 and the sidebar covers its left edge.

The bug is visible across all themes but only obvious in the Large theme
variants (Hermes Teal (Large), etc.) where the larger root font widens
the dialog into the sidebar's column. Toast.tsx already documents the
same trap and uses the same `createPortal(..., document.body)` escape.

This commit ports the picker; the same pattern affects other inline
z-[100] modals in the dashboard (OAuthLoginModal, Cron / Models /
Profiles page modals) and is left for a follow-up — keeping this PR
scoped to the reporter's specific case.

Fixes #28103

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(gateway): exit code 75 on service restart so launchd relaunches

When the gateway receives SIGUSR1 (graceful restart via launchd_restart),
the SIGUSR1 handler calls request_restart(via_service=True) and the
gateway shuts down cleanly with exit code 0.

However, the generated launchd plist uses KeepAlive → SuccessfulExit →
false, meaning launchd only relaunches on *non-zero* exit codes.  A
clean exit(0) is treated as "successful, don't restart", so the
gateway stays down after /restart, /update, or SIGUSR1.

The systemd unit template already uses RestartForceExitStatus=75 for the
same scenario.  Mirror that convention: when _restart_via_service is
True, raise SystemExit(75) so launchd's SuccessfulExit=false policy
triggers a relaunch.

Closes #28135

* fix: guard json.loads() against invalid TTS and skill_view responses

Two code paths call json.loads() on output from external tools without
catching JSONDecodeError. If the tool returns a non-JSON string (error
message, empty string, or None), the entire call path crashes.

1. gateway/run.py — text_to_speech_tool() result in voice reply path.
   A TTS failure that returns an error string instead of JSON crashes
   the voice reply handler, killing the message response entirely.

2. cron/scheduler.py — skill_view() result when loading skills for
   cron jobs. A corrupted or missing skill file that returns an error
   string instead of JSON crashes the cron tick, preventing all jobs
   from executing that cycle.

Both fixes catch (json.JSONDecodeError, TypeError), log a warning,
and gracefully skip the failed operation instead of crashing.

* fix(gateway): bridge gateway_restart_notification from YAML platform sections

Two related bugs in gateway/config.py prevented per-platform
gateway_restart_notification from working through config.yaml:

1. The shared-key bridging loop (load_gateway_config) omitted
   'gateway_restart_notification', so the key never landed in
   platform_data['extra'] even when set under e.g. 'discord:' or
   'mattermost:' sections.

2. PlatformConfig.from_dict() only read gateway_restart_notification
   from the top-level data dict, ignoring the 'extra' sub-dict where
   bridged keys are stored.

Fix: add the key to the bridging loop, and add an 'extra' fallback
in from_dict() so that round-tripped values (YAML → bridged → extra
→ from_dict) resolve correctly.

Impact: users can now set gateway_restart_notification: false per
platform in config.yaml instead of relying on env vars or the
global platforms: block.

* feat(kanban): add auto_promote_children config toggle

When the kanban auto-decomposer fans a triage task into child tasks,
recompute_ready() immediately promotes parent-free children to 'ready'
so the dispatcher picks them up. Some users want a manual workflow
where children stay in 'todo' for review before dispatch.

Add 'kanban.auto_promote_children' config key (default: true):
- false: children stay in 'todo' after decomposition
- true: existing behavior (auto-promote to 'ready')

Changes:
- kanban_db.py: decompose_triage_task() gains auto_promote param
- kanban_decompose.py: reads auto_promote_children from config
- kanban dashboard API: exposes the new setting in GET/PUT /orchestration

Closes #28016

* fix: wrap _pool_may_recover_from_rate_limit call through run_agent namespace

The conversation_loop.py references _pool_may_recover_from_rate_limit which
was defined in run_agent.py. After the conversation-loop extraction refactor,
the helper was no longer in the same module scope. Wrap the call as
_ra()._pool_may_recover_from_rate_limit() to route through the run_agent
monkeypatch namespace where the helper is available.

Adds regression test in test_gemini_fast_fallback.py.

Fixes: MAILROOM Email Triage NameError, OPS Execution Monitor NameError.

* fix(tui): improve charizard completion menu contrast

* docs(windows): avoid piping installer directly into iex

* fix(agent): add qwen and deepseek to TOOL_USE_ENFORCEMENT_MODELS

Qwen3.x and DeepSeek-V3.x default to chatty/hallucinatory tool use without
enforcement steering — agents narrate "calling tool X" without actually
emitting a tool call, or run partial loops. Both model families fit the
same failure pattern TOOL_USE_ENFORCEMENT_GUIDANCE was already injected
for (gpt, codex, gemini, gemma, grok, glm).

Co-authored-by: briandevans <252620095+briandevans@users.noreply.github.com>

Squashed salvage of:
- 403e567ce fix(agent): add qwen and deepseek to TOOL_USE_ENFORCEMENT_MODELS
- 9433eabe7 test(agent): use realistic qwen-plus identifier in enforcement test

Fixes #28079.

* fix(send_message): resolve Slack user IDs to DM channel IDs

The _SLACK_TARGET_RE regex only matched IDs starting with C (channel),
G (group), or D (direct message). Slack user IDs start with U, causing
'Could not resolve' errors when trying to send DMs to specific users.

Changes:
- Expand _SLACK_TARGET_RE to accept U-prefixed IDs (user IDs)
- Add conversations.open fallback to resolve user IDs to DM channel
  IDs before sending, since chat.postMessage requires a conversation ID

Fixes #ISSUE_NUMBER

* fix(gateway): tighten MEDIA extraction regex + silent skip on file-not-found

Three related fixes for the MEDIA:<path> extraction pipeline that
caused 'file not found' noise in platform channels:

1. run.py — tighten tool-result MEDIA regex from \S+ (any non-
   whitespace) to require a path pattern with known extensions.
   Prevents LLM-generated placeholder paths like
   'MEDIA:/path/to/example.mp4' from being captured as real media.

2. base.py — remove the |\S+ fallback in extract_media() that
   catches anything non-whitespace as a potential MEDIA path.
   This was the primary cause of false positives — strings like
   '' in tool output were captured as MEDIA: paths.

3. mattermost.py — replace the file-not-found error message sent
   to the channel with a silent logger.warning() skip. When a
   path extracted by MEDIA doesn't exist on disk, the channel
   no longer gets a noisy '(file not found: ...)' message.

Impact: eliminates the persistent 'file not found' spam in
Mattermost channels caused by over-broad MEDIA regex patterns
matching non-path text in tool output.

* fix(xai-oauth): split 403 (tier/entitlement) from 400/401 in token endpoint

xAI's token endpoint returns HTTP 403 to the OAuth grant when the
account isn't on the allowlist for API access (e.g. standard
SuperGrok subscribers — see #26847). Treating it like a stale-token
400/401 made ``format_auth_error`` append "Run ``hermes model`` to
re-authenticate", which is misleading because re-login can't change
xAI's tier decision.

Split 403 off in both ``refresh_xai_oauth_pure`` and the loopback
login token exchange:

* New error code ``xai_oauth_tier_denied`` with ``relogin_required=False``
* Message explains the entitlement gate and points at the
  ``XAI_API_KEY`` + ``provider: xai`` fallback
* 400/401 still set ``relogin_required=True`` as before
* 5xx still set ``relogin_required=False`` as before

* fix(run-agent): treat any 403 on xai-oauth as entitlement to stop refresh-loop

The existing ``_is_entitlement_failure`` heuristic only fires when
the response body contains specific substrings ("do not have an
active Grok subscription", etc.). xAI has been seen to 403 standard
SuperGrok subscribers with a terser body that doesn't match those
keywords (#26847), and the recovery path would then mint a fresh
token, get a fresh 403, and loop until Ctrl+C.

Add a defense-in-depth check at the recovery call site: any 403 on
``provider == "xai-oauth"`` short-circuits ``try_refresh_current``
so the error surfaces immediately with the friendly hint from
``_summarize_api_error``. Keeps the existing keyword path for all
other providers untouched.

* test(xai-oauth): pin tier-denied 403 behavior + docs warning for #26847

Tests:

* ``test_refresh_xai_oauth_pure_403_marked_tier_denied_not_relogin`` —
  refresh-403 raises ``xai_oauth_tier_denied`` with
  ``relogin_required=False`` and the API-key fallback hint in body.
* ``test_format_auth_error_tier_denied_does_not_suggest_relogin`` —
  the renderer does not append "Run ``hermes model``" for the new
  code.
* ``test_recover_with_credential_pool_skips_refresh_on_bare_403_for_xai_oauth`` —
  bare ``{"reason":"forbidden","message":"Forbidden"}`` body (which
  does not match the existing keyword heuristic) still short-circuits
  ``try_refresh_current`` on xai-oauth.

Docs:

* Drop the "(any active tier)" claim from the xai-grok-oauth guide,
  add a top-of-page warning callout, and a Troubleshooting section
  for the 403-after-login case pointing at ``XAI_API_KEY`` +
  ``provider: xai`` as the documented fallback.

* fix: handle whitespace-only cron responses

* fix(cli): preserve cron asterisks in strip mode

* fix(mattermost): resolve thread root_id and route progress to threads

Two Mattermost thread-related bugs:

1. _resolve_root_id() — Mattermost CRT requires root_id to be the
   thread root post. Using any reply's own ID as root_id causes
   '400 Invalid RootId'. Add _resolve_root_id() that walks up the
   post chain via API to find the actual root, and apply it in
   send(), _send_url_as_file(), and _send_local_file().

2. _progress_reply_to — The condition in run.py only checked
   Platform.FEISHU, missing Mattermost entirely. This caused tool
   progress messages to always land in the main channel instead of
   the thread. Add Platform.MATTERMOST to the condition so
   progress messages are routed to threads when reply_mode=thread.

Impact: Tool progress messages now appear in Mattermost threads
instead of flooding the main channel; thread replies no longer
fail with Invalid RootId when the reply target is itself a reply.

* feat(kanban): archive --rm to hard-delete archived tasks

Salvages #19964 by @Beandon13. Adds `hermes kanban archive --rm` to
permanently remove already-archived tasks with cascading cleanup of
links, comments, events, runs, and notify-subs. Safety guard: only
archived tasks can be deleted; active/blocked/done must be archived
first.

Cherry-picked from #19964 onto current main (severe stale base, applied
manually to preserve substance only).

* feat(proxy): add xai upstream adapter for Grok via OAuth

* chore(release): map @yannsunn for PR #28064 xai proxy adapter salvage

* docs(skill): align kanban dispatcher failure_limit text with current default

* fix(oauth): add manual-paste fallback for browser-only remote consoles

xAI Grok OAuth (and Spotify) use a loopback redirect to
``http://127.0.0.1:<port>/callback`` to capture the authorization
code. That works when the browser and Hermes run on the same
machine, and the SSH tunnel recipe handles the regular remote
case. It breaks completely on **browser-only remote consoles**
(GCP Cloud Shell, GitHub Codespaces, AWS EC2 Instance Connect,
Gitpod, Replit, …) where the user has a browser but no real SSH
client to forward a port — the redirect to 127.0.0.1 on the
remote VM simply isn't reachable from the laptop, and there's
nothing the existing flow can do about it (#26923).

This commit adds the foundation for a manual-paste fallback:

* ``_is_remote_session`` now also recognises Cloud Shell,
  Codespaces, Gitpod, Replit, StackBlitz (in addition to SSH),
  so the existing tunnel hint at least fires in those
  environments.
* ``_parse_pasted_callback`` accepts any of: a full
  ``http(s)://...?code=...&state=...`` URL, a bare ``?code=...``
  query string, a bare ``code=...&state=...`` fragment, or a
  bare opaque code value.  Returns the same dict shape the HTTP
  callback handler produces, so the caller's state / error
  validation works unchanged (no CSRF bypass).
* ``_prompt_manual_callback_paste`` reads stdin with a clear
  multi-line explanation of what's happening and what to paste.
* ``_xai_oauth_loopback_login`` gains a ``manual_paste`` kwarg
  that skips the HTTP listener entirely.  The redirect_uri,
  PKCE verifier, state, and nonce are byte-identical to the
  loopback path so xAI's token endpoint can't tell the
  difference at the protocol level.
* ``_print_loopback_ssh_hint`` now also mentions
  ``--manual-paste`` so users without a real SSH client see a
  path forward instead of a dead-end tunnel recipe.
* ``_login_xai_oauth`` threads ``args.manual_paste`` into the
  loopback helper.

* feat(cli): wire --manual-paste into ``hermes auth add`` and ``hermes model``

Register the new ``--manual-paste`` flag on both entry points and
thread it through to the xAI loopback login:

* ``hermes auth add xai-oauth --manual-paste`` — pool-add path,
  forwarded inside ``auth_commands.handle_auth_add``.
* ``hermes model --manual-paste`` — model-picker path, forwarded
  by ``_model_flow_xai_oauth`` into the synthetic ``argparse.Namespace``
  it passes to ``_login_xai_oauth``.  The picker also now forwards
  ``--no-browser`` and ``--timeout`` for consistency (previously
  hardcoded to defaults regardless of CLI flags).

Help text on both flags points at #26923 and names the
browser-only remote consoles (Cloud Shell, Codespaces, EC2
Instance Connect) so users searching ``hermes --help`` can find
the workaround.

* test+docs(oauth): pin manual-paste semantics and document browser-only path (#26923)

Tests (``tests/hermes_cli/test_auth_manual_paste.py``):

* 9 parametrised + scalar cases for ``_is_remote_session`` covering
  the new Cloud Shell / Codespaces / Gitpod / Replit / StackBlitz
  env vars (plus the existing SSH ones).
* 9 cases for ``_parse_pasted_callback`` covering every paste form
  (full URL, https URL with extra params, bare ``?code=...``, bare
  ``code=...`` fragment, bare opaque value, error+description,
  empty, whitespace-only, malformed URL).
* 3 cases for ``_prompt_manual_callback_paste`` (happy path, EOF,
  Ctrl-C).
* 3 end-to-end ``_xai_oauth_loopback_login(manual_paste=True)``
  cases: the HTTP server MUST NOT be started (asserted via a
  callable that raises if invoked), wrong state still rejected
  with ``xai_state_mismatch`` (no CSRF bypass), and empty paste
  surfaces ``xai_code_missing``.
* SSH-hint mention test ensures the ``--manual-paste`` instruction
  is printed in the remote-session hint.

Docs:

* ``oauth-over-ssh.md`` — new "Browser-only remote (Cloud Shell /
  Codespaces / EC2 Instance Connect)" section with the
  ``--manual-paste`` recipe, plus a TL;DR note for the new flag.
* ``xai-grok-oauth.md`` — short subsection pointing at the same
  recipe and the OAuth-over-SSH guide anchor.

* docs(kanban): document max-retries task override

* docs(kanban): document inline create shortcuts

* test(kanban): cover default board dashboard pin

* docs: ignore box diagrams in ascii guard

Wrap existing box-drawing diagrams with ascii-guard markers so docs-site checks pass when website docs are touched.

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat: per-task model override for kanban workers

- Add model_override field to Task class and tasks schema
- Add migration for existing databases
- Spawn worker with -m model when model_override is set

* test(kanban-dashboard): cover _task_dict task_age fallback

The fix in 061a1830 added an outer try/except in plugin_api._task_dict
so that a future failure mode in kanban_db.task_age (anything _safe_int
doesn't already absorb) cannot 500 the GET /board response. The
_safe_int / task_age corruption paths got regression coverage in
tests/hermes_cli/test_kanban_db.py, but the OUTER fallback contract
remained untested -- meaning a refactor that drops the try/except would
not be caught by CI.

Pin that contract from both consumers of _task_dict:
- GET /board returns 200 with the literal fallback age dict for the
  affected card (other cards continue to render via the same path)
- GET /tasks/:id (drawer view) returns 200 with the same fallback,
  so a single corrupt task can't block its own drawer

Both tests force task_age to raise RuntimeError rather than ValueError
on '%s', because ValueError is absorbed by _safe_int and never reaches
the outer try/except -- testing that path would only re-cover what
test_kanban_db.py already pins.

Manually verified the regression discipline:
  git checkout 061a1830^ -- plugins/kanban/dashboard/plugin_api.py
  pytest -k task_age_exception        # both FAIL with 500
  git checkout HEAD -- plugins/kanban/dashboard/plugin_api.py
  pytest -k task_age_exception        # both PASS

* fix(kanban): clear _INITIALIZED_PATHS in remove_board so recycled DBs re-init schema

Archiving or deleting a board via remove_board() leaves the path's
"schema already initialized" entry in the module-level cache. A
concurrent connect(board=<slug>) call (e.g. the dashboard event-stream
poll loop) then:

  1. resolves the same kanban.db path,
  2. recreates the directory + an empty sqlite file because
     connect() does mkdir(parents=True, exist_ok=True),
  3. skips the CREATE TABLE pass because the cache entry says the
     schema is already in place,
  4. errors on the next read with `no such table: task_events`.

Drop the cache entry before mutating the filesystem so the fresh file
gets a proper schema init on next connect(). Applies to both
archive=True (rename) and archive=False (rmtree) branches.

Fixes #23833.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(web): add Cache-Control: no-store to plugin static file serving

Prevents browser caching of stale dashboard plugin JS files that may
contain bugs already fixed upstream (e.g. COLUMN_LABEL undefined).

* fix(kanban): seed bundled skills (e.g. kanban-worker) on kanban init

Closes #23725

* fix(kanban): ignore stale HERMES_KANBAN_BOARD for removed boards

* fix(kanban): keep board-management commands independent from board override

* fix(kanban): preserve notifier_profile for dashboard home subscriptions

* fix(kanban): promote dependents when a parent is archived

* fix(cli): make kanban specify max_tokens configurable

* fix(kanban): sync slash subcommands with live parser

* fix(kanban): promote blocked tasks when parent dependencies complete

recompute_ready only scanned 'todo' tasks for promotion, ignoring
'blocked' tasks entirely. When a task was blocked (e.g. by the circuit
breaker) and its parent dependencies later completed, the task stayed
stuck in 'blocked' forever unless manually unblocked.

Now recompute_ready also scans 'blocked' tasks. When all parents are
done/archived, the blocked task is promoted to 'ready' with failure
counters reset — equivalent to an automatic unblock.

Includes a regression test for the blocked-parent-done promotion path.

* fix(kanban): use 'is not None' check for max_runtime_seconds in create_task

max_runtime_seconds=0 was being silently coerced to None due to a falsy
check (if max_runtime_seconds). Zero is a valid value that causes the
dispatcher to immediately time out a task. The adjacent max_retries
parameter already used the correct 'is not None' pattern.

Fixes the inconsistency by aligning max_runtime_seconds with max_retries.

* fix(kanban): reset failure counters on unblock_task

When a task is manually unblocked (blocked → ready/todo), the
consecutive_failures counter and last_failure_error were left intact.
The next failure would immediately re-trip the circuit breaker because
the counter was still at or above the failure limit.

Reset both fields on unblock so the task gets a fresh retry budget.

Includes a regression test that verifies counters are zeroed.

* fix(kanban): fingerprint crash errors to prevent fleet-wide retry exhaustion

When a systemic failure (provider outage, auth expiry, OOM) crashes
multiple workers simultaneously, detect_crashed_workers increments
each task failure counter independently. The circuit breaker only
trips after N × failure_limit retries across the fleet.

Fingerprint crash errors by normalizing host-specific details (PIDs,
timestamps). When 3+ tasks crash with the same fingerprint in a
single detection cycle, immediately trip the circuit breaker
(failure_limit=1) instead of waiting for repeated failures.

Isolated crashes (unique fingerprints) retain their normal retry
budget. Protocol violations continue to trip immediately.

Includes regression tests for systemic and isolated crash paths.

* fix(kanban): align board_exists with board discovery rules

* fix(kanban): demote ready children when a parent is reopened

* fix(kanban): serialize DB initialization

* fix(kanban): task_age() tolerates ISO-8601 timestamps

Prevents ValueError crash in dashboard get_board() when a task has
an ISO timestamp (e.g. "2026-05-10T15:00:00Z") instead of a unix epoch
int. Adds _to_epoch() helper that normalises both formats.

* Fix Kanban dashboard initial board selection

* fix(kanban): persist worker session metadata on completion

Salvages #25579 by @wesleysimplicio. Stamps task_runs.metadata.worker_session_id
from HERMES_SESSION_ID on kanban_complete. Cherry-picked the substantive
commit (not the AUTHOR_MAP fixup tip) onto current main.

* fix(kanban): make claim ttl configurable

Co-Authored-By: Paperclip <noreply@paperclip.ing>

* fix(kanban): pass accept-hooks to worker chat subprocess

* feat(kanban): add board-level default workdir (#25430)

* docs(kanban-worker): document notification routing configuration

* fix(kanban): preserve worker tools with restricted toolsets

* fix(kanban): make legacy task migration idempotent

(cherry picked from commit 293f1c3a7241b0117669e049d9aa746c9645ac90)

* fix: harden Kanban worker Hermes command resolution

* feat(kanban): allow trimmed task comments

SS-1647 live SHIP validation: real code + tests for kanban comment --max-len.

* fix: show scheduled kanban tasks in dashboard

* fix: assign single-task kanban decompositions

* fix(kanban-dashboard): make Orchestration mode checkbox label static

The checkbox label echoed its state ("Auto (default)" / "Manual") instead
of describing the action, so a checked box reading "Auto" parsed as a
status indicator rather than a control. The accompanying sub-description
was also static and started with "When on, ...", which read awkwardly
when the box was unchecked.

Replace the dynamic label with a static action label
("Auto-decompose triage tasks") and flip the sub-description between the
two modes so it stays accurate either way. The top-of-page Orchestration
pill is unchanged — that one is intentionally a status badge / toggle.

Fixes #28178

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(env): add HERMES_KANBAN_DISPATCH_IN_GATEWAY override (#21956)

Salvages the env-vars docs portion of #21956 by @Bartok9.
The ascii-guard-ignore tags from the original PR already landed on main.

* fix(kanban): close sqlite connection on init failure to prevent fd leak

Salvages #28301 by @Ade5954. If WAL setup, PRAGMA application, or schema
init raises after sqlite3.connect() succeeds, the new connection was
leaking. Wrap the body in try/except so the connection is closed before
the exception propagates.

* fix(kanban): don't crash dispatched workers when kanban-worker skill is absent

Salvages #27372 by @oemtalks. The dispatcher unconditionally injected
`--skills kanban-worker` into every worker spawn, but worker profiles
sometimes don't have that bundled skill in their skills dir, which is
fatal at CLI startup (`ValueError: Unknown skill(s): kanban-worker`).

Adds `_kanban_worker_skill_available(hermes_home)` and only injects the
flag when the skill resolves. The MANDATORY lifecycle still ships via
KANBAN_GUIDANCE in the system prompt, so omitting the flag is safe.

* fix(packaging): ship dashboard plugin assets in wheel

Salvages #23737 by @LeonSGP43. Adds plugins/* manifest.json and dist/
glob entries to setuptools package-data so wheel installs ship the
bundled dashboard plugin assets (kanban, achievements, etc.). Without
these, /api/dashboard/plugins can't discover plugin assets outside a
source checkout.

* docs(kanban): document worker protocol auto-blocks

Salvages #21585 by @helix4u. Documents the protocol_violation event
(worker exits successfully while task is still running), adds
--max-retries to the create flag list and --failure-limit to dispatch.

* fix(oneshot): pass fallback_providers from profile config to AIAgent

Salvages #23368 by @uzunkuyruk. Oneshot workers (e.g. kanban workers
spawned via 'hermes -p <profile> chat -q ...') were not honouring the
profile's fallback_providers / fallback_model chain because oneshot.py
never read the config and never passed fallback_model= to AIAgent.

Reads cfg.get('fallback_providers') (new list format) or
cfg.get('fallback_model') (legacy single-dict) with the same
normalization cli.py applies, then forwards as fallback_model=_fb.

* fix(kanban): reject direct running transitions in dashboard bulk updates

Salvages #24050 by @kronexoi. The single-task PATCH already rejects
direct status='running' since it bypasses the dispatcher/claim invariant,
but the bulk-update endpoint still accepted it. Aligns bulk with single
by emitting an error result row for any 'running' entry.

* feat(kanban): add initial-status for human-ops cards

Salvages #27526 by @shunsuke-hikiyama. Adds an --initial-status flag
(running|blocked, default running) to 'kanban create', threaded through
kanban_db.create_task() and the kanban_create tool schema. 'blocked'
parks the task directly in the blocked column for R3 human-ops review,
skipping the brief running-to-blocked transition.

Dropped the unrelated 'add' alias, WIFEXITED Windows compat, and
slash-handler error formatting changes that were bundled in the
original PR — those should ship as their own focused changes if still
wanted.

* fix(kanban): release scratch workspace and tmux session on task completion

Salvages #27369 by @LeonJS. complete_task() now calls _cleanup_workspace()
and _cleanup_worker_tmux() after marking a task complete.

Scratch workspaces (used by swarm agents) accumulate on disk — hundreds
of MB per task, never released. Stale tmux sessions from completed
agents also persist indefinitely.

Both gates are safe:
- workspace_kind == 'scratch' gate preserves user worktree/dir workspaces
- tmux #{pane_dead} == 1 gate only kills sessions where the worker has
  already exited
- best-effort: cleanup failures never block task completion

* fix(kanban): honor severity thresholds in diagnostics

Salvages #26431 by @LeonSGP43. Dashboard plugin_api list_diagnostics
was using exact-match (severity == filter), so '--severity warning'
hid 'error' and 'critical' diagnostics. Adds severity_at_or_above()
helper to kanban_diagnostics and uses it in the dashboard endpoint
(CLI already used SEVERITY_ORDER comparison correctly).

* test: isolate Kanban env pins in hermetic fixture

Salvages the substantive part of #22295 by @steezkelly. Adds the
missing HERMES_KANBAN_HOME, HERMES_KANBAN_RUN_ID, HERMES_KANBAN_CLAIM_LOCK,
HERMES_KANBAN_DISPATCH_IN_GATEWAY entries to _HERMES_BEHAVIORAL_VARS so
ambient developer-shell pins on those vars don't bleed into pytest runs.

The frozenset extraction + standalone regression test from the original
PR were dropped to keep the change minimal — main already maintains the
list inline.

* feat(kanban): add max_in_progress config to cap concurrent running tasks

Salvages #22981 by @SimbaKingjoe. Adds 'kanban.max_in_progress' config
that caps simultaneously running tasks. When the board already has N
running, dispatcher skips spawning so slow workers (local LLMs,
resource-constrained hosts) don't pile up and time out.

Threads through dispatch_once(max_in_progress=) and gateway dispatcher
config parsing with validation (warns on invalid/below-1 values).

* fix(packaging): ship bundled skills in wheel

Salvages #23738 by @LeonSGP43. Wheel installs were missing skills/ and
optional-skills/ because pyproject's [tool.setuptools.packages.find]
only includes Python packages — the skills directories don't have
__init__.py so they were silently dropped from the wheel.

Adds setup.py with data_files spec emitting skills/* and optional-skills/*
under hermes_agent-<v>.data/data/, and a get_bundled_skills_dir() helper
in hermes_constants that discovers the wheel-installed location via
sysconfig before falling back to a source-checkout path. tools/skills_sync
uses the helper so 'hermes update' works for pip-installed users.

* fix: 4 small surgical bugs

Salvages #23302 by @Bartok9. Four independent one-area fixes:

1. kanban boards delete alias now hard-deletes (not archives) — the
   alias didn't carry --delete, so getattr(args, 'delete', False)
   returned False. Detect boards_action=='delete' explicitly.
2. Gateway auto-title failures no longer leak as user-visible
   warnings — debug-log only since they're not actionable.
3. Background process completion notification snaps truncation to
   the next newline boundary, prepends a marker when content is
   dropped.
4. _cprint() schedules the run_in_terminal coroutine via
   asyncio.ensure_future so output isn't silently dropped from
   background threads (fixes #23185 Bug A). Skips the
   double-print fallback that would fire for mock paths.

* perf(prompt): cache kanban worker guidance at session init

Salvages #24402 by @RyanRana. The KANBAN_GUIDANCE block (~835 tokens)
is session-static — the dispatcher decides at spawn time whether the
process is a kanban worker via the kanban_show tool's check_fn (gated
on HERMES_KANBAN_TASK env var). Re-checking 'kanban_show' in
valid_tool_names and re-loading the reference on every system-prompt
rebuild (init + each context compression) is wasted work.

Caches the resolved string on agent._kanban_worker_guidance once in
agent_init and consumes it in system_prompt.build_system_prompt(),
with a getattr fallback for code paths that bypass agent_init.

* feat(kanban): add --sort option to 'hermes kanban list'

Salvages #25745 by @LizerAIDev. Adds --sort {created,created-desc,
priority,priority-desc,status,assignee,title,updated} to 'hermes kanban
list'. Validated against VALID_SORT_ORDERS map; invalid values raise
ValueError. Default behaviour (priority DESC, created ASC) is unchanged
when --sort is omitted.

* docs: add kanban codex lane skill

* feat(kanban): worker visibility endpoints (workers/active, runs/{id}, inspect)

Adds three read-only endpoints to the kanban dashboard plugin so the
SwitchUI workspace (and any other dashboard consumer) can track
workers across tasks without N+1 round-trips through /tasks/{task_id}.

- GET /workers/active
  Single SQL JOIN of task_runs + tasks where ended_at IS NULL,
  worker_pid IS NOT NULL, status='running'. Returns
  {workers: [...], count, checked_at}.

- GET /runs/{run_id}
  Direct lookup of any task_run row by id. Reuses existing
  kanban_db.get_run() helper and _run_dict() serialiser. 404 when
  not found. Mirrors GET /tasks/{task_id} 404 shape.

- GET /runs/{run_id}/inspect
  Live PID stats via psutil.Process.as_dict() — cpu_percent,
  memory_rss_bytes, memory_vms_bytes, num_threads, num_fds, status,
  create_time, cmdline. Short-circuits with alive:false when run
  has ended, has no worker_pid, the pid is gone, or psutil is
  unavailable. AccessDenied surfaces as alive:true with error
  rather than a 500.

11 new tests in tests/plugins/test_kanban_worker_runs.py cover the
empty-board case, running-task case, ended-run filtering,
missing-pid filtering, 404 paths, already-ended inspect, no-pid
inspect, dead-pid inspect, and live-pid inspect (psutil mocked).
All pass.

Companion termination endpoint (POST /runs/{run_id}/terminate) is
intentionally out of scope here — opening a separate issue first
since the RBAC and dispatcher-mediated soft-cancel design needs
maintainer input before code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(release): map contributor email for attribution check

* test(kanban-dashboard): pin enriched 409 detail and inline error wiring (#26744)

- Existing ``test_patch_drag_drop_move_todo_to_ready`` now asserts the
  enriched 409 detail names the blocking parent (id, quoted title, and
  current status), so the dashboard always has something actionable to
  render.
- New bundle-assertion test ``test_dashboard_surfaces_ready_blocked_error_inline``
  pins the frontend wiring: the ``parseApiErrorMessage`` helper exists,
  the drag/drop banner runs through it, and the drawer maintains a
  visible ``patchErr`` state that's cleared between PATCHes and tasks.

* docs(codex_app_server): document multi-root Kanban writable_roots (#27941)

Update the Codex app-server runtime guide's Kanban section to reflect
the new behaviour:

  * The sandbox override now adds the board DB directory plus every
    Kanban path the dispatcher pinned (HERMES_KANBAN_WORKSPACES_ROOT,
    HERMES_KANBAN_WORKSPACE, legacy HERMES_KANBAN_ROOT) -- deduplicated,
    DB-dir first.
  * The motivation note now includes the cross-mount artifact-write
    scenario (e.g. ``/media/.../kanban-workspaces/...`` on a separate
    drive) and links to issue #27941 so readers can find the original
    bug report.

* fix(gateway): quiet corrupt kanban dispatcher boards

Salvages substantive part of #26490 by @aqilaziz. Detects corrupt board
DBs ("file is not a database" / "database disk image is malformed")
and disables them by fingerprint until they're repaired, instead of
flooding the gateway log with repeated logger.exception tracebacks every
tick.

Cherry-picked the substantive commit (ea5b4ec2a); the tip commit was
an unrelated _is_dir OSError fix for service-path lookup. Dropped a
small test reformat that was bundled in the same commit.

* docs: align kanban readiness docs and smoke tests

Salvages #28199 by @bensargotest-sys. Aligns Kanban docs with current
tool registration: dispatcher-spawned task workers get task tools,
profiles that explicitly enable the kanban toolset get orchestrator
routing tools (kanban_list, kanban_unblock). Corrects failure-limit
text to current default of 2. Hardens the e2e subprocess script to
resolve repo root and use the spawnable default assignee. Updates the
diagnostics severity fixture to assert error below the critical
threshold.

* feat(kanban): surface per-task model_override in show + tool output

Salvages #26897 by @loicnico96. The per-task model_override DB column
already exists on main, but it wasn't exposed in user-facing surfaces.
This adds:
- 'kanban show' prints 'model: <name>' when model_override is set
- kanban_show / kanban_list tool responses include the model_override field

Original branch was stale (PR was authored against an older field name
'model'); applied the substantive surface exposure manually using the
current 'model_override' field name.

* feat(cli): add kanban swarm topology helper

Salvages #26791 by @Niraven. Adds 'hermes kanban swarm' to create a
durable Kanban Swarm v1 graph: a completed root/blackboard card,
parallel worker cards, a verifier gated on all workers, and a
synthesizer gated on the verifier. Stores shared swarm blackboard
updates as structured JSON comments on the root card.

Self-contained: new hermes_cli/kanban_swarm.py module + CLI wiring +
unit tests.

* feat(kanban): add optional board parameter to all MCP tools

Salvages #27598 by @nnnet. Adds optional 'board' parameter to all 9
kanban_* MCP tools via shared _connect helper. Backwards compatible —
omitting board keeps current pinned-board behavior. Useful for
orchestrator profiles that route across multiple boards.

Two-file scope: tools/kanban_tools.py + tests.

* feat(kanban): stamp originating ACP session_id on tasks

Salvages #23208 by @awizemann. Tracks which chat session created a
kanban task so clients can render a per-session board without falling
back to tenant + time-window heuristics.

- Schema: tasks gains nullable session_id TEXT column with index
  (additive migration in _migrate_add_optional_columns).
- ACP: server.py exposes the originating session id via HERMES_SESSION_ID
  with save/restore around the agent loop.
- Tool: kanban_create reads HERMES_SESSION_ID (with explicit override).
- CLI: 'hermes kanban list --session <id>' filter; JSON output exposes
  session_id.

* feat(kanban): wire dispatcher to dispatch review agents from review column

Salvages #23772 by @thewillhuang. Adds 'review' as a valid kanban task
status and extends dispatch_once to monitor the review column as a
second dispatch source (in addition to the existing ready column).

- Adds 'review' to VALID_STATUSES
- Adds claim_review_task() — atomically transitions review → running
- Adds has_spawnable_review() — health telemetry mirror
- Extends dispatch_once with a review column dispatch loop
- Review agents get 'sdlc-review' skill auto-loaded

Resolved 2 conflicts (VALID_STATUSES merge with main's 'scheduled' state,
test file additions). Adapted claim_review_task to main's
ttl_seconds: Optional[int] = None convention (matches claim_task).

* feat(kanban): stale detection for running tasks in dispatcher

Salvages #23790 by @thewillhuang. Adds detect_stale_running() to
the dispatcher cycle. Running tasks that have been started for longer
than dispatch_stale_timeout_seconds (default 14400 = 4h) without a
heartbeat in the last hour are auto-reclaimed to ready.

- New config kanban.dispatch_stale_timeout_seconds (default 14400, 0 disables)
- New 'stale' field on DispatchResult
- detect_stale_running() in kanban_db.py with heartbeat freshness check
- Records outcome='stale' on run close + 'stale' event; ticks failure counter
- Wires config through gateway embedded dispatcher
- Updates _cmd_dispatch verbose/JSON output and daemon logging

Resolved test-file end-of-file conflict by appending both halves.

* feat(kanban): filter tasks by workflow fields and runs by status/outcome

Salvages #26745 by @nehaaprasaad. Exposes filtering for the existing
workflow_template_id and current_step_key columns:

- list_tasks() accepts workflow_template_id and current_step_key kwargs
- 'hermes kanban list' adds matching CLI flags
- dashboard plugin_api also exposes the filters

Resolved a small conflict in list_tasks signature alongside main's
session_id and order_by additions; combined all three into the single
filter list.

* feat(kanban): add respawn guard to block repeat worker storms

Salvages #27484 by @fardoche6. Adds a respawn guard that skips worker
spawn for tasks where:
- a recent run already succeeded (recent_success — within guard window)
- the previous run hit a quota/auth error (blocker_auth, also auto-blocks)
- a recent task comment includes a GitHub PR URL (active_pr)

The guard prevents repeat worker storms on the same bug/task. Includes
the contributor's review-findings fixup (regex hardening, observability,
auth coverage).

Resolved a small DispatchResult conflict alongside main's 'stale' field;
kept both. Authorship preserved via rebase merge.

* feat(kanban): show dashboard cron jobs across profiles

Salvages #27568 by @SerenityTn. Dashboard cron page now lists cron
jobs from all profiles, with profile-aware filter UI and storage
routing. Includes test coverage for cross-profile listing, mutation,
deletion, and validation.

Also fixes orphan conflict markers in config.py left by an earlier
salvage merge (kanban.dispatch_stale_timeout_seconds was double-nested
in HEAD/PR markers from #28452 salvage of #23790).

* fix(kanban): remove orphan conflict markers from config.py (#28458)

PR #28452 (salvage of #23790, stale detection) merged with leftover
git conflict markers in hermes_cli/config.py around the
`dispatch_stale_timeout_seconds` config block, breaking config import
and any code path that loads it. Cleans up the markers and keeps both
config blocks (worker log rotation/orchestrator + stale detection).

Resolves a self-introduced regression.

* fix(kanban): remove orphan conflict markers from kanban.py (#28459)

PR #28454 (salvage of #26745, workflow filter) merged with leftover
git conflict markers in hermes_cli/kanban.py at three sites:
- _task_to_dict() (session_id alongside workflow_template_id/current_step_key)
- p_list parser (--sort alongside --workflow-template-id/--step-key)
- _cmd_list (order_by alongside the new filter kwargs)

Cleans up the markers and keeps both halves at each site.

Resolves a self-introduced regression.

* feat(kanban): configure worktree paths and branches

Salvages #26496 by @aqilaziz. Adds branch_name column + CLI flag so
tasks with workspace_kind='worktree' can pin a target branch on
create. Schema migration added to _migrate_add_optional_columns.

- Task.branch_name field + DB column + migration
- create_task accepts branch_name kwarg
- hermes kanban create --branch <name> flag
- kanban show output includes 'Branch: <name>' when set

Cherry-picked the substantive commit (a7558cf27); the PR's tip was
an unrelated service-path-dirs commit. Resolved 2 INSERT-column-list
and show-output conflicts alongside main's session_id and
max_runtime_seconds additions; kept all three.

* feat(skills): add skill bundles — alias /<name> loads multiple skills (#28373)

Skill bundles are tiny YAML files in ~/.hermes/skill-bundles/ that
group several skills under one slash command. Invoking /<bundle-name>
from any surface (CLI, TUI, dashboard, any gateway platform) loads
every referenced skill into a single combined user message.

Use cases:
- /backend-dev → loads github-code-review + test-driven-development
  + github-pr-workflow as one bundle.
- /research → loads several research skills together.
- Team task profiles shared via dotfiles.

Behavior:
- Bundles take precedence over individual skills when slugs collide.
- Missing skills are skipped with a note, not fatal.
- No system-prompt mutation — bundles generate a fresh user message
  at invocation time, the same way /<skill> does. Prompt cache stays
  intact.
- Works in CLI dispatch, gateway dispatch, autocomplete (CLI + TUI),
  /help display.

Schema (~/.hermes/skill-bundles/<slug>.yaml):
    name: backend-dev
    description: Backend feature work.
    skills:
      - github-code-review
      - test-driven-development
    instruction: |
      Optional extra guidance prepended to the loaded skills.

New module: agent/skill_bundles.py — load, scan, resolve, build
invocation message, save, delete. yaml.safe_load only; broken
bundles log a warning and are skipped, never raise.

New CLI subcommand: hermes bundles {list,show,create,delete,reload}.
Implementation in hermes_cli/bundles.py; wired in hermes_cli/main.py.
'bundles' added to _BUILTIN_SUBCOMMANDS so plugin discovery skips it.

New in-session slash command: /bundles lists installed bundles in
both CLI and gateway. /<bundle-name> dispatch added to CLI (cli.py)
and gateway (gateway/run.py) before the existing /<skill-name> path.

Autocomplete: SlashCommandCompleter gained an optional
skill_bundles_provider parameter that defaults to None — the prompt
shows '▣ <description> (N skills)' for bundles vs '⚡' for skills.

Tests:
- tests/agent/test_skill_bundles.py — 33 tests covering slugify,
  scan/cache freshness, resolve (including underscore→hyphen
  Telegram alias), build_bundle_invocation_message (loading, missing
  skills, user/bundle instruction injection, dedup), save/delete,
  reload diff, list sort.
- tests/hermes_cli/test_bundles.py — 8 tests for the CLI
  subcommand (create/list/show/delete/reload, --force, missing
  bundle errors).
- tests/gateway/test_bundles_command.py — 4 tests for the gateway
  handler and bundle resolution priority.

Live E2E: verified subprocess invocations of hermes bundles
{list,create,show,reload,delete} round-trip correctly against an
isolated HERMES_HOME.

Docs:
- website/docs/user-guide/features/skills.md — new 'Skill Bundles'
  section with quick example, YAML schema, management commands,
  behavior notes.
- website/docs/reference/cli-commands.md — 'hermes bundles' added to
  the top-level command table and given its own subcommand section.

* feat(kanban): add scheduled status for delayed follow-ups

Salvages #24533 by @roycepersonalassistant. Adds a first-class
'scheduled' Kanban status for time-delay follow-ups that aren't
waiting on human input.

- hermes kanban schedule <task_id> [reason] CLI command
- Dashboard/API transitions to/from Scheduled
- unblock_task() now releases both 'blocked' AND 'scheduled' tasks
  (re-checking parent dependencies before moving to ready/todo)
- i18n + docs updates

Resolved conflicts: kept HEAD's failure-counter reset on unblock
alongside the PR's scheduled state, kept HEAD's 'running' direct-set
rejection, combined both bulk-status branches. Dropped the dist/
bundle changes (months-stale; would need rebuild from source).

* feat(kanban): drag-to-delete trash zone + bulk delete for task cards

Salvages #28125 by @Jpalmer95. Adds:
- Drag-to-delete trash zone in the kanban dashboard
- Bulk delete endpoint with cascading delete_task cleanup
- Frontend updates (drag visual + drop handler)
- Confirmation prompt before delete

Resolved end-of-file test conflict by appending both halves.

* docs: add Korean Kanban documentation

Salvages #21823 by @pochi-gio. Adds Korean (ko) Docusaurus locale and
translates Kanban documentation (kanban.md, kanban-tutorial.md) and the
two related skills (devops-kanban-orchestrator, devops-kanban-worker).

Purely additive — adds ko to the locales list in docusaurus.config.ts
and creates the website/i18n/ko/ tree.

* fix(tests): catch up six stale tests after compression/aux/kanban changes (#28465)

- aux_config: drop session_search from _AUX_TASKS and remove stale test
  (PR #27590 removed auxiliary.session_search from DEFAULT_CONFIG)
- compression_boundary_hook: set compressor._last_compress_aborted=False
  on MagicMock so the post-compress abort branch (PR #28117) doesn't
  short-circuit before the session-id rotation under test
- kanban_dashboard_plugin: use consecutive_failures=3 so severity stays
  'error' (failure_threshold default dropped from 3 to 2 in d9fef0c8a,
  so failures=5 now crosses the critical floor of 2*2=4)
- cli_manual_compress: accept force kwarg on DummyAgent._compress_context
  (cli._manual_compress now passes force=True)

* fix(telegram): render full clarify choice text in message body, use short button labels

When Telegram clarify prompts offer long choices, mobile clients
truncate the inline button labels, making options unreadable.
Previously only the question was shown in the message body with
truncated choice text in button labels.

Fix: append the full numbered option list to the message body
so users can read complete choice text on any client.  Buttons
now use short numeric labels (1, 2, ...) to avoid Telegram
truncation.  The 'Other (type answer)' button is unchanged.

Long choice labels are now rendered in full (not truncated to
57 chars + '...') since they appear in the body instead of
button labels.

Closes: #27497

* chore(release): map @asdlem for PR #27852 salvage

* fix(telegram): default streaming transport to edit

* fix(telegram): respect reply_to_mode for DM topic reply fallback

The DM topic reply fallback code in send() hardcoded should_thread=True
when telegram_dm_topic_reply_fallback metadata was present, bypassing
_should_thread_reply() and ignoring reply_to_mode config. This caused
quote bubbles on every response even with reply_to_mode: 'off'.

Fix:
- Add reply_to_mode param to _reply_to_message_id_for_send() and
  _thread_kwargs_for_send() classmethods
- In send(), check self._reply_to_mode != 'off' for DM topic fallback
- Suppress reply anchor and reply_to_message_id when mode is 'off'
  while preserving message_thread_id for correct topic routing
- Thread reply_to_mode through all 29 call sites

Regression coverage: 10 new tests in test_telegram_reply_mode.py
covering classmethod behavior, send() integration, and backward
compatibility.

Fixes reply_to_mode: 'off' ignored by Telegram DM topic reply fallback code #23994

* fix(gateway): route Telegram audio file attachments away from STT pipeline (#24870)

Telegram distinguishes three kinds of audio payloads:
  - message.voice  → Opus/OGG voice messages  → STT pipeline  ✓
  - message.audio  → audio file attachments   → bypasses STT  ← was broken
  - message.document (audio mime) → generic file route

**Root cause** — the inbound message routing block in gateway/run.py
matched both MessageType.VOICE *and* MessageType.AUDIO into audio_paths,
which were then fed unconditionally to _enrich_message_with_transcription.
Audio file attachments (.mp3, .m4a, etc.) were therefore auto-transcribed
instead of being treated as files, making the transcribe skill unusable
from Telegram because the path it needed was never surfaced.

**Fix**
- Introduce a new audio_file_paths list populated exclusively by
  MessageType.AUDIO events.
- Narrow the audio_paths selector to MessageType.VOICE (and bare
  audio/ mime-type events that are not explicitly AUDIO or DOCUMENT).
- After the STT block, inject a document-style context note for each
  audio_file_path, giving the agent the file path and asking what to do
  with it (consistent with how plain documents are handled).

**Tests** — 5 new tests in test_telegram_audio_vs_voice.py:
  - voice message still transcribed (regression guard)
  - audio attachment skips STT (core fix)
  - audio attachment context note format
  - STT disabled still produces file note (not STT-disabled notice)
  - MessageType.AUDIO != MessageType.VOICE sanity check

Fixes #24870

* chore(release): map bartok9 noreply for PR #24879 salvage

* fix(send_message): route standalone Telegram sends through TELEGRAM_PROXY

When the send_message tool runs outside the gateway process (agent loop,
TUI, cron, etc.), _gateway_runner_ref() returns None and the standalone
path in _send_telegram constructs Bot(token=token) directly, bypassing
any configured proxy. In regions where api.telegram.org is blocked, the
send times out after ~5s with 'Telegram send failed: Timed out' and
nothing ever shows up in gateway.log because the request never reaches
the gateway.

Resolve TELEGRAM_PROXY (via gateway.platforms.base.resolve_proxy_url,
which also honours HTTPS_PROXY/HTTP_PROXY/ALL_PROXY and NO_PROXY) just
before constructing the Bot. When a proxy is found, attach an
HTTPXRequest(proxy=...) for both 'request' and 'get_updates_request',
matching what gateway/platforms/telegram.py already does for in-gateway
sends and what the Discord standalone sender already does. Any
exception attaching the proxy falls back cleanly to a direct connection,
preserving prior behaviour for users without a proxy configured.

Adds tests/tools/test_send_message_telegram_proxy.py covering both the
proxy-configured and no-proxy cases.

* chore(release): map @pepelax for PR #25419 salvage

* fix(kanban-dashboard): restore implementations dropped during salvages (#28481)

Four kanban dashboard test failures, all from PR salvages that picked up
the test additions but dropped the corresponding implementations.

- BOARD_COLUMNS: add 'review' (status added by PR f55d94a1e but the
  board API never grew the column → test_board_empty failed because
  VALID_STATUSES - {archived} mismatched the rendered columns).
- update_task: enrich the 'ready' 409 detail with the blocking parent
  list (id, title, status) and add _parents_blocking_ready helper.
  Implementation lost in the #26744 salvage (commit e215558ba) which
  pinned the test but not the server-side code.
- dist/index.js: add parseApiErrorMessage helper, wire it through the
  drag/drop banner, add patchErr state to the TaskDrawer and surface
  it inline by the action row. Lost in the same #26744 salvage.
- test_diagnostics_endpoint_severity_filter: update to at-or-above
  semantics (PR a94ddd807 changed the filter from exact-match so the
  warning filter now correctly includes error+critical too).

* fix(gateway): roll over Telegram tool progress bubbles

* fix(gateway): scope audio_file_paths outside media_urls guard

The audio-file-paths handling block at line 7334 references the variable
unconditionally, but #24879 initialized it inside the 'if event.media_urls'
block — so events without media_urls hit UnboundLocalError.

Found via test_run_agent_queued_message_does_not_treat_commentary_as_final
after PR #28478 landed.

* fix(gateway): keep tool-progress edits alive after Telegram flood control

When a progress-message edit hits Telegram flood control (RetryAfter),
can_edit was unconditionally set to False, permanently disabling coalescing
for the rest of the run. Subsequent tool updates were posted as separate
new messages instead of updating the existing progress bubble.

Fix: only set can_edit=False for non-recoverable edit errors. On flood
control, back off by resetting _last_edit_ts so the throttle interval is
respected before the next edit attempt.

Fixes #25188

* chore(release): map @erhnysr for PR #25198 salvage

* fix(telegram): preserve can_edit after transient network errors in progress edits (#27828)

When edit_message_text fails with a transient error (httpx.ConnectError,
NetworkError, server disconnected, timeouts), the progress-message sender
must not permanently set can_edit = False — that would convert a single
Telegram network hiccup into separate per-tool bubbles for the rest of the run.

Changes:
- gateway/platforms/telegram.py: edit_message now returns retryable=True for
  transient network errors (ConnectError, NetworkError, timeouts, server
  disconnects, temporarily unavailable). Permanent failures (flood control,
  message-not-found, permissions) remain retryable=False.
- gateway/run.py: send_progress_messages checks result.retryable before
  setting can_edit = False. Transient failures skip the fallback-send and
  continue — the next edit cycle catches up with the accumulated lines.
  Permanent failures (flood, message-not-found, etc.) still disable editing.

Tests: 22 new tests in test_telegram_progress_edit_transient.py covering
transient vs permanent error classification, SendResult.retryable semantics,
and the can_edit decision logic.

Fixes #27828

* fix(telegram): recover from post-update polling conflict without entering limbo

* fix(test+release): update conflict retry count for MAX=5; map @CryptoByz

* fix(gateway): route background-process notifications into Telegram DM topics

Background-process completion notifications (notify_on_complete) and
watch-pattern notifications were always delivered to the Telegram main
chat instead of the originating private-chat topic.

Hermes-created Telegram DM topic lanes only render a send when it carries
both message_thread_id and a reply anchor. The synthetic MessageEvent
injected on process completion had no message_id, so _reply_anchor_for_event
returned None and _thread_kwargs_for_send dropped message_thread_id
entirely — routing the notification to the main chat.

Capture the triggering message id at spawn time and thread it through to
the synthetic event so it can be reply-anchored back into the topic:

- session_context: add HERMES_SESSION_MESSAGE_ID context var
- telegram adapter: populate SessionSource.message_id on inbound messages
- terminal tool: persist watcher_message_id on the process session
- process registry: carry/persist message_id on watcher dicts + checkpoint
- gateway: set MessageEvent.message_id on injected notifications

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(release): map @fabiosiqueira for PR #27212 salvage

* fix(telegram): route resumed DM topic sends directly

* fix(telegram): enforce TELEGRAM_ALLOWED_USERS allowlist on inbound messages

TELEGRAM_ALLOWED_USERS was only checked for callback/inline-button
actions but not for inbound messages. Unauthorized users triggered an
'Unauthorized user' log warning but their messages were still processed
by the agent — a P0 security bypass (issue #23778).

Fix: add allowlist check in _should_process_message() which is called
for all message types (text, command, media, location). If the sender
is not in TELEGRAM_ALLOWED_USERS, the message is dropped immediately
with a warning log. Empty TELEGRAM_ALLOWED_USERS continues to allow
all users (existing behavior).

Fixes #23778

* fix(telegram): fail-closed auth fallback when TELEGRAM_ALLOWED_USERS is empty

The _is_callback_user_authorized fallback returned True when
TELEGRAM_ALLOWED_USERS was not set, allowing any Telegram user
to interact with the bot. Change to fail-closed: deny by default
unless GATEWAY_ALLOW_ALL_USERS=true is explicitly set.

Fixes #24457

* test(telegram): stub _is_callback_user_authorized in trigger-gating fixture

After PR #24468 made the empty-allowlist callback auth fail-closed
(and #23795 wired _is_callback_user_authorized into _should_process_message),
trigger-gating tests started failing because their fake messages from
user 111 hit the new deny-by-default path before trigger evaluation.

Force-authorize all senders in _make_adapter() so the trigger logic
under test runs.  The fail-closed behavior itself is covered by
test_telegram_callback_auth_fail_closed.py.

* fix(telegram): reset sticky fallback IP on connect failure, retry primary DNS

When a sticky fallback IP (from DoH discovery) becomes unreachable,
the transport previously got stuck in an attempt_order that only
tried the dead IP.  This prevented the gateway from recovering
until the service was restarted.

Changes:
- Always include primary DNS path (None) after the sticky IP in the
  attempt_order so that a primary-path retry happens on sticky failure.
- Reset self._sticky_ip to None when the currently sticky IP hits
  a connect timeout / connect error, allowing the next request to
  retry from scratch.

Fixes silent Telegram disconnection when discovered fallback IPs
are transiently or permanently unreachable.

* test+release: align stale sticky-IP test for #24511; map @falconexe

* fix(telegram): propagate extra base_url config

* feat(send_message): auto-detect @username mentions and create Telegram entities

When sending messages containing @username patterns, auto-generate
MessageEntity(type='mention') entries so that the receiving bot's
require_mention filter can trigger. This enables proper bot-to-bot
interop where mention-based routing is used.

* test+release: align send_message mocks for MessageEntity import; map @fonhal

* fix(telegram): resume typing indicator after inline approval click (#27853)

The text /approve and /deny paths in gateway/run.py call
resume_typing_for_chat() after resolve_gateway_approval() succeeds, but
the Telegram inline-button (ea:*) callback in _handle_callback_query did
not. Typing is paused when the approval is sent (gateway/run.py:15658),
so without a matching resume the typing indicator stayed gone for the
remainder of a long-running turn after a button click.

Symmetry-match the text path: after a successful resolve, call
self.resume_typing_for_chat(str(query_chat_id)). Guarded by count > 0
to matc…
dimavrem22 pushed a commit to inkbox-ai/hermes-agent that referenced this pull request May 20, 2026
* fix(acp): treat polished tool error payloads as failed

* fix(acp): also mark raised-exception tool results as failed

Extends #26573 to also catch the case the original PR deliberately left
out: when a tool raises an exception, the agent's tool executor wraps it
in a canonical 'Error executing tool '<name>': ...' string prefix (see
agent/tool_executor.py around the try/except). That prefix is unique to
the wrapper and cannot legitimately appear in well-behaved tool output,
so it is a safe signal that the tool blew up.

Without this, the canonical 'tool raised' case still rendered as a green
'completed' row in Zed despite being a runtime failure — exactly the
class of bug #26573 set out to fix.

Adds a positive test (raised-exception prefix -> failed) and a negative
test (bare 'Error:' word in legit tool output stays completed) so a
future contributor doesn't accidentally widen the rule to false-positive
on compiler/linter diagnostics.

* fix(acp): refresh session info after auto-title

* fix(acp): use refresh moment as updated_at on session info push

Follow-up to #26543. The sessions table does not have an updated_at
column (see hermes_state.py — only started_at/ended_at), so
row.get('updated_at') always returned None and the str() coercion was
dead code. Use datetime.now(UTC).isoformat() instead, which reflects
exactly what the field means here: 'the title was refreshed at this
moment'. Drop the dead coercion.

* feat(acp): enrich permission request cards

* feat(web): mobile dashboard UX polish (#28127)

* feat(web): mobile dashboard UX polish

Bottom sheets for sidebar theme/language pickers on narrow viewports with
enter/exit animation and drag-to-close; inline header badges beside titles;
bottom padding on the route outlet for scroll clearance; profiles loading uses a
unicode braille spinner; align profile/cron card actions to the top; viewport-fit
cover and supporting layout tweaks across dashboard pages.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix Nix web npm hash and mobile sheet accessibility.

Align fetchNpmDeps in nix/web.nix with web/package-lock.json for CI. Improve BottomPickSheet backdrop labeling, avoid aria-hidden on the dialog during exit animation, and wire theme/language sheets with listbox semantics and localized dismiss labels.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(install.ps1): strip BOM, add -Commit/-Tag pin params, harden git ops

Three install.ps1 improvements pulled from the thin-installer work on
bb/gui (PR #27822) that benefit the canonical CLI install flow on main:

1. Strip UTF-8 BOM from scripts/install.ps1.

   The canonical 'irm <raw URL> | iex' install flow has been broken
   since commit 4279da4db re-introduced a UTF-8 BOM that PR #27224
   had explicitly stripped. PowerShell 5.1's 'irm' returns the
   response body as a string with the BOM surviving as a leading
   \ufeff character; 'iex' then evaluates that string and the parser
   chokes on the invisible character before param(), surfacing as a
   cascade of 'The assignment expression is not valid' errors at
   every param default value.

   File body is verified pure ASCII (no character above byte 127),
   so PS 5.1 with no BOM falls back to Windows-1252 decoding which
   is identical to ASCII for our content. Both install paths work:
     - 'irm ... | iex' (canonical one-liner)
     - 'powershell -File install.ps1' (programmatic / desktop bootstrap)

2. New -Commit and -Tag string params for reproducible pinning.

   Higher-precedence variants of -Branch. When set, the repository
   stage clones $Branch (fast partial fetch) and then 'git checkout's
   the exact ref. Precedence: Commit > Tag > Branch. Honoured by all
   three code paths:
     - Update path (existing valid checkout): fetch + checkout
       --detach <commit|tag> instead of checkout + pull.
     - Fresh clone: clone --branch $Branch, then post-clone
       'git checkout --detach' to the requested ref.
     - ZIP fallback: pick archive URL for the most-specific ref
       (commit -> archive/<sha>.zip, tag -> archive/refs/tags/
       <tag>.zip, else archive/refs/heads/<branch>.zip).

   Used by the Hermes desktop's first-launch bootstrap to pin the
   .exe to the exact commit it was built against, so the cloned
   Hermes Agent tree always matches what the .exe was tested with.
   Also enables release-bundle pinning (e.g. Microsoft Store builds
   pinning to a release tag) and CI reproducibility.

3. EAP=Continue wrap around the new pin-step git invocations.

   'git fetch origin <commit>' writes the routine 'From <url>' info
   line to stderr. Under the script's global $ErrorActionPreference
   = 'Stop' that stderr line is wrapped as an ErrorRecord and
   terminates the script even though fetch+checkout actually succeed.
   Same EAP=Stop + native-stderr footgun we hit during the install.ps1
   hardening pass in Install-Uv, Test-Python, _Run-NpmInstall.

   Wrap both the update-path fetch/checkout block AND the post-clone
   pin block in $ErrorActionPreference = 'Continue' (restored in
   finally). Real failures still caught by $LASTEXITCODE checks.

* fix: add default base_url_override for ollama-cloud provider

* chore(release): add AUTHOR_MAP entry for falasi

* feat(cli): add /update slash command to CLI and TUI (#23854)

* feat: add /update slash command to CLI and TUI

* test(cli): add Python tests for /update slash command

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(cli): address Copilot review for /update slash command

Route classic CLI /update through prompt_toolkit modal confirmation and
defer relaunch to the main-thread cleanup path after app.exit(). Tighten
Y/n semantics, add Python wrapper and catalog coverage tests, and assert
/update stays visible in the TUI command catalog.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(cli): address review feedback on /update command

- Replace raw input() with _prompt_text_input_modal in _handle_update_command
  to avoid EOF/hang/keystroke-leak races with prompt_toolkit's stdin ownership
- Fix confirmation logic: only proceed on recognized affirmative aliases
  (y/yes/1/ok); cancel on everything else including empty string, typos,
  and unrecognized input — matches all other [Y/n] prompts in the codebase
- Route relaunch through main-thread shutdown path: set _pending_relaunch
  and return False from process_command so process_loop triggers app.exit();
  run() then calls relaunch() after prompt_toolkit has restored terminal modes
  and after cleanup — safe on both POSIX (execvp) and Windows (subprocess+exit)
- Fix misleading docstring in test_update_command.py: the Vitest only covers
  the TypeScript slash handler that emits code 42, not the Python wrapper
  branch that acts on it
- Rewrite tests to use SimpleNamespace pattern (like test_destructive_slash_confirm)
  so _prompt_text_input_modal can be stubbed directly
- Add Python test for _launch_tui exit-code-42 → relaunch branch in main.py

Agent-Logs-Url: https://github.com/NousResearch/hermes-agent/sessions/f6da68cf-e7b1-4b7a-aed6-3d4b0f523bdb

Co-authored-by: austinpickett <260188+austinpickett@users.noreply.github.com>

* fix(cli): polish test fixtures for /update command

- Remove unused _prompt_text_input from SimpleNamespace stub
- Use pytest.fail sentinel in managed-install guard test to catch unexpected modal invocations

Agent-Logs-Url: https://github.com/NousResearch/hermes-agent/sessions/f6da68cf-e7b1-4b7a-aed6-3d4b0f523bdb

Co-authored-by: austinpickett <260188+austinpickett@users.noreply.github.com>

* chore: re-trigger CI after Copilot review fixes

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: austinpickett <260188+austinpickett@users.noreply.github.com>

* feat(skills): add baoyu-article-illustrator skill

* feat(skills): adapt baoyu-article-illustrator for Hermes

Adapts the upstream baoyu-article-illustrator skill (verbatim-copied in
the previous commit) to Hermes' tool ecosystem, matching the pattern
used by baoyu-infographic.

- Metadata: openclaw → hermes; add author, license, tags, category
- Triggering: slash command + CLI flags → natural language
- User config: remove EXTEND.md, first-time-setup, preferences-schema
- User prompts: AskUserQuestion (batched) → clarify (one at a time)
- Image gen: baoyu-imagine → image_generate (describe refs in prompt text)
- Platform: drop Windows/PowerShell; Linux/macOS only
- File ops: switch to write_file / read_file
- Watermark: opt-in per-article instead of EXTEND.md-driven
- Add PORT_NOTES.md describing the adaptation and sync procedure

Style, palette, and prompt/system.md reference files are verbatim copies
and are the sync points with upstream.

* fix(skills): align article-illustrator with real Hermes tool capabilities

Addresses review feedback on #13193:

1. Reference-image flow no longer assumes write_file/read_file handle
   binaries. vision_analyze produces a textual description; the binary
   is optionally copied via terminal (cp/curl). The description is what
   gets embedded in prompts.

2. image_generate's URL-only return is now explicit. Step 6 downloads
   the returned URL to local disk via terminal (curl -sSL -o ...), then
   verifies non-zero size before proceeding.

3. Removed "Please use nano banana pro..." line from prompts/system.md —
   the backend is user-configured and not agent-selectable, so routing
   hints in the prompt are misleading.

PORT_NOTES.md updated: prompts/system.md is no longer verbatim, and the
file-ops/backend-selection rows now reflect Hermes' actual tool surface
(write_file/read_file for text, terminal for binaries and URL downloads,
vision_analyze for reading images).

* chore(skills/baoyu-article-illustrator): tighten description, add platforms, regen docs

* chore(release): map Jack Yang contributor email

Adds the contributor email mapping for Jack Yang (@0xjackyang) so future
release-note generation attributes commits correctly.

Salvage of #27964 by @0xjackyang.

* chore(release): pre-stage AUTHOR_MAP for May 2026 LHF batch group 7

Pre-stages AUTHOR_MAP entries for 5 new contributors whose PRs are being
salvaged in the May 2026 low-hanging-fruit batch (group 7). Lands ahead
of the per-PR salvage PRs so they don't get blocked by AUTHOR_MAP CI.

Contributors:
- 02356abc (#28286 — wecom WSMsgType.CLOSING)
- burjorjee (#28201 — inline-shell timeout guard)
- oseftg (#28168 — natural response ending: emoji + caret)
- rudi193-cmd (#28241 — empty credential pool entries)
- sadiksaifi (#27982 — kanban horizontal scroll)

Per references/batch-pr-salvage-may14-additions.md.

* fix(wecom): handle WSMsgType.CLOSING to prevent CPU spin

The WeCom adapter's _read_events() loop only handled CLOSE, CLOSED,
and ERROR websocket message types. When the server initiates a graceful
shutdown, aiohttp returns WSMsgType.CLOSING before the connection is
fully closed. This message type was not handled, causing the receive()
call to return immediately in a tight loop while self._ws.closed
remained False. The result was 100% CPU usage on the asyncio event loop.

Add WSMsgType.CLOSING to the set of terminal message types that raise
RuntimeError("WeCom websocket closed"), allowing _listen_loop() to
enter its normal reconnect backoff path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(auth): treat empty credential pool entries as unauthenticated

Fixes #28140

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: include hermes_plugins in gateway.log component filter

gateway.log uses a _ComponentFilter that only passes records from
loggers starting with ('gateway',). Plugin modules are loaded under
the hermes_plugins.* namespace, so all plugin log output is silently
dropped from gateway.log.

This makes plugin registration — which directly affects gateway hooks
(pre_gateway_dispatch, transform_llm_output, etc.) — invisible in
the gateway-specific log. Operators debugging gateway behavior check
gateway.log and see no plugin activity, even when plugins are working
correctly.

Add 'hermes_plugins' to the gateway component prefixes tuple so
plugin log messages appear in gateway.log.

Closes #28138

* fix(gateway): align kanban artifact _IMAGE_EXTS with response dispatch

_deliver_kanban_artifacts used a broader _IMAGE_EXTS that included
.bmp, .tiff, and .svg. These three extensions are absent from the
equivalent set in _deliver_media_from_response (line 10661), which
intentionally routes them through send_document rather than
send_multiple_images (comment near line 10522 notes that Telegram
sendPhoto recompresses and rejects non-raster formats).

Routing .svg (XML text), .bmp, or .tiff through the photo API causes
send_multiple_images to raise on most platforms; the exception is caught
and logged as a warning, silently dropping the artifact. Aligning the
two sets ensures kanban deliverables with these extensions follow the
same send_document path as regular agent responses.

No behaviour change for .png/.jpg/.jpeg/.gif/.webp.

* fix(process-registry): detach stdin from background subprocesses to prevent keyboard freeze

Background process non-PTY path used stdin=subprocess.PIPE unconditionally,
creating an orphan pipe that was never written to and never closed. Child
processes that read stdin would block indefinitely, competing with the
parent's prompt_toolkit event loop for terminal ownership and causing
complete keyboard lockout.

Change to stdin=subprocess.DEVNULL so children get immediate EOF on stdin
reads instead of blocking forever. For interactive stdin, the PTY path
(which has its own independent PTY via ptyprocess.PtyProcess.spawn) should
be used instead.

Fixes #17959

* chore(release): alias stale-ID salvage commit for @LifeJiggy (#28317)

* fix(process-registry): detach stdin from background subprocesses to prevent keyboard freeze

Background process non-PTY path used stdin=subprocess.PIPE unconditionally,
creating an orphan pipe that was never written to and never closed. Child
processes that read stdin would block indefinitely, competing with the
parent's prompt_toolkit event loop for terminal ownership and causing
complete keyboard lockout.

Change to stdin=subprocess.DEVNULL so children get immediate EOF on stdin
reads instead of blocking forever. For interactive stdin, the PTY path
(which has its own independent PTY via ptyprocess.PtyProcess.spawn) should
be used instead.

Fixes #17959

* chore(release): alias stale-ID salvage commit for LifeJiggy

PR #28315 was salvaged with a wrong noreply numeric ID (192385615 vs
the correct 141562589). The commit on main is correctly authored to
LifeJiggy by username, but the noreply email doesn't match AUTHOR_MAP.
Adds an alias so release-notes generation maps both forms to the same
contributor.

---------

Co-authored-by: LifeJiggy <192385615+LifeJiggy@users.noreply.github.com>

* fix: elevate plugin discovery failures from debug to warning

Plugin discovery exceptions in gateway startup (gateway/run.py) and
CLI startup (hermes_cli/main.py) are caught and logged at DEBUG
level, making them invisible at the default INFO log level.

If any plugin import fails — syntax error, missing dependency, import
cycle — operators get zero indication unless they bump the log level
to DEBUG. This makes broken plugins appear enabled but silently
non-functional.

Change both locations to logger.warning() so failures are visible at
production log levels.

Closes #28137

* fix: treat inline-shell timeout guard as timeout

* fix(acp): resolve /tmp symlink before workspace auto-approve check on macOS

Path.resolve() follows the /tmp -> /private/tmp symlink on macOS, so
str(path).startswith("/tmp/") is always False for temp-dir paths.
The "Accept Edits" (workspace_session) mode silently refused to
auto-approve every /tmp write on macOS, breaking the documented
behaviour and making the existing test fail on this platform.

Fix: keep the raw expanded path (pre-resolve) for the /tmp prefix
check and continue using the resolved form only for the cwd
relative_to() call where symlink resolution is correct behaviour.

* fix(kanban): single-row horizontal scroll for board columns

Switch .hermes-kanban-columns from auto-fit CSS grid to a flex row with
overflow-x: auto and a hidden scrollbar (scrollbar-width / ::-webkit-
scrollbar), and pin .hermes-kanban-column to flex: 0 0 280px so columns
sit side-by-side at a fixed width instead of wrapping into a 2xN grid.

Page vertical scroll is unaffected: each column already caps at
max-height: calc(100vh - 220px), so the container never grows tall
enough to introduce its own vertical scrollbar.

* fix(approval): surface pending-approval state with explicit marker visible to LLM

When a tool call requires user approval in the non-blocking gateway path,
the LLM previously received a result that was indistinguishable from a
failed tool call (exit_code=-1, error=message). The LLM could not tell
whether the tool was pending approval, had returned empty results, or had
failed silently — causing it to burn context on wrong hypotheses.

Fix changes the result format to include:
- status: pending_approval (clear state name)
- approval_pending: True (explicit boolean for LLMs to detect)
- error: cleared to empty string (removes misleading error signal)

This lets the LLM reason about approval latency vs actual errors,
short-circuiting the previous silent failure mode.

Fixes #14806

* fix: recognize emoji and caret as natural response endings

GLM models via Ollama report finish_reason='stop' even when the
response was truncated by max_tokens. The continuation mechanism
uses _has_natural_response_ending() as one of the heuristics to
detect whether the response was genuinely finished.

Currently only ASCII punctuation and CJK punctuation are recognized.
This means any response ending with an emoji (e.g. ⚡, 👍) or the
caret character ^ (common in French ^^ smiley) is not recognized as
naturally ended, triggering a false-positive continuation where the
model receives 'Continue where you left off' and produces garbled
output.

Add:
- ^ (caret) to the punctuation set
- Unicode emoji range (codepoint >= 0x1F300) as natural ending

This only affects GLM/Ollama users but the fix is safe for all
backends since _has_natural_response_ending() is only consulted
inside the continuation flow.

* chore(release): pre-stage AUTHOR_MAP for May 2026 LHF batch group 8 (#28328)

Pre-stages AUTHOR_MAP entries for 10 new contributors whose PRs are being
salvaged in the May 2026 low-hanging-fruit batch (group 8). Lands ahead
of the per-PR salvage PRs so they don't get blocked by AUTHOR_MAP CI.

Contributors:
- AceWattGit (#28159 — _pool_may_recover_from_rate_limit NameError)
- YuanHanzhong (#28032 — x.com/status fallbacks link-like)
- colin-chang (#28245, #28249, #28251 — gateway + mattermost fixes)
- felix-windsor (#28019 — preserve cron asterisks in strip mode)
- houenyang-momo (#28205 — charizard completion menu contrast)
- iqdoctor (#28095 — windows installer docs)
- joe102084 (#28151 — whitespace-only cron responses)
- jvinals (#27936 — Slack U-IDs → DM channel)
- maxmilian (#28267 — ModelPickerDialog portal)
- samggggflynn (#27952 — dingtalk pre_start)

Per references/batch-pr-salvage-may14-additions.md.

* fix: add pre_start() to _IncomingHandler for dingtalk SDK compatibility

The dingtalk-stream SDK calls pre_start() on every registered handler
before opening the WebSocket connection. Without this method, the SDK
raises AttributeError and kills the stream connection, causing DingTalk
to be unable to connect via Stream Mode.

* fix(windows): handle redirected stdout in _cprint fallback

Wraps _pt_print in try/except with a print() fallback. When a
kanban worker's stdout is piped to a log file, prompt_toolkit
raises NoConsoleScreenBufferError (Windows) or OSError (other)
because there is no real console buffer. The fallback keeps
worker output flowing instead of crashing.

* chore(release): alias stale-ID salvage commit for @Grogger (#28334)

PR #28330 was salvaged with a wrong noreply numeric ID (18091625 vs
the correct 7065068). The commit on main is correctly authored to
Grogger by username, but neither noreply form was in AUTHOR_MAP.
Adds both so release-notes generation maps them to @Grogger.

* fix(aux): remove stale session_search model menu entry

* fix(tui): keep x status citation fallbacks link-like

* fix(xai-oauth): quarantine dead tokens on terminal refresh failure

resolve_xai_oauth_runtime_credentials() called _refresh_xai_oauth_tokens()
with no try/except. A terminal refresh failure (HTTP 400/401/403 —
invalid_grant, token revoked) propagated without clearing the dead
access_token / refresh_token from auth.json, causing every subsequent
session to retry the same doomed network request.

Add a try/except around the refresh call that mirrors the existing
credential_pool.py quarantine: when _is_terminal_xai_oauth_refresh_error
identifies a non-retryable failure, clear the dead token fields from
auth.json and write a last_auth_error diagnostic marker so future calls
fail fast with a clear relogin_required error instead of hitting the
network.

active_provider is preserved (set_active=False) so multi-provider users
whose chosen provider is not xai-oauth are unaffected.

Tests: two new cases in test_auth_xai_oauth_provider.py cover terminal
quarantine and transient pass-through.

* feat(bg-review): add bundled/pinned skill protection rules to review prompts (#27644)

The background review prompts (_SKILL_REVIEW_PROMPT and
_COMBINED_REVIEW_PROMPT) now include explicit protection rules
for bundled, hub-installed, and pinned skills — aligning with
the curator's existing policy at curator.py L345/350.

Before this change, bg-review could freely rewrite bundled skills
like 'hermes-agent' or pinned skills, while the 7-day curator
explicitly skips them.

The review agent now sees:
  • Bundled skills (shipped with Hermes)
  • Hub-installed skills (installed via hermes skills install)
  • Pinned skills (marked via hermes curator pin)
If only protected skills need updating, the review says
'Nothing to save.' and stops.

Fixes #27644

* fix(web): portal Change Model modal so it renders above the app sidebar

The dashboard's main column is `relative z-2` (App.tsx), which creates a
stacking context that traps fixed descendants below the app sidebar
(`z-50`). `ModelPickerDialog` renders `fixed inset-0 z-[100]` inline,
so its z-100 is scoped to z-2 and the sidebar covers its left edge.

The bug is visible across all themes but only obvious in the Large theme
variants (Hermes Teal (Large), etc.) where the larger root font widens
the dialog into the sidebar's column. Toast.tsx already documents the
same trap and uses the same `createPortal(..., document.body)` escape.

This commit ports the picker; the same pattern affects other inline
z-[100] modals in the dashboard (OAuthLoginModal, Cron / Models /
Profiles page modals) and is left for a follow-up — keeping this PR
scoped to the reporter's specific case.

Fixes #28103

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(gateway): exit code 75 on service restart so launchd relaunches

When the gateway receives SIGUSR1 (graceful restart via launchd_restart),
the SIGUSR1 handler calls request_restart(via_service=True) and the
gateway shuts down cleanly with exit code 0.

However, the generated launchd plist uses KeepAlive → SuccessfulExit →
false, meaning launchd only relaunches on *non-zero* exit codes.  A
clean exit(0) is treated as "successful, don't restart", so the
gateway stays down after /restart, /update, or SIGUSR1.

The systemd unit template already uses RestartForceExitStatus=75 for the
same scenario.  Mirror that convention: when _restart_via_service is
True, raise SystemExit(75) so launchd's SuccessfulExit=false policy
triggers a relaunch.

Closes #28135

* fix: guard json.loads() against invalid TTS and skill_view responses

Two code paths call json.loads() on output from external tools without
catching JSONDecodeError. If the tool returns a non-JSON string (error
message, empty string, or None), the entire call path crashes.

1. gateway/run.py — text_to_speech_tool() result in voice reply path.
   A TTS failure that returns an error string instead of JSON crashes
   the voice reply handler, killing the message response entirely.

2. cron/scheduler.py — skill_view() result when loading skills for
   cron jobs. A corrupted or missing skill file that returns an error
   string instead of JSON crashes the cron tick, preventing all jobs
   from executing that cycle.

Both fixes catch (json.JSONDecodeError, TypeError), log a warning,
and gracefully skip the failed operation instead of crashing.

* fix(gateway): bridge gateway_restart_notification from YAML platform sections

Two related bugs in gateway/config.py prevented per-platform
gateway_restart_notification from working through config.yaml:

1. The shared-key bridging loop (load_gateway_config) omitted
   'gateway_restart_notification', so the key never landed in
   platform_data['extra'] even when set under e.g. 'discord:' or
   'mattermost:' sections.

2. PlatformConfig.from_dict() only read gateway_restart_notification
   from the top-level data dict, ignoring the 'extra' sub-dict where
   bridged keys are stored.

Fix: add the key to the bridging loop, and add an 'extra' fallback
in from_dict() so that round-tripped values (YAML → bridged → extra
→ from_dict) resolve correctly.

Impact: users can now set gateway_restart_notification: false per
platform in config.yaml instead of relying on env vars or the
global platforms: block.

* feat(kanban): add auto_promote_children config toggle

When the kanban auto-decomposer fans a triage task into child tasks,
recompute_ready() immediately promotes parent-free children to 'ready'
so the dispatcher picks them up. Some users want a manual workflow
where children stay in 'todo' for review before dispatch.

Add 'kanban.auto_promote_children' config key (default: true):
- false: children stay in 'todo' after decomposition
- true: existing behavior (auto-promote to 'ready')

Changes:
- kanban_db.py: decompose_triage_task() gains auto_promote param
- kanban_decompose.py: reads auto_promote_children from config
- kanban dashboard API: exposes the new setting in GET/PUT /orchestration

Closes #28016

* fix: wrap _pool_may_recover_from_rate_limit call through run_agent namespace

The conversation_loop.py references _pool_may_recover_from_rate_limit which
was defined in run_agent.py. After the conversation-loop extraction refactor,
the helper was no longer in the same module scope. Wrap the call as
_ra()._pool_may_recover_from_rate_limit() to route through the run_agent
monkeypatch namespace where the helper is available.

Adds regression test in test_gemini_fast_fallback.py.

Fixes: MAILROOM Email Triage NameError, OPS Execution Monitor NameError.

* fix(tui): improve charizard completion menu contrast

* docs(windows): avoid piping installer directly into iex

* fix(agent): add qwen and deepseek to TOOL_USE_ENFORCEMENT_MODELS

Qwen3.x and DeepSeek-V3.x default to chatty/hallucinatory tool use without
enforcement steering — agents narrate "calling tool X" without actually
emitting a tool call, or run partial loops. Both model families fit the
same failure pattern TOOL_USE_ENFORCEMENT_GUIDANCE was already injected
for (gpt, codex, gemini, gemma, grok, glm).

Co-authored-by: briandevans <252620095+briandevans@users.noreply.github.com>

Squashed salvage of:
- 403e567ce fix(agent): add qwen and deepseek to TOOL_USE_ENFORCEMENT_MODELS
- 9433eabe7 test(agent): use realistic qwen-plus identifier in enforcement test

Fixes #28079.

* fix(send_message): resolve Slack user IDs to DM channel IDs

The _SLACK_TARGET_RE regex only matched IDs starting with C (channel),
G (group), or D (direct message). Slack user IDs start with U, causing
'Could not resolve' errors when trying to send DMs to specific users.

Changes:
- Expand _SLACK_TARGET_RE to accept U-prefixed IDs (user IDs)
- Add conversations.open fallback to resolve user IDs to DM channel
  IDs before sending, since chat.postMessage requires a conversation ID

Fixes #ISSUE_NUMBER

* fix(gateway): tighten MEDIA extraction regex + silent skip on file-not-found

Three related fixes for the MEDIA:<path> extraction pipeline that
caused 'file not found' noise in platform channels:

1. run.py — tighten tool-result MEDIA regex from \S+ (any non-
   whitespace) to require a path pattern with known extensions.
   Prevents LLM-generated placeholder paths like
   'MEDIA:/path/to/example.mp4' from being captured as real media.

2. base.py — remove the |\S+ fallback in extract_media() that
   catches anything non-whitespace as a potential MEDIA path.
   This was the primary cause of false positives — strings like
   '' in tool output were captured as MEDIA: paths.

3. mattermost.py — replace the file-not-found error message sent
   to the channel with a silent logger.warning() skip. When a
   path extracted by MEDIA doesn't exist on disk, the channel
   no longer gets a noisy '(file not found: ...)' message.

Impact: eliminates the persistent 'file not found' spam in
Mattermost channels caused by over-broad MEDIA regex patterns
matching non-path text in tool output.

* fix(xai-oauth): split 403 (tier/entitlement) from 400/401 in token endpoint

xAI's token endpoint returns HTTP 403 to the OAuth grant when the
account isn't on the allowlist for API access (e.g. standard
SuperGrok subscribers — see #26847). Treating it like a stale-token
400/401 made ``format_auth_error`` append "Run ``hermes model`` to
re-authenticate", which is misleading because re-login can't change
xAI's tier decision.

Split 403 off in both ``refresh_xai_oauth_pure`` and the loopback
login token exchange:

* New error code ``xai_oauth_tier_denied`` with ``relogin_required=False``
* Message explains the entitlement gate and points at the
  ``XAI_API_KEY`` + ``provider: xai`` fallback
* 400/401 still set ``relogin_required=True`` as before
* 5xx still set ``relogin_required=False`` as before

* fix(run-agent): treat any 403 on xai-oauth as entitlement to stop refresh-loop

The existing ``_is_entitlement_failure`` heuristic only fires when
the response body contains specific substrings ("do not have an
active Grok subscription", etc.). xAI has been seen to 403 standard
SuperGrok subscribers with a terser body that doesn't match those
keywords (#26847), and the recovery path would then mint a fresh
token, get a fresh 403, and loop until Ctrl+C.

Add a defense-in-depth check at the recovery call site: any 403 on
``provider == "xai-oauth"`` short-circuits ``try_refresh_current``
so the error surfaces immediately with the friendly hint from
``_summarize_api_error``. Keeps the existing keyword path for all
other providers untouched.

* test(xai-oauth): pin tier-denied 403 behavior + docs warning for #26847

Tests:

* ``test_refresh_xai_oauth_pure_403_marked_tier_denied_not_relogin`` —
  refresh-403 raises ``xai_oauth_tier_denied`` with
  ``relogin_required=False`` and the API-key fallback hint in body.
* ``test_format_auth_error_tier_denied_does_not_suggest_relogin`` —
  the renderer does not append "Run ``hermes model``" for the new
  code.
* ``test_recover_with_credential_pool_skips_refresh_on_bare_403_for_xai_oauth`` —
  bare ``{"reason":"forbidden","message":"Forbidden"}`` body (which
  does not match the existing keyword heuristic) still short-circuits
  ``try_refresh_current`` on xai-oauth.

Docs:

* Drop the "(any active tier)" claim from the xai-grok-oauth guide,
  add a top-of-page warning callout, and a Troubleshooting section
  for the 403-after-login case pointing at ``XAI_API_KEY`` +
  ``provider: xai`` as the documented fallback.

* fix: handle whitespace-only cron responses

* fix(cli): preserve cron asterisks in strip mode

* fix(mattermost): resolve thread root_id and route progress to threads

Two Mattermost thread-related bugs:

1. _resolve_root_id() — Mattermost CRT requires root_id to be the
   thread root post. Using any reply's own ID as root_id causes
   '400 Invalid RootId'. Add _resolve_root_id() that walks up the
   post chain via API to find the actual root, and apply it in
   send(), _send_url_as_file(), and _send_local_file().

2. _progress_reply_to — The condition in run.py only checked
   Platform.FEISHU, missing Mattermost entirely. This caused tool
   progress messages to always land in the main channel instead of
   the thread. Add Platform.MATTERMOST to the condition so
   progress messages are routed to threads when reply_mode=thread.

Impact: Tool progress messages now appear in Mattermost threads
instead of flooding the main channel; thread replies no longer
fail with Invalid RootId when the reply target is itself a reply.

* feat(kanban): archive --rm to hard-delete archived tasks

Salvages #19964 by @Beandon13. Adds `hermes kanban archive --rm` to
permanently remove already-archived tasks with cascading cleanup of
links, comments, events, runs, and notify-subs. Safety guard: only
archived tasks can be deleted; active/blocked/done must be archived
first.

Cherry-picked from #19964 onto current main (severe stale base, applied
manually to preserve substance only).

* feat(proxy): add xai upstream adapter for Grok via OAuth

* chore(release): map @yannsunn for PR #28064 xai proxy adapter salvage

* docs(skill): align kanban dispatcher failure_limit text with current default

* fix(oauth): add manual-paste fallback for browser-only remote consoles

xAI Grok OAuth (and Spotify) use a loopback redirect to
``http://127.0.0.1:<port>/callback`` to capture the authorization
code. That works when the browser and Hermes run on the same
machine, and the SSH tunnel recipe handles the regular remote
case. It breaks completely on **browser-only remote consoles**
(GCP Cloud Shell, GitHub Codespaces, AWS EC2 Instance Connect,
Gitpod, Replit, …) where the user has a browser but no real SSH
client to forward a port — the redirect to 127.0.0.1 on the
remote VM simply isn't reachable from the laptop, and there's
nothing the existing flow can do about it (#26923).

This commit adds the foundation for a manual-paste fallback:

* ``_is_remote_session`` now also recognises Cloud Shell,
  Codespaces, Gitpod, Replit, StackBlitz (in addition to SSH),
  so the existing tunnel hint at least fires in those
  environments.
* ``_parse_pasted_callback`` accepts any of: a full
  ``http(s)://...?code=...&state=...`` URL, a bare ``?code=...``
  query string, a bare ``code=...&state=...`` fragment, or a
  bare opaque code value.  Returns the same dict shape the HTTP
  callback handler produces, so the caller's state / error
  validation works unchanged (no CSRF bypass).
* ``_prompt_manual_callback_paste`` reads stdin with a clear
  multi-line explanation of what's happening and what to paste.
* ``_xai_oauth_loopback_login`` gains a ``manual_paste`` kwarg
  that skips the HTTP listener entirely.  The redirect_uri,
  PKCE verifier, state, and nonce are byte-identical to the
  loopback path so xAI's token endpoint can't tell the
  difference at the protocol level.
* ``_print_loopback_ssh_hint`` now also mentions
  ``--manual-paste`` so users without a real SSH client see a
  path forward instead of a dead-end tunnel recipe.
* ``_login_xai_oauth`` threads ``args.manual_paste`` into the
  loopback helper.

* feat(cli): wire --manual-paste into ``hermes auth add`` and ``hermes model``

Register the new ``--manual-paste`` flag on both entry points and
thread it through to the xAI loopback login:

* ``hermes auth add xai-oauth --manual-paste`` — pool-add path,
  forwarded inside ``auth_commands.handle_auth_add``.
* ``hermes model --manual-paste`` — model-picker path, forwarded
  by ``_model_flow_xai_oauth`` into the synthetic ``argparse.Namespace``
  it passes to ``_login_xai_oauth``.  The picker also now forwards
  ``--no-browser`` and ``--timeout`` for consistency (previously
  hardcoded to defaults regardless of CLI flags).

Help text on both flags points at #26923 and names the
browser-only remote consoles (Cloud Shell, Codespaces, EC2
Instance Connect) so users searching ``hermes --help`` can find
the workaround.

* test+docs(oauth): pin manual-paste semantics and document browser-only path (#26923)

Tests (``tests/hermes_cli/test_auth_manual_paste.py``):

* 9 parametrised + scalar cases for ``_is_remote_session`` covering
  the new Cloud Shell / Codespaces / Gitpod / Replit / StackBlitz
  env vars (plus the existing SSH ones).
* 9 cases for ``_parse_pasted_callback`` covering every paste form
  (full URL, https URL with extra params, bare ``?code=...``, bare
  ``code=...`` fragment, bare opaque value, error+description,
  empty, whitespace-only, malformed URL).
* 3 cases for ``_prompt_manual_callback_paste`` (happy path, EOF,
  Ctrl-C).
* 3 end-to-end ``_xai_oauth_loopback_login(manual_paste=True)``
  cases: the HTTP server MUST NOT be started (asserted via a
  callable that raises if invoked), wrong state still rejected
  with ``xai_state_mismatch`` (no CSRF bypass), and empty paste
  surfaces ``xai_code_missing``.
* SSH-hint mention test ensures the ``--manual-paste`` instruction
  is printed in the remote-session hint.

Docs:

* ``oauth-over-ssh.md`` — new "Browser-only remote (Cloud Shell /
  Codespaces / EC2 Instance Connect)" section with the
  ``--manual-paste`` recipe, plus a TL;DR note for the new flag.
* ``xai-grok-oauth.md`` — short subsection pointing at the same
  recipe and the OAuth-over-SSH guide anchor.

* docs(kanban): document max-retries task override

* docs(kanban): document inline create shortcuts

* test(kanban): cover default board dashboard pin

* docs: ignore box diagrams in ascii guard

Wrap existing box-drawing diagrams with ascii-guard markers so docs-site checks pass when website docs are touched.

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat: per-task model override for kanban workers

- Add model_override field to Task class and tasks schema
- Add migration for existing databases
- Spawn worker with -m model when model_override is set

* test(kanban-dashboard): cover _task_dict task_age fallback

The fix in 061a1830 added an outer try/except in plugin_api._task_dict
so that a future failure mode in kanban_db.task_age (anything _safe_int
doesn't already absorb) cannot 500 the GET /board response. The
_safe_int / task_age corruption paths got regression coverage in
tests/hermes_cli/test_kanban_db.py, but the OUTER fallback contract
remained untested -- meaning a refactor that drops the try/except would
not be caught by CI.

Pin that contract from both consumers of _task_dict:
- GET /board returns 200 with the literal fallback age dict for the
  affected card (other cards continue to render via the same path)
- GET /tasks/:id (drawer view) returns 200 with the same fallback,
  so a single corrupt task can't block its own drawer

Both tests force task_age to raise RuntimeError rather than ValueError
on '%s', because ValueError is absorbed by _safe_int and never reaches
the outer try/except -- testing that path would only re-cover what
test_kanban_db.py already pins.

Manually verified the regression discipline:
  git checkout 061a1830^ -- plugins/kanban/dashboard/plugin_api.py
  pytest -k task_age_exception        # both FAIL with 500
  git checkout HEAD -- plugins/kanban/dashboard/plugin_api.py
  pytest -k task_age_exception        # both PASS

* fix(kanban): clear _INITIALIZED_PATHS in remove_board so recycled DBs re-init schema

Archiving or deleting a board via remove_board() leaves the path's
"schema already initialized" entry in the module-level cache. A
concurrent connect(board=<slug>) call (e.g. the dashboard event-stream
poll loop) then:

  1. resolves the same kanban.db path,
  2. recreates the directory + an empty sqlite file because
     connect() does mkdir(parents=True, exist_ok=True),
  3. skips the CREATE TABLE pass because the cache entry says the
     schema is already in place,
  4. errors on the next read with `no such table: task_events`.

Drop the cache entry before mutating the filesystem so the fresh file
gets a proper schema init on next connect(). Applies to both
archive=True (rename) and archive=False (rmtree) branches.

Fixes #23833.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(web): add Cache-Control: no-store to plugin static file serving

Prevents browser caching of stale dashboard plugin JS files that may
contain bugs already fixed upstream (e.g. COLUMN_LABEL undefined).

* fix(kanban): seed bundled skills (e.g. kanban-worker) on kanban init

Closes #23725

* fix(kanban): ignore stale HERMES_KANBAN_BOARD for removed boards

* fix(kanban): keep board-management commands independent from board override

* fix(kanban): preserve notifier_profile for dashboard home subscriptions

* fix(kanban): promote dependents when a parent is archived

* fix(cli): make kanban specify max_tokens configurable

* fix(kanban): sync slash subcommands with live parser

* fix(kanban): promote blocked tasks when parent dependencies complete

recompute_ready only scanned 'todo' tasks for promotion, ignoring
'blocked' tasks entirely. When a task was blocked (e.g. by the circuit
breaker) and its parent dependencies later completed, the task stayed
stuck in 'blocked' forever unless manually unblocked.

Now recompute_ready also scans 'blocked' tasks. When all parents are
done/archived, the blocked task is promoted to 'ready' with failure
counters reset — equivalent to an automatic unblock.

Includes a regression test for the blocked-parent-done promotion path.

* fix(kanban): use 'is not None' check for max_runtime_seconds in create_task

max_runtime_seconds=0 was being silently coerced to None due to a falsy
check (if max_runtime_seconds). Zero is a valid value that causes the
dispatcher to immediately time out a task. The adjacent max_retries
parameter already used the correct 'is not None' pattern.

Fixes the inconsistency by aligning max_runtime_seconds with max_retries.

* fix(kanban): reset failure counters on unblock_task

When a task is manually unblocked (blocked → ready/todo), the
consecutive_failures counter and last_failure_error were left intact.
The next failure would immediately re-trip the circuit breaker because
the counter was still at or above the failure limit.

Reset both fields on unblock so the task gets a fresh retry budget.

Includes a regression test that verifies counters are zeroed.

* fix(kanban): fingerprint crash errors to prevent fleet-wide retry exhaustion

When a systemic failure (provider outage, auth expiry, OOM) crashes
multiple workers simultaneously, detect_crashed_workers increments
each task failure counter independently. The circuit breaker only
trips after N × failure_limit retries across the fleet.

Fingerprint crash errors by normalizing host-specific details (PIDs,
timestamps). When 3+ tasks crash with the same fingerprint in a
single detection cycle, immediately trip the circuit breaker
(failure_limit=1) instead of waiting for repeated failures.

Isolated crashes (unique fingerprints) retain their normal retry
budget. Protocol violations continue to trip immediately.

Includes regression tests for systemic and isolated crash paths.

* fix(kanban): align board_exists with board discovery rules

* fix(kanban): demote ready children when a parent is reopened

* fix(kanban): serialize DB initialization

* fix(kanban): task_age() tolerates ISO-8601 timestamps

Prevents ValueError crash in dashboard get_board() when a task has
an ISO timestamp (e.g. "2026-05-10T15:00:00Z") instead of a unix epoch
int. Adds _to_epoch() helper that normalises both formats.

* Fix Kanban dashboard initial board selection

* fix(kanban): persist worker session metadata on completion

Salvages #25579 by @wesleysimplicio. Stamps task_runs.metadata.worker_session_id
from HERMES_SESSION_ID on kanban_complete. Cherry-picked the substantive
commit (not the AUTHOR_MAP fixup tip) onto current main.

* fix(kanban): make claim ttl configurable

Co-Authored-By: Paperclip <noreply@paperclip.ing>

* fix(kanban): pass accept-hooks to worker chat subprocess

* feat(kanban): add board-level default workdir (#25430)

* docs(kanban-worker): document notification routing configuration

* fix(kanban): preserve worker tools with restricted toolsets

* fix(kanban): make legacy task migration idempotent

(cherry picked from commit 293f1c3a7241b0117669e049d9aa746c9645ac90)

* fix: harden Kanban worker Hermes command resolution

* feat(kanban): allow trimmed task comments

SS-1647 live SHIP validation: real code + tests for kanban comment --max-len.

* fix: show scheduled kanban tasks in dashboard

* fix: assign single-task kanban decompositions

* fix(kanban-dashboard): make Orchestration mode checkbox label static

The checkbox label echoed its state ("Auto (default)" / "Manual") instead
of describing the action, so a checked box reading "Auto" parsed as a
status indicator rather than a control. The accompanying sub-description
was also static and started with "When on, ...", which read awkwardly
when the box was unchecked.

Replace the dynamic label with a static action label
("Auto-decompose triage tasks") and flip the sub-description between the
two modes so it stays accurate either way. The top-of-page Orchestration
pill is unchanged — that one is intentionally a status badge / toggle.

Fixes #28178

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(env): add HERMES_KANBAN_DISPATCH_IN_GATEWAY override (#21956)

Salvages the env-vars docs portion of #21956 by @Bartok9.
The ascii-guard-ignore tags from the original PR already landed on main.

* fix(kanban): close sqlite connection on init failure to prevent fd leak

Salvages #28301 by @Ade5954. If WAL setup, PRAGMA application, or schema
init raises after sqlite3.connect() succeeds, the new connection was
leaking. Wrap the body in try/except so the connection is closed before
the exception propagates.

* fix(kanban): don't crash dispatched workers when kanban-worker skill is absent

Salvages #27372 by @oemtalks. The dispatcher unconditionally injected
`--skills kanban-worker` into every worker spawn, but worker profiles
sometimes don't have that bundled skill in their skills dir, which is
fatal at CLI startup (`ValueError: Unknown skill(s): kanban-worker`).

Adds `_kanban_worker_skill_available(hermes_home)` and only injects the
flag when the skill resolves. The MANDATORY lifecycle still ships via
KANBAN_GUIDANCE in the system prompt, so omitting the flag is safe.

* fix(packaging): ship dashboard plugin assets in wheel

Salvages #23737 by @LeonSGP43. Adds plugins/* manifest.json and dist/
glob entries to setuptools package-data so wheel installs ship the
bundled dashboard plugin assets (kanban, achievements, etc.). Without
these, /api/dashboard/plugins can't discover plugin assets outside a
source checkout.

* docs(kanban): document worker protocol auto-blocks

Salvages #21585 by @helix4u. Documents the protocol_violation event
(worker exits successfully while task is still running), adds
--max-retries to the create flag list and --failure-limit to dispatch.

* fix(oneshot): pass fallback_providers from profile config to AIAgent

Salvages #23368 by @uzunkuyruk. Oneshot workers (e.g. kanban workers
spawned via 'hermes -p <profile> chat -q ...') were not honouring the
profile's fallback_providers / fallback_model chain because oneshot.py
never read the config and never passed fallback_model= to AIAgent.

Reads cfg.get('fallback_providers') (new list format) or
cfg.get('fallback_model') (legacy single-dict) with the same
normalization cli.py applies, then forwards as fallback_model=_fb.

* fix(kanban): reject direct running transitions in dashboard bulk updates

Salvages #24050 by @kronexoi. The single-task PATCH already rejects
direct status='running' since it bypasses the dispatcher/claim invariant,
but the bulk-update endpoint still accepted it. Aligns bulk with single
by emitting an error result row for any 'running' entry.

* feat(kanban): add initial-status for human-ops cards

Salvages #27526 by @shunsuke-hikiyama. Adds an --initial-status flag
(running|blocked, default running) to 'kanban create', threaded through
kanban_db.create_task() and the kanban_create tool schema. 'blocked'
parks the task directly in the blocked column for R3 human-ops review,
skipping the brief running-to-blocked transition.

Dropped the unrelated 'add' alias, WIFEXITED Windows compat, and
slash-handler error formatting changes that were bundled in the
original PR — those should ship as their own focused changes if still
wanted.

* fix(kanban): release scratch workspace and tmux session on task completion

Salvages #27369 by @LeonJS. complete_task() now calls _cleanup_workspace()
and _cleanup_worker_tmux() after marking a task complete.

Scratch workspaces (used by swarm agents) accumulate on disk — hundreds
of MB per task, never released. Stale tmux sessions from completed
agents also persist indefinitely.

Both gates are safe:
- workspace_kind == 'scratch' gate preserves user worktree/dir workspaces
- tmux #{pane_dead} == 1 gate only kills sessions where the worker has
  already exited
- best-effort: cleanup failures never block task completion

* fix(kanban): honor severity thresholds in diagnostics

Salvages #26431 by @LeonSGP43. Dashboard plugin_api list_diagnostics
was using exact-match (severity == filter), so '--severity warning'
hid 'error' and 'critical' diagnostics. Adds severity_at_or_above()
helper to kanban_diagnostics and uses it in the dashboard endpoint
(CLI already used SEVERITY_ORDER comparison correctly).

* test: isolate Kanban env pins in hermetic fixture

Salvages the substantive part of #22295 by @steezkelly. Adds the
missing HERMES_KANBAN_HOME, HERMES_KANBAN_RUN_ID, HERMES_KANBAN_CLAIM_LOCK,
HERMES_KANBAN_DISPATCH_IN_GATEWAY entries to _HERMES_BEHAVIORAL_VARS so
ambient developer-shell pins on those vars don't bleed into pytest runs.

The frozenset extraction + standalone regression test from the original
PR were dropped to keep the change minimal — main already maintains the
list inline.

* feat(kanban): add max_in_progress config to cap concurrent running tasks

Salvages #22981 by @SimbaKingjoe. Adds 'kanban.max_in_progress' config
that caps simultaneously running tasks. When the board already has N
running, dispatcher skips spawning so slow workers (local LLMs,
resource-constrained hosts) don't pile up and time out.

Threads through dispatch_once(max_in_progress=) and gateway dispatcher
config parsing with validation (warns on invalid/below-1 values).

* fix(packaging): ship bundled skills in wheel

Salvages #23738 by @LeonSGP43. Wheel installs were missing skills/ and
optional-skills/ because pyproject's [tool.setuptools.packages.find]
only includes Python packages — the skills directories don't have
__init__.py so they were silently dropped from the wheel.

Adds setup.py with data_files spec emitting skills/* and optional-skills/*
under hermes_agent-<v>.data/data/, and a get_bundled_skills_dir() helper
in hermes_constants that discovers the wheel-installed location via
sysconfig before falling back to a source-checkout path. tools/skills_sync
uses the helper so 'hermes update' works for pip-installed users.

* fix: 4 small surgical bugs

Salvages #23302 by @Bartok9. Four independent one-area fixes:

1. kanban boards delete alias now hard-deletes (not archives) — the
   alias didn't carry --delete, so getattr(args, 'delete', False)
   returned False. Detect boards_action=='delete' explicitly.
2. Gateway auto-title failures no longer leak as user-visible
   warnings — debug-log only since they're not actionable.
3. Background process completion notification snaps truncation to
   the next newline boundary, prepends a marker when content is
   dropped.
4. _cprint() schedules the run_in_terminal coroutine via
   asyncio.ensure_future so output isn't silently dropped from
   background threads (fixes #23185 Bug A). Skips the
   double-print fallback that would fire for mock paths.

* perf(prompt): cache kanban worker guidance at session init

Salvages #24402 by @RyanRana. The KANBAN_GUIDANCE block (~835 tokens)
is session-static — the dispatcher decides at spawn time whether the
process is a kanban worker via the kanban_show tool's check_fn (gated
on HERMES_KANBAN_TASK env var). Re-checking 'kanban_show' in
valid_tool_names and re-loading the reference on every system-prompt
rebuild (init + each context compression) is wasted work.

Caches the resolved string on agent._kanban_worker_guidance once in
agent_init and consumes it in system_prompt.build_system_prompt(),
with a getattr fallback for code paths that bypass agent_init.

* feat(kanban): add --sort option to 'hermes kanban list'

Salvages #25745 by @LizerAIDev. Adds --sort {created,created-desc,
priority,priority-desc,status,assignee,title,updated} to 'hermes kanban
list'. Validated against VALID_SORT_ORDERS map; invalid values raise
ValueError. Default behaviour (priority DESC, created ASC) is unchanged
when --sort is omitted.

* docs: add kanban codex lane skill

* feat(kanban): worker visibility endpoints (workers/active, runs/{id}, inspect)

Adds three read-only endpoints to the kanban dashboard plugin so the
SwitchUI workspace (and any other dashboard consumer) can track
workers across tasks without N+1 round-trips through /tasks/{task_id}.

- GET /workers/active
  Single SQL JOIN of task_runs + tasks where ended_at IS NULL,
  worker_pid IS NOT NULL, status='running'. Returns
  {workers: [...], count, checked_at}.

- GET /runs/{run_id}
  Direct lookup of any task_run row by id. Reuses existing
  kanban_db.get_run() helper and _run_dict() serialiser. 404 when
  not found. Mirrors GET /tasks/{task_id} 404 shape.

- GET /runs/{run_id}/inspect
  Live PID stats via psutil.Process.as_dict() — cpu_percent,
  memory_rss_bytes, memory_vms_bytes, num_threads, num_fds, status,
  create_time, cmdline. Short-circuits with alive:false when run
  has ended, has no worker_pid, the pid is gone, or psutil is
  unavailable. AccessDenied surfaces as alive:true with error
  rather than a 500.

11 new tests in tests/plugins/test_kanban_worker_runs.py cover the
empty-board case, running-task case, ended-run filtering,
missing-pid filtering, 404 paths, already-ended inspect, no-pid
inspect, dead-pid inspect, and live-pid inspect (psutil mocked).
All pass.

Companion termination endpoint (POST /runs/{run_id}/terminate) is
intentionally out of scope here — opening a separate issue first
since the RBAC and dispatcher-mediated soft-cancel design needs
maintainer input before code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(release): map contributor email for attribution check

* test(kanban-dashboard): pin enriched 409 detail and inline error wiring (#26744)

- Existing ``test_patch_drag_drop_move_todo_to_ready`` now asserts the
  enriched 409 detail names the blocking parent (id, quoted title, and
  current status), so the dashboard always has something actionable to
  render.
- New bundle-assertion test ``test_dashboard_surfaces_ready_blocked_error_inline``
  pins the frontend wiring: the ``parseApiErrorMessage`` helper exists,
  the drag/drop banner runs through it, and the drawer maintains a
  visible ``patchErr`` state that's cleared between PATCHes and tasks.

* docs(codex_app_server): document multi-root Kanban writable_roots (#27941)

Update the Codex app-server runtime guide's Kanban section to reflect
the new behaviour:

  * The sandbox override now adds the board DB directory plus every
    Kanban path the dispatcher pinned (HERMES_KANBAN_WORKSPACES_ROOT,
    HERMES_KANBAN_WORKSPACE, legacy HERMES_KANBAN_ROOT) -- deduplicated,
    DB-dir first.
  * The motivation note now includes the cross-mount artifact-write
    scenario (e.g. ``/media/.../kanban-workspaces/...`` on a separate
    drive) and links to issue #27941 so readers can find the original
    bug report.

* fix(gateway): quiet corrupt kanban dispatcher boards

Salvages substantive part of #26490 by @aqilaziz. Detects corrupt board
DBs ("file is not a database" / "database disk image is malformed")
and disables them by fingerprint until they're repaired, instead of
flooding the gateway log with repeated logger.exception tracebacks every
tick.

Cherry-picked the substantive commit (ea5b4ec2a); the tip commit was
an unrelated _is_dir OSError fix for service-path lookup. Dropped a
small test reformat that was bundled in the same commit.

* docs: align kanban readiness docs and smoke tests

Salvages #28199 by @bensargotest-sys. Aligns Kanban docs with current
tool registration: dispatcher-spawned task workers get task tools,
profiles that explicitly enable the kanban toolset get orchestrator
routing tools (kanban_list, kanban_unblock). Corrects failure-limit
text to current default of 2. Hardens the e2e subprocess script to
resolve repo root and use the spawnable default assignee. Updates the
diagnostics severity fixture to assert error below the critical
threshold.

* feat(kanban): surface per-task model_override in show + tool output

Salvages #26897 by @loicnico96. The per-task model_override DB column
already exists on main, but it wasn't exposed in user-facing surfaces.
This adds:
- 'kanban show' prints 'model: <name>' when model_override is set
- kanban_show / kanban_list tool responses include the model_override field

Original branch was stale (PR was authored against an older field name
'model'); applied the substantive surface exposure manually using the
current 'model_override' field name.

* feat(cli): add kanban swarm topology helper

Salvages #26791 by @Niraven. Adds 'hermes kanban swarm' to create a
durable Kanban Swarm v1 graph: a completed root/blackboard card,
parallel worker cards, a verifier gated on all workers, and a
synthesizer gated on the verifier. Stores shared swarm blackboard
updates as structured JSON comments on the root card.

Self-contained: new hermes_cli/kanban_swarm.py module + CLI wiring +
unit tests.

* feat(kanban): add optional board parameter to all MCP tools

Salvages #27598 by @nnnet. Adds optional 'board' parameter to all 9
kanban_* MCP tools via shared _connect helper. Backwards compatible —
omitting board keeps current pinned-board behavior. Useful for
orchestrator profiles that route across multiple boards.

Two-file scope: tools/kanban_tools.py + tests.

* feat(kanban): stamp originating ACP session_id on tasks

Salvages #23208 by @awizemann. Tracks which chat session created a
kanban task so clients can render a per-session board without falling
back to tenant + time-window heuristics.

- Schema: tasks gains nullable session_id TEXT column with index
  (additive migration in _migrate_add_optional_columns).
- ACP: server.py exposes the originating session id via HERMES_SESSION_ID
  with save/restore around the agent loop.
- Tool: kanban_create reads HERMES_SESSION_ID (with explicit override).
- CLI: 'hermes kanban list --session <id>' filter; JSON output exposes
  session_id.

* feat(kanban): wire dispatcher to dispatch review agents from review column

Salvages #23772 by @thewillhuang. Adds 'review' as a valid kanban task
status and extends dispatch_once to monitor the review column as a
second dispatch source (in addition to the existing ready column).

- Adds 'review' to VALID_STATUSES
- Adds claim_review_task() — atomically transitions review → running
- Adds has_spawnable_review() — health telemetry mirror
- Extends dispatch_once with a review column dispatch loop
- Review agents get 'sdlc-review' skill auto-loaded

Resolved 2 conflicts (VALID_STATUSES merge with main's 'scheduled' state,
test file additions). Adapted claim_review_task to main's
ttl_seconds: Optional[int] = None convention (matches claim_task).

* feat(kanban): stale detection for running tasks in dispatcher

Salvages #23790 by @thewillhuang. Adds detect_stale_running() to
the dispatcher cycle. Running tasks that have been started for longer
than dispatch_stale_timeout_seconds (default 14400 = 4h) without a
heartbeat in the last hour are auto-reclaimed to ready.

- New config kanban.dispatch_stale_timeout_seconds (default 14400, 0 disables)
- New 'stale' field on DispatchResult
- detect_stale_running() in kanban_db.py with heartbeat freshness check
- Records outcome='stale' on run close + 'stale' event; ticks failure counter
- Wires config through gateway embedded dispatcher
- Updates _cmd_dispatch verbose/JSON output and daemon logging

Resolved test-file end-of-file conflict by appending both halves.

* feat(kanban): filter tasks by workflow fields and runs by status/outcome

Salvages #26745 by @nehaaprasaad. Exposes filtering for the existing
workflow_template_id and current_step_key columns:

- list_tasks() accepts workflow_template_id and current_step_key kwargs
- 'hermes kanban list' adds matching CLI flags
- dashboard plugin_api also exposes the filters

Resolved a small conflict in list_tasks signature alongside main's
session_id and order_by additions; combined all three into the single
filter list.

* feat(kanban): add respawn guard to block repeat worker storms

Salvages #27484 by @fardoche6. Adds a respawn guard that skips worker
spawn for tasks where:
- a recent run already succeeded (recent_success — within guard window)
- the previous run hit a quota/auth error (blocker_auth, also auto-blocks)
- a recent task comment includes a GitHub PR URL (active_pr)

The guard prevents repeat worker storms on the same bug/task. Includes
the contributor's review-findings fixup (regex hardening, observability,
auth coverage).

Resolved a small DispatchResult conflict alongside main's 'stale' field;
kept both. Authorship preserved via rebase merge.

* feat(kanban): show dashboard cron jobs across profiles

Salvages #27568 by @SerenityTn. Dashboard cron page now lists cron
jobs from all profiles, with profile-aware filter UI and storage
routing. Includes test coverage for cross-profile listing, mutation,
deletion, and validation.

Also fixes orphan conflict markers in config.py left by an earlier
salvage merge (kanban.dispatch_stale_timeout_seconds was double-nested
in HEAD/PR markers from #28452 salvage of #23790).

* fix(kanban): remove orphan conflict markers from config.py (#28458)

PR #28452 (salvage of #23790, stale detection) merged with leftover
git conflict markers in hermes_cli/config.py around the
`dispatch_stale_timeout_seconds` config block, breaking config import
and any code path that loads it. Cleans up the markers and keeps both
config blocks (worker log rotation/orchestrator + stale detection).

Resolves a self-introduced regression.

* fix(kanban): remove orphan conflict markers from kanban.py (#28459)

PR #28454 (salvage of #26745, workflow filter) merged with leftover
git conflict markers in hermes_cli/kanban.py at three sites:
- _task_to_dict() (session_id alongside workflow_template_id/current_step_key)
- p_list parser (--sort alongside --workflow-template-id/--step-key)
- _cmd_list (order_by alongside the new filter kwargs)

Cleans up the markers and keeps both halves at each site.

Resolves a self-introduced regression.

* feat(kanban): configure worktree paths and branches

Salvages #26496 by @aqilaziz. Adds branch_name column + CLI flag so
tasks with workspace_kind='worktree' can pin a target branch on
create. Schema migration added to _migrate_add_optional_columns.

- Task.branch_name field + DB column + migration
- create_task accepts branch_name kwarg
- hermes kanban create --branch <name> flag
- kanban show output includes 'Branch: <name>' when set

Cherry-picked the substantive commit (a7558cf27); the PR's tip was
an unrelated service-path-dirs commit. Resolved 2 INSERT-column-list
and show-output conflicts alongside main's session_id and
max_runtime_seconds additions; kept all three.

* feat(skills): add skill bundles — alias /<name> loads multiple skills (#28373)

Skill bundles are tiny YAML files in ~/.hermes/skill-bundles/ that
group several skills under one slash command. Invoking /<bundle-name>
from any surface (CLI, TUI, dashboard, any gateway platform) loads
every referenced skill into a single combined user message.

Use cases:
- /backend-dev → loads github-code-review + test-driven-development
  + github-pr-workflow as one bundle.
- /research → loads several research skills together.
- Team task profiles shared via dotfiles.

Behavior:
- Bundles take precedence over individual skills when slugs collide.
- Missing skills are skipped with a note, not fatal.
- No system-prompt mutation — bundles generate a fresh user message
  at invocation time, the same way /<skill> does. Prompt cache stays
  intact.
- Works in CLI dispatch, gateway dispatch, autocomplete (CLI + TUI),
  /help display.

Schema (~/.hermes/skill-bundles/<slug>.yaml):
    name: backend-dev
    description: Backend feature work.
    skills:
      - github-code-review
      - test-driven-developme…
Lillard01 pushed a commit to Lillard01/hermes-agent that referenced this pull request May 21, 2026
…ousResearch#28328)

Pre-stages AUTHOR_MAP entries for 10 new contributors whose PRs are being
salvaged in the May 2026 low-hanging-fruit batch (group 8). Lands ahead
of the per-PR salvage PRs so they don't get blocked by AUTHOR_MAP CI.

Contributors:
- AceWattGit (NousResearch#28159 — _pool_may_recover_from_rate_limit NameError)
- YuanHanzhong (NousResearch#28032 — x.com/status fallbacks link-like)
- colin-chang (NousResearch#28245, NousResearch#28249, NousResearch#28251 — gateway + mattermost fixes)
- felix-windsor (NousResearch#28019 — preserve cron asterisks in strip mode)
- houenyang-momo (NousResearch#28205 — charizard completion menu contrast)
- iqdoctor (NousResearch#28095 — windows installer docs)
- joe102084 (NousResearch#28151 — whitespace-only cron responses)
- jvinals (NousResearch#27936 — Slack U-IDs → DM channel)
- maxmilian (NousResearch#28267 — ModelPickerDialog portal)
- samggggflynn (NousResearch#27952 — dingtalk pre_start)

Per references/batch-pr-salvage-may14-additions.md.
Mucky010 pushed a commit to Mucky010/hermes-agent that referenced this pull request May 24, 2026
…ousResearch#28328)

Pre-stages AUTHOR_MAP entries for 10 new contributors whose PRs are being
salvaged in the May 2026 low-hanging-fruit batch (group 8). Lands ahead
of the per-PR salvage PRs so they don't get blocked by AUTHOR_MAP CI.

Contributors:
- AceWattGit (NousResearch#28159 — _pool_may_recover_from_rate_limit NameError)
- YuanHanzhong (NousResearch#28032 — x.com/status fallbacks link-like)
- colin-chang (NousResearch#28245, NousResearch#28249, NousResearch#28251 — gateway + mattermost fixes)
- felix-windsor (NousResearch#28019 — preserve cron asterisks in strip mode)
- houenyang-momo (NousResearch#28205 — charizard completion menu contrast)
- iqdoctor (NousResearch#28095 — windows installer docs)
- joe102084 (NousResearch#28151 — whitespace-only cron responses)
- jvinals (NousResearch#27936 — Slack U-IDs → DM channel)
- maxmilian (NousResearch#28267 — ModelPickerDialog portal)
- samggggflynn (NousResearch#27952 — dingtalk pre_start)

Per references/batch-pr-salvage-may14-additions.md.
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
…ousResearch#28328)

Pre-stages AUTHOR_MAP entries for 10 new contributors whose PRs are being
salvaged in the May 2026 low-hanging-fruit batch (group 8). Lands ahead
of the per-PR salvage PRs so they don't get blocked by AUTHOR_MAP CI.

Contributors:
- AceWattGit (NousResearch#28159 — _pool_may_recover_from_rate_limit NameError)
- YuanHanzhong (NousResearch#28032 — x.com/status fallbacks link-like)
- colin-chang (NousResearch#28245, NousResearch#28249, NousResearch#28251 — gateway + mattermost fixes)
- felix-windsor (NousResearch#28019 — preserve cron asterisks in strip mode)
- houenyang-momo (NousResearch#28205 — charizard completion menu contrast)
- iqdoctor (NousResearch#28095 — windows installer docs)
- joe102084 (NousResearch#28151 — whitespace-only cron responses)
- jvinals (NousResearch#27936 — Slack U-IDs → DM channel)
- maxmilian (NousResearch#28267 — ModelPickerDialog portal)
- samggggflynn (NousResearch#27952 — dingtalk pre_start)

Per references/batch-pr-salvage-may14-additions.md.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cron Cron scheduler and job management P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants