fix(dashboard): resolve chat TUI argv off event loop#26124
Conversation
d39f954 to
dadfd2c
Compare
There was a problem hiding this comment.
Pull request overview
Moves potentially slow chat TUI argv resolution (including any npm install / npm run build triggered by _make_tui_argv()) off the FastAPI /api/pty WebSocket event loop to prevent dashboard WebSocket stalls and reverse-proxy disconnects.
Changes:
- Added an async
_resolve_chat_argv_async()wrapper that dispatches synchronous argv resolution viaasyncio.to_thread(). - Updated
/api/ptyWebSocket handler to await the async resolver instead of calling the synchronous resolver directly. - Added tests asserting that argv resolution is routed through the async wrapper and uses
asyncio.to_thread().
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
hermes_cli/web_server.py |
Adds async argv resolver and uses it in /api/pty to keep the event loop responsive during Node/npm work. |
tests/hermes_cli/test_web_server.py |
Adds unit coverage ensuring the websocket path uses the async resolver and that it dispatches via asyncio.to_thread(). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
austinpickett
left a comment
There was a problem hiding this comment.
Use PULL_REQUEST_TEMPLATE.md, address copilot comments.
dadfd2c to
7aa57db
Compare
|
Addressed the requested changes: |
Beautiful, let's wait for the build to pass and I'll approve. Also please resolve any addressed conversations |
7aa57db to
71cbab5
Compare
|
@austinpickett quick update: the addressed Copilot conversation is resolved, and I rebased the PR onto current main to pick up the web i18n type fix that was causing the unrelated Docker/Nix failures. Current PR head is now 71cbab50a414f08d15baf1b4bcf34a53716f849c. Validation re-run locally after the rebase:
CI should be rerunning on the updated head now. Thanks! |
|
@austinpickett quick follow-up: the remaining attribution failure was due to my commit email not being present in scripts/release.py AUTHOR_MAP. I added the mapping:
New PR head: 2196c00f2 Validation re-run locally:
The earlier supply-chain failure appears to be a stale-base false positive from the workflow comparing 7552e0f..71cbab50, which included an unrelated main-branch change to skills/productivity/google-workspace/scripts/setup.py. The actual PR diff before this attribution mapping was only hermes_cli/web_server.py and tests/hermes_cli/test_web_server.py; after this push it additionally includes only this AUTHOR_MAP line. CI should rerun on the new head now. |
|
@austinpickett one more CI note: the latest push (2196c00f2) appears to be in GitHub Actions The visible supply-chain failure is from the previous head (71cbab50), not the latest head. I reproduced that old scanner locally: the only high-signal match was Could you approve/run the workflows for the latest head when you get a chance? The current PR diff is limited to:
|
austinpickett
left a comment
There was a problem hiding this comment.
Review: fix(dashboard): resolve chat TUI argv off event loop
3 files, +80/-1. Fixes blocking subprocess calls (npm install/npm run build) inside an async WebSocket handler that froze the event loop.
(item 1) Fix is correct — asyncio.Lock + asyncio.to_thread() moves blocking work off the event loop while serializing concurrent builds. Matches the right asyncio pattern.
(item 2) Tests are good — one verifies the async wrapper delegates via to_thread, one verifies pty_ws calls the async path instead of the old sync one.
(item 3) Minor scope creep: author-map entry in release.py — one line, zero risk.
Verdict: Approve ✅ — Real bug, correct fix, good tests.
Reviewed by Hermes Agent
|
@austinpickett thanks for the approval. Could you please rerun/approve the checks when you get a chance? The remaining blockers appear to be CI state rather than code review:
The current PR diff is still limited to:
Local focused validation is still green: |
2196c00 to
9dfea8d
Compare
9dfea8d to
49def88
Compare
austinpickett
left a comment
There was a problem hiding this comment.
Code Review — Approved ✅
Tight, well-reasoned fix. _resolve_chat_argv() can run npm install / npm run build via _make_tui_argv(), and calling it synchronously inside the /api/pty WebSocket handler blocked the dashboard event loop long enough for reverse proxies to drop the chat connection. Moving that to a worker thread is the right call.
What I verified:
SystemExitpropagation — confirmedsys.exit(1)from_make_tui_argv(raised when node/npm is missing) still propagates cleanly throughasyncio.to_thread()and is caught by the existingexcept SystemExitblock. No regression to the error path.- Lock placement —
_chat_argv_lockis acquired before dispatching to the threadpool, so queued chat tabs preserve one-build-at-a-time behavior without occupying worker threads while waiting. Correct ordering. - Scope — does not reintroduce mtime/staleness caching removed in
c6ca11618; only relocates where the resolver runs. Build policy intact. - Tests — both new tests are meaningful (dispatch-through-
to_thread+/api/ptyroutes through the async wrapper).
Local results:
pytest tests/hermes_cli/test_web_server.py tests/hermes_cli/test_tui_npm_install.py -q
→ 146 passed, 16 skipped (PTY/platform-gated)
git diff --check → clean
Nothing critical or warning-level. Narrowly scoped, accurate description, good coverage. LGTM.
What does this PR do?
Dashboard chat now resolves its TUI launch command off the FastAPI/WebSocket event loop. The resolver can run
npm install/npm run buildthrough_make_tui_argv(), and doing that synchronously in/api/ptycan block proxy keepalives and other dashboard WebSocket work long enough for reverse-proxy deployments to drop the chat connection.This keeps the current TUI build policy intact: normal production launches still run the correctness-first
npm run buildpath, whileHERMES_TUI_DIRremains the prebuilt/no-build path for distros and containers. The change only moves the potentially slow resolver work to a worker thread for the dashboard chat path.I initially looked at skipping no-op TUI rebuilds, but
mainintentionally simplified that logic inc6ca11618to avoid fragile mtime-based staleness checks:HERMES_TUI_DIRset: use prebuiltdist/entry.js, no build.--dev: run TypeScript sources directly, no production build.npm run buildfor correctness.This PR does not reintroduce bundle staleness caching. It addresses the dashboard-specific failure mode instead: blocking work inside the WebSocket handler's event loop.
Related Issue
Fixes #25351
Type of Change
Changes Made
hermes_cli/web_server.py: add_resolve_chat_argv_async()and call it from the/api/ptyWebSocket handler.hermes_cli/web_server.py: run the synchronous resolver viaasyncio.to_thread()so the dashboard event loop stays responsive while Node/npm work completes.hermes_cli/web_server.py: serialize resolver calls with anasyncio.Lockbefore dispatching to the worker thread. This preserves one-build-at-a-time behavior when multiple chat tabs connect, without occupying default threadpool workers while queued connections wait.tests/hermes_cli/test_web_server.py: add coverage that the resolver is dispatched through the async wrapper and that/api/ptyuses that wrapper.How to Test
git diff --checkuv run --with pytest --with pytest-xdist pytest tests/hermes_cli/test_web_server.py::TestPtyWebSocket::test_resolve_chat_argv_async_uses_worker_thread tests/hermes_cli/test_web_server.py::TestPtyWebSocket::test_pty_ws_resolves_argv_through_async_wrapper -quv run --with pytest --with pytest-xdist --with ptyprocess pytest tests/hermes_cli/test_web_server.py tests/hermes_cli/test_tui_npm_install.py -qChecklist
Code
fix(scope):,feat(scope):, etc.)pytest tests/ -qand all tests passDocumentation & Housekeeping
docs/, docstrings) — or N/Acli-config.yaml.exampleif I added/changed config keys — or N/ACONTRIBUTING.mdorAGENTS.mdif I changed architecture or workflows — or N/AScreenshots / Logs
Validation results:
git diff --checkuv run --with pytest --with pytest-xdist pytest tests/hermes_cli/test_web_server.py::TestPtyWebSocket::test_resolve_chat_argv_async_uses_worker_thread tests/hermes_cli/test_web_server.py::TestPtyWebSocket::test_pty_ws_resolves_argv_through_async_wrapper -q2 passeduv run --with pytest --with pytest-xdist --with ptyprocess pytest tests/hermes_cli/test_web_server.py tests/hermes_cli/test_tui_npm_install.py -q156 passedSystemExitbehavior is preserved: if_make_tui_argv()exits because Node/npm is unavailable or the TUI build fails,asyncio.to_thread()propagates that back to the WebSocket handler and the existing error response path still runs.