Skip to content

fix(hermes): suspend kanban dispatcher under shields-up#4703

Merged
cv merged 3 commits into
mainfrom
fix/4299-kanban-dispatcher-shields-up
Jun 3, 2026
Merged

fix(hermes): suspend kanban dispatcher under shields-up#4703
cv merged 3 commits into
mainfrom
fix/4299-kanban-dispatcher-shields-up

Conversation

@laitingsheng

@laitingsheng laitingsheng commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Summary

Under shields-up the Hermes config root /sandbox/.hermes is root-owned read-only, so the gateway's embedded kanban dispatcher fails every tick when kanban_db.connect() runs PRAGMA journal_mode=WAL against the locked kanban.db. The dispatcher reconnects each cycle, filling /tmp/gateway.log with sqlite3.OperationalError: attempt to write a readonly database tracebacks while the health endpoint stays 200, so the breakage is silent. This change exports the upstream-supported HERMES_KANBAN_DISPATCH_IN_GATEWAY=0 env override when the config root is locked, cleanly suspending the dispatcher and notifier loops without touching upstream Hermes.

Related Issue

Fixes #4299

Changes

  • agents/hermes/start.sh: new apply_shields_up_runtime_env helper that detects the shields-up posture via the existing hermes_config_root_is_locked probe and exports HERMES_KANBAN_DISPATCH_IN_GATEWAY=0 if the caller has not set it. Called in both the non-root and root entrypoint paths, after config-integrity verification and before gateway launch so the gateway child inherits it.
  • test/hermes-start.test.ts: three new cases asserting the override fires under shields-up, stays off when shields are down, and preserves a caller-supplied value.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • `npx prek run --all-files` passes
  • `npm test` passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • `npm run docs` builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai tinsonl@nvidia.com

Summary by CodeRabbit

  • New Features

    • Hermes now automatically disables the kanban dispatcher in the gateway when the configuration root is locked; manual overrides provided by users are preserved.
  • Tests

    • Added tests covering dispatcher behavior for locked, unlocked, and user-override scenarios, including emitted stderr messages for the forced disablement.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@coderabbitai

coderabbitai Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 884465e6-1ffa-4bd5-af21-ebe3251c9527

📥 Commits

Reviewing files that changed from the base of the PR and between 54d4f7a and bb1bf59.

📒 Files selected for processing (1)
  • test/hermes-start.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/hermes-start.test.ts

📝 Walkthrough

Walkthrough

Hermes startup now defines apply_shields_up_runtime_env() to detect locked config roots and disable the kanban dispatcher by setting HERMES_KANBAN_DISPATCH_IN_GATEWAY=0 unless caller-supplied. This function is integrated into both non-root and root startup paths before provider placeholder refresh, with tests verifying locked/unlocked and override-preservation behavior.

Changes

Shields-up kanban dispatcher override

Layer / File(s) Summary
Shields-up kanban dispatcher override function and startup integration
agents/hermes/start.sh
Defines apply_shields_up_runtime_env() which checks whether the Hermes config root is locked and injects HERMES_KANBAN_DISPATCH_IN_GATEWAY=0 (with Shields-up logging) unless already set. The function is called in both non-root (line 850) and root (line 899) startup branches before provider placeholder refresh and messaging configuration.
Test coverage for shields-up override behavior
test/hermes-start.test.ts
Adds LOCKED_HERMES_CONFIG_STAT_MOCK, updates runtime script generation to reuse the mock, and introduces runShieldsUpRuntimeEnv plus a Vitest suite with three assertions: dispatcher forced to 0 when config root is locked, unchanged when unlocked, and caller-supplied overrides preserved under shields-up.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

Integration: Hermes, Sandbox

Suggested reviewers

  • miyoungc

Poem

"I nibble logs beneath the moonlit ground,
Shields rise high where quiet files are found.
Hermes hushes kanban's frantic drum,
No scribbled tracebacks beating like a drum.
Tests hop close — the garden sleeps, all sound."

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix(hermes): suspend kanban dispatcher under shields-up' clearly and concisely summarizes the main change—implementing a fix to suspend the kanban dispatcher when the Hermes config is locked under shields-up.
Linked Issues check ✅ Passed The pull request fully addresses all coding requirements from issue #4299: it detects shields-up via the config root lock probe, exports the upstream-supported environment override HERMES_KANBAN_DISPATCH_IN_GATEWAY=0 to suspend the dispatcher, and includes comprehensive tests validating the override behavior under locked/unlocked configurations.
Out of Scope Changes check ✅ Passed All changes are directly scoped to the issue: the helper function and its integration in the startup script detect and handle shields-up conditions, and the tests validate the new functionality without introducing unrelated modifications.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/4299-kanban-dispatcher-shields-up

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

PR Review Advisor

Findings: 0 needs attention, 3 worth checking, 0 nice ideas
Since last review: 0 prior items resolved, 0 still apply, 2 new items found

Review findings

🛠️ Needs attention

  • None.

🔎 Worth checking

  • Source-of-truth review needed: agents/hermes/start.sh shields-up kanban dispatcher workaround: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: apply_shields_up_runtime_env is a localized runtime-env workaround; the linked issue offers three possible resolutions, and this PR implements the env-disable route without runtime log validation.
  • Shields-up dispatcher suspension can be bypassed by preset env (agents/hermes/start.sh:327): The new helper only sets HERMES_KANBAN_DISPATCH_IN_GATEWAY=0 when the variable is unset or empty. A caller-supplied value such as 1 is preserved under the locked config-root posture, which can re-enable the embedded dispatcher in exactly the read-only shields-up state this patch is intended to protect. That may be intentional compatibility, but it weakens the runtime policy if shields-up is supposed to be authoritative.
    • Recommendation: If shields-up must force the safe posture, set HERMES_KANBAN_DISPATCH_IN_GATEWAY=0 unconditionally when hermes_config_root_is_locked is true, or at least reject/log-and-override enabling values. If manual override is intentional, document who is allowed to supply it and why it is safe despite the readonly kanban.db failure mode.
    • Evidence: agents/hermes/start.sh:325-331 exports 0 only inside `if [ -z "${HERMES_KANBAN_DISPATCH_IN_GATEWAY:-}" ]`; test/hermes-start.test.ts:543-549 explicitly verifies that preset value `1` is preserved under shields-up.
  • Issue's gateway-log acceptance path is not runtime-validated (test/hermes-start.test.ts:523): The tests prove the helper exports the environment default for a mocked locked config root, but they do not launch a Hermes gateway or assert the linked issue's observable requirement that /tmp/gateway.log under shields-up stops producing Traceback/FATAL/Aborting output. The fix also depends on upstream Hermes honoring HERMES_KANBAN_DISPATCH_IN_GATEWAY=0, which is not verified here.
    • Recommendation: Add or identify a targeted runtime/integration validation that starts the Hermes gateway in a locked config-root posture, confirms the gateway child inherits HERMES_KANBAN_DISPATCH_IN_GATEWAY=0, and checks the gateway log for absence of the repeated kanban sqlite Traceback/FATAL/Aborting pattern. If full gateway validation is too heavy for unit tests, document the manual/runtime validation tied to this workaround.
    • Evidence: Issue [DGX Spark][Sandbox] Hermes gateway kanban dispatcher logs repeated Traceback under shields-up — sqlite3 OperationalError on readonly database #4299 states `gateway.log` under shields-up `MUST NOT contain "Traceback", "FATAL", or "Aborting"`; the new tests in test/hermes-start.test.ts:523-549 only inspect the shell helper's printed `KANBAN=` value and stderr message.

🌱 Nice ideas

  • None.
Since last review details

Current findings:

  • Source-of-truth review needed: agents/hermes/start.sh shields-up kanban dispatcher workaround: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: apply_shields_up_runtime_env is a localized runtime-env workaround; the linked issue offers three possible resolutions, and this PR implements the env-disable route without runtime log validation.
  • Shields-up dispatcher suspension can be bypassed by preset env (agents/hermes/start.sh:327): The new helper only sets HERMES_KANBAN_DISPATCH_IN_GATEWAY=0 when the variable is unset or empty. A caller-supplied value such as 1 is preserved under the locked config-root posture, which can re-enable the embedded dispatcher in exactly the read-only shields-up state this patch is intended to protect. That may be intentional compatibility, but it weakens the runtime policy if shields-up is supposed to be authoritative.
    • Recommendation: If shields-up must force the safe posture, set HERMES_KANBAN_DISPATCH_IN_GATEWAY=0 unconditionally when hermes_config_root_is_locked is true, or at least reject/log-and-override enabling values. If manual override is intentional, document who is allowed to supply it and why it is safe despite the readonly kanban.db failure mode.
    • Evidence: agents/hermes/start.sh:325-331 exports 0 only inside `if [ -z "${HERMES_KANBAN_DISPATCH_IN_GATEWAY:-}" ]`; test/hermes-start.test.ts:543-549 explicitly verifies that preset value `1` is preserved under shields-up.
  • Issue's gateway-log acceptance path is not runtime-validated (test/hermes-start.test.ts:523): The tests prove the helper exports the environment default for a mocked locked config root, but they do not launch a Hermes gateway or assert the linked issue's observable requirement that /tmp/gateway.log under shields-up stops producing Traceback/FATAL/Aborting output. The fix also depends on upstream Hermes honoring HERMES_KANBAN_DISPATCH_IN_GATEWAY=0, which is not verified here.
    • Recommendation: Add or identify a targeted runtime/integration validation that starts the Hermes gateway in a locked config-root posture, confirms the gateway child inherits HERMES_KANBAN_DISPATCH_IN_GATEWAY=0, and checks the gateway log for absence of the repeated kanban sqlite Traceback/FATAL/Aborting pattern. If full gateway validation is too heavy for unit tests, document the manual/runtime validation tied to this workaround.
    • Evidence: Issue [DGX Spark][Sandbox] Hermes gateway kanban dispatcher logs repeated Traceback under shields-up — sqlite3 OperationalError on readonly database #4299 states `gateway.log` under shields-up `MUST NOT contain "Traceback", "FATAL", or "Aborting"`; the new tests in test/hermes-start.test.ts:523-549 only inspect the shell helper's printed `KANBAN=` value and stderr message.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: None
Optional E2E: None

Workflow run

Full advisor summary

E2E Recommendation Advisor

Failed: Could not parse JSON from advisor output; see /home/runner/work/NemoClaw/NemoClaw/artifacts/e2e-advisor/e2e-advisor-raw-output.txt

@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

E2E Scenario Advisor Recommendation

Required scenario E2E: None
Optional scenario E2E: None

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Failed: Could not parse JSON from advisor output; see /home/runner/work/NemoClaw/NemoClaw/artifacts/e2e-advisor/e2e-scenario-advisor-raw-output.txt

@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26868069659
Target ref: 54d4f7a4c5836e06e385c3f134df02182db291e4
Workflow ref: main
Requested jobs: hermes-root-entrypoint-smoke-e2e,hermes-e2e
Summary: 2 passed, 0 failed, 0 skipped

Job Result
hermes-e2e ✅ success
hermes-root-entrypoint-smoke-e2e ✅ success

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
test/hermes-start.test.ts (2)

484-501: ⚡ Quick win

Extract the duplicated stat() mock into a shared helper.

This block is a near-verbatim copy of the stat() mock in runHermesGatewayRuntimeCleanup (Lines 222-235). The two must stay byte-for-byte in sync to keep shields-up detection consistent; if hermes_config_root_is_locked's stat invocations change, both copies have to be updated, which is easy to miss.

Consider hoisting it to a module-level constant (e.g., LOCKED_CONFIG_STAT_MOCK) reused by both helpers.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/hermes-start.test.ts` around lines 484 - 501, The duplicated multi-line
stat() mock (currently in the local variable statMock and duplicated in
runHermesGatewayRuntimeCleanup) should be extracted to a single module-level
constant (e.g., LOCKED_CONFIG_STAT_MOCK) and both places should reference that
constant; update the test file to replace the inline statMock definition with a
reference to LOCKED_CONFIG_STAT_MOCK and update runHermesGatewayRuntimeCleanup
to use the same constant so the byte-for-byte mock remains in one place and is
reused by hermes_config_root_is_locked and the other helper.

510-518: ⚡ Quick win

Runtime replication: id mock isn’t needed for this snippet; consider aligning set flags.

  • apply_shields_up_runtime_env (and its dependency hermes_config_root_is_locked / hermes_config_path_is_locked) never branches on id -u in agents/hermes/start.sh, so the lack of an id() mock in test/hermes-start.test.ts lines 510-518 shouldn’t change behavior.
  • The helper script uses set -uo pipefail while agents/hermes/start.sh uses set -euo pipefail; for strict fidelity (and to catch unexpected command failures), align the helper to set -euo pipefail—though in these extracted functions, stat failures are already masked (... || ... || true) and apply_shields_up_runtime_env is explicitly guarded with || return 0.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/hermes-start.test.ts` around lines 510 - 518, Remove the unnecessary
`id` mock from the test snippet because apply_shields_up_runtime_env and its
dependent functions hermes_config_root_is_locked / hermes_config_path_is_locked
never branch on `id -u`; instead, update the helper invocation that currently
uses "set -uo pipefail" to "set -euo pipefail" so the extracted functions mirror
agents/hermes/start.sh's strict flags and will fail on unexpected command errors
(note these functions already mask stat failures and
apply_shields_up_runtime_env is guarded with `|| return 0`).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@test/hermes-start.test.ts`:
- Around line 484-501: The duplicated multi-line stat() mock (currently in the
local variable statMock and duplicated in runHermesGatewayRuntimeCleanup) should
be extracted to a single module-level constant (e.g., LOCKED_CONFIG_STAT_MOCK)
and both places should reference that constant; update the test file to replace
the inline statMock definition with a reference to LOCKED_CONFIG_STAT_MOCK and
update runHermesGatewayRuntimeCleanup to use the same constant so the
byte-for-byte mock remains in one place and is reused by
hermes_config_root_is_locked and the other helper.
- Around line 510-518: Remove the unnecessary `id` mock from the test snippet
because apply_shields_up_runtime_env and its dependent functions
hermes_config_root_is_locked / hermes_config_path_is_locked never branch on `id
-u`; instead, update the helper invocation that currently uses "set -uo
pipefail" to "set -euo pipefail" so the extracted functions mirror
agents/hermes/start.sh's strict flags and will fail on unexpected command errors
(note these functions already mask stat failures and
apply_shields_up_runtime_env is guarded with `|| return 0`).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 421d89c6-2db5-47dc-b4d8-a777f97d1a80

📥 Commits

Reviewing files that changed from the base of the PR and between 46bb3a0 and 54d4f7a.

📒 Files selected for processing (2)
  • agents/hermes/start.sh
  • test/hermes-start.test.ts

@laitingsheng laitingsheng added fix integration: hermes Hermes integration behavior labels Jun 3, 2026
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
…tcher-shields-up

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@laitingsheng laitingsheng added bug-fix PR fixes a bug or regression v0.0.58 Release target labels Jun 3, 2026
@wscurran wscurran removed fix v0.0.58 Release target labels Jun 3, 2026
@laitingsheng laitingsheng added the v0.0.58 Release target label Jun 3, 2026
@wscurran wscurran requested a review from cv June 3, 2026 16:49
@cv cv self-assigned this Jun 3, 2026
@cv cv merged commit 5fcfcd9 into main Jun 3, 2026
34 of 35 checks passed
@cv cv deleted the fix/4299-kanban-dispatcher-shields-up branch June 3, 2026 19:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix PR fixes a bug or regression integration: hermes Hermes integration behavior v0.0.58 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[DGX Spark][Sandbox] Hermes gateway kanban dispatcher logs repeated Traceback under shields-up — sqlite3 OperationalError on readonly database

3 participants