Skip to content

fix: recover Hermes Tirith startup marker#3797

Merged
cv merged 7 commits into
mainfrom
bugfix/hermes-tirith-startup-diagnostics
May 19, 2026
Merged

fix: recover Hermes Tirith startup marker#3797
cv merged 7 commits into
mainfrom
bugfix/hermes-tirith-startup-diagnostics

Conversation

@ericksoa

@ericksoa ericksoa commented May 19, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Retry Hermes' Tirith installer during startup when /sandbox/.hermes/.tirith-install-failed contains download_failed.
  • Keep Hermes gateway startup non-blocking for unknown marker reasons or retry failures.
  • Add redacted Step 7 timeout diagnostics for Hermes Tirith marker, binary presence, and startup/gateway log tails.

Fixes #3793
NVBug: 6190755

Validation

  • bash -n agents/hermes/start.sh
  • npm run build:cli
  • npm run typecheck:cli
  • npx vitest run test/hermes-start.test.ts src/lib/agent/onboard.test.ts
  • npx vitest run --project cli test/nemoclaw-start.test.ts test/ssrf-parity.test.ts

Signed-off-by: Aaron Erickson aerickson@nvidia.com

Summary by CodeRabbit

  • New Features

    • Automatic retry handling for Hermes Tirith install markers with safe fallback during startup
    • Enhanced Hermes startup diagnostics collection that redacts sensitive environment variables
  • Tests

    • Unit tests covering Tirith marker retry, non-retryable, and symlink scenarios; diagnostics redaction
    • Updated/generated e2e parity inventory and parity-map entries adding WhatsApp messaging assertions
  • Documentation

    • Added SPDX license/header metadata to parity map files

Review Change Stack

@coderabbitai

coderabbitai Bot commented May 19, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds Hermes startup handling to detect and optionally retry build-time Tirith install failures, integrates a sandbox diagnostics collector for Hermes startup timeouts, and includes unit tests plus generated e2e parity updates.

Changes

Hermes Tirith Bootstrap Retry and Diagnostics

Layer / File(s) Summary
Start.sh: retry marker handling and env export
agents/hermes/start.sh
Adds retry_tirith_marker_if_needed() that checks ${HERMES_DIR}/.tirith-install-failed, treats regular-file download_failed as retryable and removes the marker when possible, warns on non-retryable reasons or symlinks, and exports HERMES_HOME in the root path before config verification.
Start.sh marker unit tests
test/hermes-start.test.ts
Extracts and executes the bash function to test: removal for download_failed, preservation for unknown reasons, and symlink rejection with expected stderr messages and filesystem side effects.
Onboard: Hermes diagnostics collection
src/lib/agent/onboard.ts
Introduces a shell diagnostics script and exported collectHermesStartupDiagnostics() that runs it inside the sandbox, redacts tokens, filters/limits lines, and returns diagnostics only when the tirith marker exists. Extends failAgentSetup() to accept optional detail lines and integrates diagnostics into gateway timeout handling for Hermes.
Diagnostics collection tests
src/lib/agent/onboard.test.ts
Adds tests for collectHermesStartupDiagnostics() verifying diagnostics when marker present, empty list when absent, and redaction of sensitive env values in tailed logs.
Generated parity inventory and map updates
test/e2e/docs/parity-inventory.generated.json, test/e2e/docs/parity-map.yaml
Inserts new WhatsApp-related parity assertions (M-WA0M-WA9), updates assertion totals (2010→2036), and adds SPDX header lines to the parity map file.

Sequence Diagram

sequenceDiagram
  participant OnboardCLI as Onboard CLI
  participant GatewayProbe as Gateway Health Probe
  participant StartSh as agents/hermes/start.sh
  participant SandboxFS as Sandbox Filesystem
  participant PythonTools as tools.tirith_security (Python)
  participant RunCapture as runCaptureOpenshell
  participant Diagnostics as collectHermesStartupDiagnostics()
  participant FailReporter as failAgentSetup()

  OnboardCLI->>GatewayProbe: wait for gateway (timeout)
  StartSh->>SandboxFS: stat .tirith-install-failed
  alt marker present and regular file
    StartSh->>SandboxFS: read reason
    alt reason == download_failed
      StartSh->>PythonTools: invoke runtime tirith install
      PythonTools-->>StartSh: install success/failure
      StartSh->>SandboxFS: rm marker on success
    else
      StartSh->>OnboardCLI: log non-retryable reason
    end
  else marker is symlink
    StartSh->>OnboardCLI: log unsafe marker warning
  end
  alt Gateway still unhealthy after timeout
    GatewayProbe->>RunCapture: request diagnostics
    RunCapture->>Diagnostics: run diagnostics script in sandbox
    Diagnostics-->>RunCapture: redacted marker info + tailed logs
    RunCapture->>FailReporter: pass diagnostics as detail lines
    FailReporter->>OnboardCLI: report timeout with details
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

fix

Suggested reviewers

  • jyaunches
  • cv

Poem

I’m a rabbit in the log-filled glen,
I sniff the marker, check again.
If "download_failed" hops into view,
I nudge it loose so startup can do,
A happy gateway wakes — hop, hop, woohoo! 🐇

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name Status Explanation Resolution
Out of Scope Changes check ⚠️ Warning The PR includes unrelated changes to test/e2e/docs/parity-inventory.generated.json and test/e2e/docs/parity-map.yaml (WhatsApp parity assertions) that are not mentioned in the linked issue #3793 objectives and should be split out. Remove test/e2e/docs/parity-inventory.generated.json and test/e2e/docs/parity-map.yaml changes (WhatsApp parity additions) from this PR and merge separately to keep the focus on Tirith startup recovery.
Docstring Coverage ⚠️ Warning Docstring coverage is 36.36% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Linked Issues check ❓ Inconclusive The PR partially addresses the linked issue #3793 by implementing startup-time fallback (objective B) and diagnostic UX (objective C), but the implementation duplicates Tirith installer logic in bash rather than delegating to existing Python tools as recommended by reviewers, which may not fully satisfy the intent of non-blocking recovery and maintainability. Reconsider whether the bash implementation of retry_tirith_marker_if_needed() duplicates installer logic unnecessarily; consider delegating marker handling to Hermes' existing tools.tirith_security module instead of reimplementing in the shell entrypoint to reduce maintenance complexity.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix: recover Hermes Tirith startup marker' is closely related to the PR's main objective of retrying Hermes Tirith installer during startup when the install-failed marker is present, directly addressing issue #3793.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch bugfix/hermes-tirith-startup-diagnostics

Comment @coderabbitai help to get the list of available commands and usage tips.

@ericksoa ericksoa changed the title Fix Hermes Tirith startup marker recovery fix: recover Hermes Tirith startup marker May 19, 2026
@github-actions

github-actions Bot commented May 19, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: hermes-e2e
Optional E2E: rebuild-hermes-e2e, hermes-slack-e2e

Dispatch hint: hermes-e2e

Auto-dispatched E2E: hermes-e2e via nightly-e2e.yaml at 31977532c5b65663729fd59817ac225329833075nightly run

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • hermes-e2e (high (~60 min timeout; live cloud inference and sandbox build)): Fresh Hermes install/onboard is the highest-signal existing E2E for this PR: it builds/runs the Hermes sandbox image, executes agents/hermes/start.sh, waits for the Hermes health probe, verifies Hermes binary/config state, and performs live inference through the sandbox. This directly covers the changed startup and onboarding health path.

Optional E2E

  • rebuild-hermes-e2e (high (~60 min timeout)): Useful adjacent confidence for Hermes entrypoint changes during rebuild/upgrade of an existing Hermes sandbox image, but less directly tied than a fresh Hermes onboard and therefore not merge-blocking.
  • hermes-slack-e2e (high (~60 min timeout)): Optional because the new Hermes startup diagnostics explicitly redact log tails that may contain Slack tokens, and Hermes Slack exercises a messaging-enabled Hermes gateway startup. It is not required because the changed code only collects diagnostics on health failure and unit tests cover redaction behavior.

New E2E recommendations

  • hermes-tirith-retry-marker (high): No existing E2E appears to seed /sandbox/.hermes/.tirith-install-failed with download_failed before Hermes gateway startup and prove the entrypoint removes it so the Hermes runtime fallback retries Tirith installation. Unit coverage validates the function in isolation, but not the full sandbox startup behavior.
    • Suggested test: Add a Hermes startup regression E2E that creates a Hermes sandbox/state with a retryable Tirith install marker, starts or rebuilds the Hermes sandbox, verifies the marker is removed, the marker symlink case is not read, and Hermes reaches /health.
  • hermes-onboard-health-timeout-diagnostics (medium): The changed collectHermesStartupDiagnostics path only runs when Hermes health polling times out. Existing positive-path Hermes E2Es will not assert that Tirith marker details and redacted log tails appear in the onboard failure message.
    • Suggested test: Add a regression E2E that intentionally prevents Hermes gateway health from becoming ready with a Tirith marker present, runs nemoclaw onboard --agent hermes, and asserts the failure includes redacted Hermes startup diagnostics without leaking tokens.

Dispatch hint

  • Workflow: nightly-e2e.yaml
  • jobs input: hermes-e2e

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 26099397364
Target ref: b33108794ed61de009d3ad96b3580bd9b7c4b200
Workflow ref: main
Requested jobs: hermes-e2e
Summary: 0 passed, 1 failed, 0 skipped

Job Result
hermes-e2e ❌ failure

Failed jobs: hermes-e2e. Check run artifacts for logs.

@ericksoa ericksoa added bug Something fails against expected or documented behavior UAT Issues flagged for User Acceptance Testing. NV QA Bugs found by the NVIDIA QA Team labels May 19, 2026
ericksoa added 2 commits May 19, 2026 06:16
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26099797965
Target ref: 2ca502a7d785575ccd8b995ff1166ab462b02994
Workflow ref: main
Requested jobs: hermes-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
hermes-e2e ✅ success

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/e2e/docs/parity-map.yaml (1)

1-1: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add missing SPDX license header to this YAML file.

This file is missing the required SPDX copyright/license header.

As per coding guidelines, "Every source file must include an SPDX license header for copyright and Apache-2.0 license".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/docs/parity-map.yaml` at line 1, Add the missing SPDX license header
to the top of the YAML file so every source file includes copyright and license
info; insert a single-line SPDX header (e.g., "SPDX-FileCopyrightText:
[copyright holders]" and "SPDX-License-Identifier: Apache-2.0") immediately
above the existing root key (the "scripts:" entry) so the header appears as the
first lines of parity-map.yaml.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@test/e2e/docs/parity-map.yaml`:
- Line 1: Add the missing SPDX license header to the top of the YAML file so
every source file includes copyright and license info; insert a single-line SPDX
header (e.g., "SPDX-FileCopyrightText: [copyright holders]" and
"SPDX-License-Identifier: Apache-2.0") immediately above the existing root key
(the "scripts:" entry) so the header appears as the first lines of
parity-map.yaml.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fae99d2d-767c-4138-960a-56e735a785e7

📥 Commits

Reviewing files that changed from the base of the PR and between 2ca502a and 3d604d7.

📒 Files selected for processing (2)
  • test/e2e/docs/parity-inventory.generated.json
  • test/e2e/docs/parity-map.yaml
✅ Files skipped from review due to trivial changes (1)
  • test/e2e/docs/parity-inventory.generated.json

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26100131745
Target ref: 3d604d7c508f24be1d33cc9da04e31ee9b9f976a
Workflow ref: main
Requested jobs: hermes-e2e,rebuild-hermes-e2e
Summary: 2 passed, 0 failed, 0 skipped

Job Result
hermes-e2e ✅ success
rebuild-hermes-e2e ✅ success

@jyaunches

Copy link
Copy Markdown
Contributor

Thanks for jumping on this — the direction is right, especially the Step 7 diagnostics and the fact that retry failure remains non-blocking. I do think this needs one simplification pass before merge, though.

Request changes: avoid duplicating Tirith recovery in start.sh

The current agents/hermes/start.sh change embeds a fairly large Python bootstrap inside Bash and reaches into private tools.tirith_security internals (_install_tirith, _resolved_path, _install_failure_reason, _clear_install_failed, _mark_install_failed). That makes the entrypoint own a second Tirith installer compatibility layer, and it is the main reason this fix is so much code. It also forces the shell test to fabricate a fake Python package and regex-extract the shell function, which is a sign the logic is in the wrong layer.

From #3793, the useful fact is that Hermes already has a runtime fallback path that works: after nemoclaw <name> connect, hermes gateway run logged tools.tirith_security: tirith not found — downloading latest release... and recovered successfully. So I don't think NemoClaw needs to reimplement that installer path in start.sh; it should just get out of the way and trigger the existing Hermes fallback.

Could we reduce the startup change to marker handling plus invoking the existing fallback? For example, if the marker is what prevents Hermes from retrying, the startup script may only need to safely validate/remove the retryable marker before launching the gateway:

retry_tirith_marker_if_needed() {
  local marker="${HERMES_DIR}/.tirith-install-failed"
  [ -f "$marker" ] && [ ! -L "$marker" ] || return 0
  [ "$(head -n 1 "$marker" 2>/dev/null | tr -d '\r\n')" = "download_failed" ] || return 0

  echo "[tirith-bootstrap] download_failed marker present; letting Hermes runtime fallback retry" >&2
  rm -f "$marker" 2>/dev/null || true
}

If deleting the marker alone is not sufficient, then I would prefer adding/calling a small public Hermes helper (for example python -m tools.tirith_security ... or a dedicated tools.tirith_bootstrap entrypoint) rather than embedding the Python API probing in Bash. That keeps start.sh small and lets the Python behavior be tested directly in Python/Hermes instead of via TypeScript shell-function extraction.

Also: split unrelated parity churn if possible

The PR is currently +934/-199, but more than half of that is the generated WhatsApp parity-map/inventory refresh (test/e2e/docs/parity-*: +534/-196). That looks unrelated to the Hermes Tirith recovery and makes this PR much harder to review. If CI doesn't require it here, please split/remove that into a separate PR.

The diagnostics plumbing in src/lib/agent/onboard.ts looks reasonable to me, and the required hermes-e2e + rebuild-hermes-e2e runs passed on the current head. My main ask is to avoid adding a second Tirith installer path in the entrypoint.

@ericksoa ericksoa self-assigned this May 19, 2026
@ericksoa ericksoa added v0.0.46 Release target integration: hermes Hermes integration behavior labels May 19, 2026
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26102493171
Target ref: cdbf673edc17093c21dcf29854ffd0f1f7a284df
Workflow ref: main
Requested jobs: hermes-e2e
Summary: 0 passed, 0 failed, 0 skipped

Job Result
hermes-e2e ⚠️ cancelled

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/hermes-start.test.ts`:
- Around line 16-17: The regex in extractShellFunctionFromSource interpolates
the variable name directly into new RegExp and should escape regex
metacharacters to avoid ReDoS and unintended matches; modify
extractShellFunctionFromSource to first escape `name` (e.g. replace
/[.*+?^${}()|[\]\\]/g with '\\$&') and then build the RegExp using the
escapedName so the pattern becomes new RegExp(`${escapedName}\\(\\)
\\{([\\s\\S]*?)^\\}`, "m").
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 29578fbb-f10d-4958-b294-2a6a77fc5413

📥 Commits

Reviewing files that changed from the base of the PR and between 3d604d7 and ac9f57b.

📒 Files selected for processing (3)
  • agents/hermes/start.sh
  • test/e2e/docs/parity-map.yaml
  • test/hermes-start.test.ts

Comment thread test/hermes-start.test.ts Outdated
@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26102731689
Target ref: ac9f57b1655b579f2ad4cd9cedb784bb69073914
Workflow ref: main
Requested jobs: hermes-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
hermes-e2e ✅ success

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26103316766
Target ref: b6733b768930b9b94bfcd3cdbc8daa1d9ae979ad
Workflow ref: main
Requested jobs: hermes-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
hermes-e2e ✅ success

@ericksoa

Copy link
Copy Markdown
Contributor Author

@jyaunches addressed in ac9f57b16 and the small follow-up b6733b7.

For the Tirith startup path, I reduced agents/hermes/start.sh to marker handling only. The embedded Python bootstrap and all calls into private tools.tirith_security internals are gone. The new retry_tirith_marker_if_needed path only considers a regular-file marker whose first line is exactly download_failed; it refuses symlink/non-regular markers, leaves unknown reasons in place with a warning, and removes the retryable marker before hermes gateway run starts. The actual Tirith install/retry is now owned by Hermes' existing runtime fallback, which is the path #3793 already showed could recover once startup was allowed to retry. That keeps the entrypoint from becoming a second installer compatibility layer.

I also rewrote the startup tests around that narrower contract: one test proves download_failed clears the marker, one proves unknown reasons are preserved, and one proves a symlink marker is not read or removed. The Step 7 diagnostics plumbing stayed in place because it is independent of the installer retry path and gives actionable failure context if Hermes still does not become healthy.

On the parity files, I tried to split them out by restoring test/e2e/docs/parity-* to origin/main. The focused scenario-framework lint then failed with legacy-assertion-inventory-current, saying the generated parity inventory is stale and to rerun scripts/e2e/extract-legacy-assertions.ts. Because this branch needs the parity/docs gates to pass, I kept the generated refresh here and added the SPDX header CodeRabbit flagged. Those files are still not part of the Hermes runtime behavior; they are included because current main's generated parity inventory is stale against the lint.

Validation after the simplification included bash -n agents/hermes/start.sh, focused Vitest for test/hermes-start.test.ts and src/lib/agent/onboard.test.ts, strict parity-map validation, focused scenario-framework parity tests, npm run build:cli, and npm run typecheck:cli. Latest head b6733b7 is signed/verified, CodeRabbit is green, and the required hermes-e2e passed on that head.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26116747872
Target ref: 31977532c5b65663729fd59817ac225329833075
Workflow ref: main
Requested jobs: hermes-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
hermes-e2e ✅ success

@cv cv merged commit 364ac5e into main May 19, 2026
25 checks passed
@cv cv added the integration: whatsapp WhatsApp integration or channel behavior label May 30, 2026
@wscurran wscurran added bug-fix PR fixes a bug or regression and removed bug Something fails against expected or documented behavior labels Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix PR fixes a bug or regression integration: hermes Hermes integration behavior integration: whatsapp WhatsApp integration or channel behavior NV QA Bugs found by the NVIDIA QA Team UAT Issues flagged for User Acceptance Testing. v0.0.46 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Brev][Ubuntu 22.04][Onboard][Agent&Skills] NemoHermes onboard step 7 times out — Tirith build-time download_failed not retried at startup

4 participants