Skip to content

fix(connect): auto-recover from SSH identity drift after host reboot#2064

Merged
ericksoa merged 1 commit into
mainfrom
fix/2056-reboot-identity-drift-v2
Apr 18, 2026
Merged

fix(connect): auto-recover from SSH identity drift after host reboot#2064
ericksoa merged 1 commit into
mainfrom
fix/2056-reboot-identity-drift-v2

Conversation

@ericksoa

@ericksoa ericksoa commented Apr 18, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Auto-recover from SSH identity drift after host reboot instead of requiring full re-onboard
  • Fix registry recovery gate to trigger for bare nemoclaw <name> (no explicit action)
  • Probe live gateway when requestedSandboxName is set, even with empty registry

Fixes #2056

Supersedes #2057 (which had commits from incorrect git identity).

Changes

src/nemoclaw.ts:

  1. Import pruneKnownHostsEntries from ./lib/onboard
  2. Registry recovery gate (line ~2408): added "" to the action allowlist so bare nemoclaw <name> triggers recovery when registry is empty
  3. shouldProbeLiveGateway (line ~477): include requestedSandboxName in the probe condition so recovery works when both registry and session are empty
  4. ensureLiveSandboxOrExit identity_drift handler (line ~763): instead of process.exit(1), clear stale openshell-* entries from ~/.ssh/known_hosts using the existing pruneKnownHostsEntries helper and retry the sandbox lookup

test/cli.test.ts:

  • Updated assertion for the identity drift test to match new auto-recovery behavior (error message changed from "gateway trust material rotated" to "Could not reconnect")

test/reboot-identity-drift.test.ts (new):

  • 6 tests covering healthy reconnect, identity drift detection, and registry recovery from empty state — both bare and explicit command forms

Type of Change

  • Code change (feature, bug fix, or refactor)

Verification

  • npm run build:cli passes
  • npx vitest run test/reboot-identity-drift.test.ts — 6/6 pass
  • npx vitest run test/cli.test.ts — all pass (including updated identity_drift assertion)
  • npm test — 1690 pass, 1 pre-existing failure (version string mismatch, unrelated)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Improved SSH identity drift recovery after reboot by clearing stale host keys and attempting reconnection
    • Enhanced gateway probing to better detect live gateways when sandbox names are provided
    • Updated error messaging for reconnection failures
  • Tests

    • Added regression test suite for post-reboot SSH identity drift and registry recovery scenarios

(Fixes #2056)

Three fixes for the post-reboot reconnection failure:

1. Registry recovery gate now triggers for bare `nemoclaw <name>` (no
   explicit action). Previously `args[0]` was undefined, which didn't
   match the allowlist, skipping recovery entirely.

2. Live gateway probe now runs when `requestedSandboxName` is set, even
   if the registry is empty and there's no session file.

3. Identity drift in `ensureLiveSandboxOrExit` now auto-clears stale
   SSH known_hosts entries and retries the sandbox lookup instead of
   immediately exiting with an error.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Apr 18, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

The pull request implements auto-recovery from SSH identity drift after host reboot by clearing stale SSH known_hosts entries and retrying gateway connections, while broadening registry recovery gating to handle bare sandbox commands.

Changes

Cohort / File(s) Summary
Identity Drift Recovery
src/nemoclaw.ts
Added pruneKnownHostsEntries import; identity_drift recovery path now clears stale SSH host keys from ~/.ssh/known_hosts and retries gateway connection. Updated console messaging to reflect SSH key clearing and reconnect attempt. Broadened shouldProbeLiveGateway logic and fixed registry recovery gating to include empty-string action for bare sandbox commands.
Test Assertions
test/cli.test.ts
Updated test expectation for gateway trust rotation scenario to expect "Could not reconnect" message instead of "gateway trust material rotated after restart".
Regression Test Suite
test/reboot-identity-drift.test.ts
New Vitest test suite with temporary isolated HOME fixtures, stub openshell executables, and helpers to simulate post-reboot SSH identity drift and registry recovery. Tests both successful sandbox connection and identity drift detection/recovery paths, plus registry recovery when gateway is live but registry is empty.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A reboot came, SSH keys turned strange,
But clever nemoclaw rose to the change—
Pruning old entries with practised care,
Reconnecting through drift-laden air,
The sandbox hops on, no re-onboard despair! 🏠✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately describes the main fix: auto-recovery from SSH identity drift after host reboot, which is the primary change across the files.
Linked Issues check ✅ Passed All coding requirements from issue #2056 are implemented: clearing stale known_hosts entries on identity_drift [#2056], retrying sandbox lookup after cleanup [#2056], and fixing the registry recovery gate to include bare commands [#2056].
Out of Scope Changes check ✅ Passed All changes directly address the three coordinated fixes outlined in issue #2056; no extraneous modifications or unrelated functionality were introduced.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/2056-reboot-identity-drift-v2

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
test/reboot-identity-drift.test.ts (1)

9-12: Outdated comment: this PR is the fix for #2056.

Line 12 says "Once the fix for #2056 lands, update these tests to assert auto-recovery" — but this PR implements that fix. Consider removing or updating this comment to avoid confusion for future readers.

♻️ Suggested fix
 // Simulates the post-reboot scenario where the gateway restarts with new SSH
 // keys, causing "handshake verification failed" errors. Verifies:
 //   1. The registry recovery gate triggers for bare `nemoclaw <name>` (no action)
-//   2. Identity drift is detected and surfaced (current behavior)
-//
-// Once the fix for `#2056` lands, update these tests to assert auto-recovery.
+//   2. Identity drift is detected, stale SSH keys are cleared, and reconnect is attempted
+//   3. Registry recovery works when the registry is empty but gateway is live
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/reboot-identity-drift.test.ts` around lines 9 - 12, The top block
comment in reboot-identity-drift.test.ts contains an outdated note "Once the fix
for `#2056` lands, update these tests to assert auto-recovery"; update or remove
that sentence so it reflects that this PR implements the fix for `#2056`—either
delete the line or replace it with a short note that the tests now assert
auto-recovery (keep the remaining comment lines about registry recovery and
identity drift if still relevant) to avoid confusion for future readers.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@test/reboot-identity-drift.test.ts`:
- Around line 9-12: The top block comment in reboot-identity-drift.test.ts
contains an outdated note "Once the fix for `#2056` lands, update these tests to
assert auto-recovery"; update or remove that sentence so it reflects that this
PR implements the fix for `#2056`—either delete the line or replace it with a
short note that the tests now assert auto-recovery (keep the remaining comment
lines about registry recovery and identity drift if still relevant) to avoid
confusion for future readers.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 86ff307d-eef8-4445-a479-aeec71b0014b

📥 Commits

Reviewing files that changed from the base of the PR and between bbbaa0f and 4620c7c.

📒 Files selected for processing (3)
  • src/nemoclaw.ts
  • test/cli.test.ts
  • test/reboot-identity-drift.test.ts

@ericksoa ericksoa merged commit 2d1649f into main Apr 18, 2026
17 checks passed
ericksoa pushed a commit that referenced this pull request Apr 21, 2026
## Summary

Catches up the user-facing reference and troubleshooting docs with the
CLI and policy behavior changes that landed in v0.0.21. Drafted via the
`nemoclaw-contributor-update-docs` skill against commits in
`v0.0.20..v0.0.21`, filtered through `docs/.docs-skip`.

## Changes

- **`docs/reference/commands.md`**
- `nemoclaw list`: session indicator (●) for connected sandboxes
(#2117).
- `nemoclaw <name> connect`: active-session note; auto-recovery from SSH
identity drift after a host reboot (#2117, #2064).
- `nemoclaw <name> status`: three-state Inference line (`healthy` /
`unreachable` / `not probed`) covering both local and remote providers;
new `Connected` line (#2002, #2117).
- `nemoclaw <name> destroy` and `rebuild`: active-session warning with
second confirm; rebuild reapplies policy presets to the recreated
sandbox (#2117, #2026).
- `nemoclaw <name> policy-add` and `policy-remove`: positional preset
argument and non-interactive flow via
`--yes`/`--force`/`NEMOCLAW_NON_INTERACTIVE=1` (#2070).
- `nemoclaw <name> policy-list`: registry-vs-gateway desync detection
(#2089).
- **`docs/reference/troubleshooting.md`**
- `Reconnect after a host reboot`: now reflects automatic stale
`known_hosts` pruning on `connect` (#2064).
- `Running multiple sandboxes simultaneously`: onboard's forward-port
collision guard (#2086).
- New section: `openclaw config set` or `unset` is blocked inside the
sandbox (#2081).
- **`docs/network-policy/customize-network-policy.md`**: non-interactive
`policy-add`/`policy-remove` form; preset preservation across rebuild
(#2070, #2026).
- **`docs/inference/use-local-inference.md`**: NIM section now covers
the NGC API key prompt with masked input and `docker login nvcr.io
--password-stdin` behavior (#2043).
- **Generated skills regenerated** to pick up the source changes
(`.agents/skills/nemoclaw-user-reference/references/{commands,troubleshooting}.md`,
plus minor heading-flow deltas elsewhere). The pre-commit `Regenerate
agent skills from docs` hook ran and confirmed source ↔ generated
parity.

Commits skipped per `docs/.docs-skip` or no doc impact: `bbbaa0fb`
(skip-features), `7cb482cb` (skip-features), `8dee23fd` (skip-terms),
plus the usual CI / test / refactor / install-plumbing churn.

## Type of Change

- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification

- [x] `npx prek run --all-files` passes for the modified files (the one
failing test, `test/cli.test.ts > unknown command exits 1`, also fails
on `origin/main` and is unrelated to these markdown-only changes)
- [ ] `npm test` passes — skipped; same pre-existing CLI-dispatch test
failure unrelated to docs
- [ ] Tests added or updated for new or changed behavior — n/a, doc-only
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [ ] `make docs` builds without warnings (doc changes only) — not run
locally
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)
— n/a, no new pages

## AI Disclosure

- [x] AI-assisted — tool: Claude Code

---
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
  * Multi-session SSH connections with concurrent session support.
* Three-state inference health reporting (healthy/unreachable/not
probed) across all providers.
  * Automatic SSH host key rotation detection and recovery.
  * Non-interactive policy preset management via positional arguments.
  * Session indicators in sandbox list view.

* **Bug Fixes**
  * Protected destructive operations with active-session warnings.
  * Policy presets now preserved during sandbox rebuilds.

* **Documentation**
  * NGC registry authentication requirements for container images.
  * Multi-sandbox onboarding and reconnection guidance.
  * Troubleshooting updates for port conflicts and SSH issues.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wscurran wscurran added the bug-fix PR fixes a bug or regression label Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix PR fixes a bug or regression

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(connect): auto-recover from SSH identity drift after host reboot

3 participants