Skip to content

fix(maintenance): surface gateway schema preflight on host-process gateway drift#4602

Merged
cv merged 2 commits into
NVIDIA:mainfrom
yimoj:fix/4430-host-gateway-drift-preflight
Jun 1, 2026
Merged

fix(maintenance): surface gateway schema preflight on host-process gateway drift#4602
cv merged 2 commits into
NVIDIA:mainfrom
yimoj:fix/4430-host-gateway-drift-preflight

Conversation

@yimoj

@yimoj yimoj commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Summary

backup-all and upgrade-sandboxes only ran the structured OpenShell gateway schema preflight for legacy cluster-container image drift. On a host-process / Docker-driver gateway there is no openshell-cluster-* container to inspect, so the preflight returned null and a drifted/older gateway binary fell through to the generic Failed to query running sandboxes recovery message instead of the structured "schema preflight failed" guidance. This extends drift detection to host-process gateways while preserving the existing fail-closed (no-mutation) behavior.

Related Issue

Fixes #4430#4430

Changes

  • Host-process gateway drift detection (src/lib/adapters/openshell/gateway-drift.ts): compare the installed OpenShell CLI version against the running/launchable gateway binary's --version. The binary is resolved from a live runtime marker, else NEMOCLAW_OPENSHELL_GATEWAY_BIN → sibling of the resolved openshell binary → standard install paths (mirrors resolveOpenShellGatewayBinary). A stale marker (dead or recycled PID — verified via argv0 using the shared hostGatewayCmdlineMatches) is ignored so a reinstall elsewhere cannot fabricate drift.
  • Host-process detection is suppressed only when an active cluster gateway exists, so a stopped leftover openshell-cluster-* container no longer masks host-process drift.
  • New host_process_drift issue kind formats a Running gateway binary: <bin> (<ver>) line plus host-process-appropriate recovery guidance; cluster-container output is unchanged. Host-process drift detail is also attached to the reactive protobuf-mismatch path.
  • Startup readiness tightening (src/lib/onboard.ts, net-neutral): re-confirm process liveness after the status/gateway-info/TCP probes so a gateway that crashes during migration can no longer print a misleading ✓ Docker-driver gateway is healthy line.
  • Tests: unit coverage in gateway-drift.test.ts and gateway-drift-preflight.test.ts; host-process e2e variants (live marker, marker-less fallback, stale-marker false-positive guard) in test/e2e/test-gateway-drift-preflight.sh.

Fail-closed behavior is preserved: both commands still exit 1 before trusting live sandbox state or mutating backup/rebuild state. The narrow remaining gap — an in-place OpenShell upgrade that leaves an old gateway process running while the on-disk binary is already new — has no reliable local served-version signal, but stays fail-closed: querying that process yields a protobuf/schema mismatch that detectOpenShellStateRpcResultIssue surfaces with the same structured guidance.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • Targeted Vitest suites pass (gateway-drift, gateway-drift-preflight, sandbox/status, host-gateway-process, docker-driver-gateway-runtime-marker, and adjacent gateway-state/rebuild tests)
  • test/e2e/test-gateway-drift-preflight.sh passes (cluster image drift, protobuf mismatch, host-process live-marker / marker-less / stale-marker cases)
  • npm run typecheck:cli, biome lint, and the layer-import-boundary check pass
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • npm test (full suite) — targeted + e2e suites run green; the full run has pre-existing environmental failures on this host (umask-dependent config-sync, oclif-usage, ollama/inference-route, and tsx-subprocess timeout tests) that reproduce on main without this change
  • Docs updated — not required; this changes diagnostic CLI output only, no new user-facing flags or pages
  • codex review --uncommitted reports no actionable findings

Signed-off-by: Yimo Jiang yimoj@nvidia.com

Summary by CodeRabbit

  • New Features

    • Detect host-process gateway binary drift and surface clear, gateway-specific diagnostics and recovery guidance.
  • Bug Fixes

    • Tighter gateway health checks to avoid reporting a gateway as healthy if its process exits during probing.
  • Tests

    • Added unit and end-to-end tests covering host-process drift detection and early termination behavior for backup/upgrade preflight.

…teway drift

backup-all and upgrade-sandboxes only ran the structured OpenShell gateway
schema preflight for legacy cluster-container image drift. On a host-process /
Docker-driver gateway there is no openshell-cluster-* container, so the
preflight returned null and a drifted/older gateway binary fell through to the
generic "Failed to query running sandboxes" recovery message instead of the
structured "schema preflight failed" guidance.

- Add host-process gateway drift detection in gateway-drift.ts: compare the
  installed OpenShell CLI version against the running/launchable gateway
  binary's --version. Resolve the binary from a live runtime marker, else env
  override / sibling of the resolved openshell binary / standard install paths.
  A stale marker (dead or recycled PID, verified via argv0) is ignored so a
  reinstall elsewhere cannot fabricate drift.
- Only suppress host-process detection when an *active* cluster gateway exists,
  so a stopped leftover cluster container no longer masks host-process drift.
- Format a "Running gateway binary: <bin> (<ver>)" line plus host-process
  recovery guidance; keep the cluster-container output intact.
- Attach host-process drift detail to the reactive protobuf-mismatch path.
- Tighten Docker-driver gateway startup readiness in onboard.ts: re-confirm
  process liveness after the health probes so a gateway that crashes during
  migration can no longer print a misleading "healthy" line.

Preserves the existing fail-closed behavior: backup-all/upgrade-sandboxes exit 1
before trusting live sandbox state or mutating backup/rebuild state.

Adds unit coverage (gateway-drift.test.ts, gateway-drift-preflight.test.ts) and
host-process e2e variants (live marker, marker-less fallback, stale-marker
false-positive guard) in test-gateway-drift-preflight.sh.

Fixes NVIDIA#4430

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f6666533-cccc-438d-b201-305791b9cb24

📥 Commits

Reviewing files that changed from the base of the PR and between 8cb2ab2 and daebfcf.

📒 Files selected for processing (2)
  • src/lib/adapters/openshell/gateway-drift.test.ts
  • src/lib/adapters/openshell/gateway-drift.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/adapters/openshell/gateway-drift.test.ts

📝 Walkthrough

Walkthrough

Adds host-process (non-container) gateway drift detection: types and runtime discovery via Docker-driver marker, on-disk gateway binary version probing, drift reporting (host_process_drift or image drift), formatted diagnostics/recovery branching, preflight integration that can early-exit actions, tighter health checks, and unit/e2e tests.

Changes

Host-process gateway drift detection

Layer / File(s) Summary
Type system and core drift detection logic
src/lib/adapters/openshell/gateway-drift.ts
Introduces GatewayHostProcessDrift, GatewayDrift, and HostProcessGatewayRuntime; expands OpenShellStateRpcIssue. Implements reading Docker-driver runtime marker, resolving gateway binary path with PID/cmdline verification, probing binary --version, and comparing to installed CLI to produce host-process drift.
Issue reporting and user-facing output
src/lib/adapters/openshell/gateway-drift.ts
Coalesces image vs host-process drift for protobuf mismatches. formatOpenShellStateRpcIssue now renders "Running gateway binary" and branches recovery guidance: host-process wording omits Docker-volume deletion guidance while image drift retains volume-oriented guidance.
Status contract and preflight action integration
src/lib/actions/sandbox/status.ts, src/lib/actions/gateway-drift-preflight.test.ts
Adds host_process_drift to SandboxStatusReport.rpcIssue kinds. Tests add a hostProcessDriftIssue fixture and verify backupAll() and upgradeSandboxes({ check: true }) early-exit (via process.exit(1)) when preflight reports host-process drift.
Unit test coverage
src/lib/adapters/openshell/gateway-drift.test.ts
Adds tests for getGatewayHostProcessDrift covering absent cluster container, matching/mismatching versions, active-cluster suppression, leftover containers, unprobeable versions, preflight surfacing, formatting, protobuf mismatch association, and disabling under test sentinel.
Gateway health check and utilities
src/lib/onboard.ts, src/lib/onboard/host-gateway-process.ts
Health poll now re-checks child process exit state and PID liveness after status/info/TCP probes. Exports hostGatewayCmdlineMatches for reuse in identity verification.
E2E test infrastructure and scenarios
test/e2e/test-gateway-drift-preflight.sh
Adds live-marker PID tracking/cleanup, docker inspect mock, fake long-lived openshell-gateway binary with configurable --version, runtime marker writer, run_host_process_case helper, and e2e cases asserting preflight early-exit and correct handling of stale markers and marker-less resolution.

Sequence Diagram(s)

This PR does not include additional sequence diagrams.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A gateway drifts both old and new,
Markers point me to the clue —
I probe the binary, check the pid,
Say "halt" when versions don't fit.
Hop, test, and patch — a tidy view.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 22.73% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and specifically describes the main change: extending gateway schema preflight diagnostics to detect and surface host-process gateway drift.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/adapters/openshell/gateway-drift.ts`:
- Around line 521-530: The recovery text falls back to cluster-container wording
when drift === null; change the logic in the isHostProcessGatewayDrift(...)
conditional branch so that when drift is null you emit a gateway-neutral,
state-preserving sentence instead of referencing container-specific Docker
volumes. Concretely, in the block using gatewayContainerName (and
getGatewayClusterContainerName/gatewayName), detect drift === null and push a
neutral message (e.g., "do not remove gateway Docker volumes or backups unless
you have a backup and explicitly accept state loss") so that host-process null
cases remain state-preserving without mentioning openshell-cluster-* container
names; keep the existing host-process and container-specific messages for the
respective non-null cases and still reference CLI_DISPLAY_NAME where
appropriate.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: acf817e3-d9b3-48bb-94c0-6a92f5226705

📥 Commits

Reviewing files that changed from the base of the PR and between df7d054 and 8cb2ab2.

📒 Files selected for processing (7)
  • src/lib/actions/gateway-drift-preflight.test.ts
  • src/lib/actions/sandbox/status.ts
  • src/lib/adapters/openshell/gateway-drift.test.ts
  • src/lib/adapters/openshell/gateway-drift.ts
  • src/lib/onboard.ts
  • src/lib/onboard/host-gateway-process.ts
  • test/e2e/test-gateway-drift-preflight.sh

Comment thread src/lib/adapters/openshell/gateway-drift.ts
…n drift is unknown

When a protobuf/schema mismatch is detected reactively but neither cluster-image
nor host-process drift could be resolved (the documented in-place-upgrade case
on a Docker-driver gateway), the recovery footer fell back to cluster-container
wording and told the user not to remove `openshell-cluster-*` Docker volumes —
even though no cluster gateway is involved. Emit a gateway-neutral,
state-preserving message for that null-drift case; cluster and host-process
drift keep their specific guidance. Addresses CodeRabbit review feedback on NVIDIA#4602.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@yimoj yimoj added the v0.0.56 Release target label Jun 1, 2026
@yimoj

yimoj commented Jun 1, 2026

Copy link
Copy Markdown
Contributor Author

@coderabbitai review

The single actionable finding (neutral recovery guidance when drift context is missing) was addressed in daebfcf: the protobuf_mismatch + drift === null case now emits gateway-neutral, state-preserving guidance instead of cluster-container wording. Requesting an incremental review to confirm.

@coderabbitai

coderabbitai Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

@yimoj Sure, I'll run an incremental review focusing on the protobuf_mismatch + drift === null neutral-guidance fix from daebfcf2e and any related changes.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@wscurran

wscurran commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

@cv cv added v0.0.57 Release target and removed v0.0.56 Release target labels Jun 1, 2026
@cv cv merged commit 83db0d0 into NVIDIA:main Jun 1, 2026
30 checks passed
@wscurran wscurran added area: cli Command line interface, flags, terminal UX, or output area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression and removed NemoClaw CLI labels Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: cli Command line interface, flags, terminal UX, or output area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression v0.0.57 Release target

Projects

None yet

3 participants