fix(maintenance): surface gateway schema preflight on host-process gateway drift#4602
Conversation
…teway drift backup-all and upgrade-sandboxes only ran the structured OpenShell gateway schema preflight for legacy cluster-container image drift. On a host-process / Docker-driver gateway there is no openshell-cluster-* container, so the preflight returned null and a drifted/older gateway binary fell through to the generic "Failed to query running sandboxes" recovery message instead of the structured "schema preflight failed" guidance. - Add host-process gateway drift detection in gateway-drift.ts: compare the installed OpenShell CLI version against the running/launchable gateway binary's --version. Resolve the binary from a live runtime marker, else env override / sibling of the resolved openshell binary / standard install paths. A stale marker (dead or recycled PID, verified via argv0) is ignored so a reinstall elsewhere cannot fabricate drift. - Only suppress host-process detection when an *active* cluster gateway exists, so a stopped leftover cluster container no longer masks host-process drift. - Format a "Running gateway binary: <bin> (<ver>)" line plus host-process recovery guidance; keep the cluster-container output intact. - Attach host-process drift detail to the reactive protobuf-mismatch path. - Tighten Docker-driver gateway startup readiness in onboard.ts: re-confirm process liveness after the health probes so a gateway that crashes during migration can no longer print a misleading "healthy" line. Preserves the existing fail-closed behavior: backup-all/upgrade-sandboxes exit 1 before trusting live sandbox state or mutating backup/rebuild state. Adds unit coverage (gateway-drift.test.ts, gateway-drift-preflight.test.ts) and host-process e2e variants (live marker, marker-less fallback, stale-marker false-positive guard) in test-gateway-drift-preflight.sh. Fixes NVIDIA#4430 Signed-off-by: Yimo Jiang <yimoj@nvidia.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds host-process (non-container) gateway drift detection: types and runtime discovery via Docker-driver marker, on-disk gateway binary version probing, drift reporting (host_process_drift or image drift), formatted diagnostics/recovery branching, preflight integration that can early-exit actions, tighter health checks, and unit/e2e tests. ChangesHost-process gateway drift detection
Sequence Diagram(s)This PR does not include additional sequence diagrams. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/adapters/openshell/gateway-drift.ts`:
- Around line 521-530: The recovery text falls back to cluster-container wording
when drift === null; change the logic in the isHostProcessGatewayDrift(...)
conditional branch so that when drift is null you emit a gateway-neutral,
state-preserving sentence instead of referencing container-specific Docker
volumes. Concretely, in the block using gatewayContainerName (and
getGatewayClusterContainerName/gatewayName), detect drift === null and push a
neutral message (e.g., "do not remove gateway Docker volumes or backups unless
you have a backup and explicitly accept state loss") so that host-process null
cases remain state-preserving without mentioning openshell-cluster-* container
names; keep the existing host-process and container-specific messages for the
respective non-null cases and still reference CLI_DISPLAY_NAME where
appropriate.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: acf817e3-d9b3-48bb-94c0-6a92f5226705
📒 Files selected for processing (7)
src/lib/actions/gateway-drift-preflight.test.tssrc/lib/actions/sandbox/status.tssrc/lib/adapters/openshell/gateway-drift.test.tssrc/lib/adapters/openshell/gateway-drift.tssrc/lib/onboard.tssrc/lib/onboard/host-gateway-process.tstest/e2e/test-gateway-drift-preflight.sh
…n drift is unknown When a protobuf/schema mismatch is detected reactively but neither cluster-image nor host-process drift could be resolved (the documented in-place-upgrade case on a Docker-driver gateway), the recovery footer fell back to cluster-container wording and told the user not to remove `openshell-cluster-*` Docker volumes — even though no cluster gateway is involved. Emit a gateway-neutral, state-preserving message for that null-drift case; cluster and host-process drift keep their specific guidance. Addresses CodeRabbit review feedback on NVIDIA#4602. Signed-off-by: Yimo Jiang <yimoj@nvidia.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@coderabbitai review The single actionable finding (neutral recovery guidance when drift context is missing) was addressed in daebfcf: the |
|
✅ Actions performedReview triggered.
|
Summary
backup-allandupgrade-sandboxesonly ran the structured OpenShell gateway schema preflight for legacy cluster-container image drift. On a host-process / Docker-driver gateway there is noopenshell-cluster-*container to inspect, so the preflight returnednulland a drifted/older gateway binary fell through to the genericFailed to query running sandboxesrecovery message instead of the structured "schema preflight failed" guidance. This extends drift detection to host-process gateways while preserving the existing fail-closed (no-mutation) behavior.Related Issue
Fixes #4430 — #4430
Changes
src/lib/adapters/openshell/gateway-drift.ts): compare the installed OpenShell CLI version against the running/launchable gateway binary's--version. The binary is resolved from a live runtime marker, elseNEMOCLAW_OPENSHELL_GATEWAY_BIN→ sibling of the resolvedopenshellbinary → standard install paths (mirrorsresolveOpenShellGatewayBinary). A stale marker (dead or recycled PID — verified via argv0 using the sharedhostGatewayCmdlineMatches) is ignored so a reinstall elsewhere cannot fabricate drift.openshell-cluster-*container no longer masks host-process drift.host_process_driftissue kind formats aRunning gateway binary: <bin> (<ver>)line plus host-process-appropriate recovery guidance; cluster-container output is unchanged. Host-process drift detail is also attached to the reactive protobuf-mismatch path.src/lib/onboard.ts, net-neutral): re-confirm process liveness after the status/gateway-info/TCP probes so a gateway that crashes during migration can no longer print a misleading✓ Docker-driver gateway is healthyline.gateway-drift.test.tsandgateway-drift-preflight.test.ts; host-process e2e variants (live marker, marker-less fallback, stale-marker false-positive guard) intest/e2e/test-gateway-drift-preflight.sh.Fail-closed behavior is preserved: both commands still exit 1 before trusting live sandbox state or mutating backup/rebuild state. The narrow remaining gap — an in-place OpenShell upgrade that leaves an old gateway process running while the on-disk binary is already new — has no reliable local served-version signal, but stays fail-closed: querying that process yields a protobuf/schema mismatch that
detectOpenShellStateRpcResultIssuesurfaces with the same structured guidance.Type of Change
Verification
gateway-drift,gateway-drift-preflight,sandbox/status,host-gateway-process,docker-driver-gateway-runtime-marker, and adjacent gateway-state/rebuild tests)test/e2e/test-gateway-drift-preflight.shpasses (cluster image drift, protobuf mismatch, host-process live-marker / marker-less / stale-marker cases)npm run typecheck:cli,biome lint, and the layer-import-boundary check passnpm test(full suite) — targeted + e2e suites run green; the full run has pre-existing environmental failures on this host (umask-dependentconfig-sync, oclif-usage, ollama/inference-route, andtsx-subprocess timeout tests) that reproduce onmainwithout this changecodex review --uncommittedreports no actionable findingsSigned-off-by: Yimo Jiang yimoj@nvidia.com
Summary by CodeRabbit
New Features
Bug Fixes
Tests