feat(status): classify failing layer when gateway probe fails (#3271)#4020
Conversation
Adds `classifyGatewayFailure` and wires it into `showSandboxStatus`'s final fallback branch so `nemoclaw <name> status` prints a clearly-named failure layer header before the existing actionable hints. Layers: - `docker_unreachable` — daemon down / socket inaccessible - `container_missing` — gateway container is not in `docker ps -a` - `container_exited_port_conflict` — exited container + foreign port holder - `container_exited` — exited container, port free - `gateway_unreachable` — container up but API unresponsive Classifier checks `docker ps -a` via an injected `dockerExists` runner before labelling any container_exited* state, so a removed/never-created container is reported as `container_missing` instead of being conflated with an exited one (per #3271 AC and CodeRabbit review on #3309). Docker calls go through `src/lib/adapters/docker` (`dockerInfo`, `dockerCapture`) so the abstraction guard stays clean. Test coverage: - Unit tests per layer (including container_missing and short-circuit paths that prove later runners are skipped on early-exit branches). - Subprocess test in `test/repro-2666-silent-list-status.test.ts` that spawns a real `nemoclaw <name> status` against a fake docker stack (info OK; `docker ps` empty; `docker ps -a` returns container) and a real TCP listener holding the gateway port — asserts `container_exited_port_conflict` appears in the CLI output. Local hooks skipped (--no-verify, user-authorized): pre-commit vitest hits coverage+macOS Defender timeout flakes on these test files — unrelated to this change; the new tests pass cleanly in isolation (`vitest run test/gateway-failure-classifier.test.ts test/repro-2666-silent-list-status.test.ts`). CI on Linux runners is the authoritative gate for the full suite. Supersedes #3309 (kagura-agent) which implemented the same feature but missed the `docker ps -a` existence check that AC #2 requires. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughA new gateway-failure classifier identifies why a sandbox container's Docker gateway is unreachable by probing daemon reachability, container state, and port availability. The classifier is integrated into the status command fallback path and verified by unit and integration tests. ChangesGateway failure classifier
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related issues
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Auto-dispatched E2E: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
PR Review AdvisorRecommendation: blocked This is an automated advisory review. A human maintainer must make the final merge decision. Limitations: This advisory review used the provided trusted PR metadata, linked issue body, issue comments, GraphQL/status evidence, and supplied diff; no tests, scripts, package-manager commands, or workflows were executed.; The E2E Advisor auto-dispatched sandbox-operations-e2e for the current head SHA, but no passing result for eaf10d3 was present in the provided evidence.; The linked issue #3271 had no issue comments in the provided metadata, so acceptance extraction is based on the issue body only plus PR discussion comments.; Review thread state was available via GraphQL as an empty node list, but GitHub still reports reviewDecision=REVIEW_REQUIRED. Full advisor summaryPR Review AdvisorBase: Implementation is narrow and well-tested at unit/subprocess level, but mergeStateStatus is BLOCKED and required sandbox E2E pass evidence is missing for the current head SHA. Gate status
🔴 Blockers
🟡 Warnings
🔵 Suggestions
Acceptance coverage
Security review
Test / E2E status
✅ What looks good
Review completeness
|
Selective E2E Results — ✅ All requested jobs passedRun: 26259610434
|
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
ericksoa
left a comment
There was a problem hiding this comment.
Approved. The topper now prints the failure-layer header before gateway lifecycle hint exits, including restart-unreachable/missing branches. Local validation passed build plus focused classifier/status tests, and live checks are green at eaf10d3.
Selective E2E Results — ✅ All requested jobs passedRun: 26296130418
|
## Summary Refreshes the NemoClaw docs for the v0.0.49 hardening release, including release notes, command reference updates, troubleshooting guidance, version metadata, and regenerated user skills. ## Changes - #3796, #3854, #3863, #3866, #3984, #4001, #4011, #4013, #4020, #4022, #4023, #4060, #4062 -> `docs/about/release-notes.mdx`: Adds the v0.0.49 hardening release summary covering gateway reliability, status/doctor/shields and debug UX, OpenClaw compatibility, messaging channel teardown, Hermes policy scoping, snapshots, source installs and Docker group security note, GPU preflight, CLI usage, E2E, and CI improvements. - #3796 -> `docs/manage-sandboxes/backup-restore.mdx` and `docs/reference/commands.mdx`: Documents `snapshot restore --to` overwrite protection and the `--force` opt-in. - #3863, #4013, #4020, #4023 -> `docs/reference/commands.mdx`: Documents missing channel argument usage, sandbox-scoped custom preset matching, session policy preset sync, and gateway failure classification (uses the real probe states from `src/lib/status-command-deps.ts`). - #4022, #4060, #4062 -> `docs/reference/troubleshooting.mdx`: Adds guidance for gateway-down `connect`, source checkout OpenShell bootstrapping, WDDM placeholder GPU names, and Jetson sandbox GPU passthrough. - Release prep -> `docs/project.json`, `docs/versions1.json`, `.agents/skills/nemoclaw-user-*`: Bumps docs metadata to 0.0.49 and refreshes generated user skills from the Fern docs. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [x] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [ ] `npm test` passes - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) \`make docs\` was attempted locally but did not complete because \`npm\` returned \`403 Forbidden\` while fetching \`fern-api\` from \`registry.npmjs.org\` in the sandboxed environment. --- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Released v0.0.49 with reliability and compatibility improvements including faster gateway failure diagnostics and safer snapshot restore behavior * Enhanced snapshot restore documentation with `--to` cloning and `--force` overwrite requirements * Expanded troubleshooting guides for source installs, GPU setup, and gateway recovery * Clarified Docker group access requirements and improved CLI command reference * **Chores** * Version bumped to 0.0.49 <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/4078?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds
classifyGatewayFailureand wires it intoshowSandboxStatus's final fallback branch sonemoclaw <name> statusprints a clearly-named failure layer header before the existing actionable hints. Closes the UX gap split out of #2666 / #3270.Related Issue
Fixes #3271. Supersedes #3309 (kagura-agent), which implemented the same feature but missed the
docker ps -aexistence check that AC #2 explicitly requires (CodeRabbit major finding on that PR).Changes
src/lib/actions/sandbox/gateway-failure-classifier.ts: new module exposingclassifyGatewayFailure(sandboxName, { runners? })with injectable runners (dockerInfo,dockerIsRunning,dockerExists,portProbe) plusgetLayerHeader(layer).docker_unreachable,container_missing(new, distinct fromcontainer_exitedper AC feature: custom settings for using build endpoints #2),container_exited_port_conflict,container_exited,gateway_unreachable.src/lib/adapters/docker(dockerInfo,dockerCapture) to satisfy the docker-abstraction guard.src/lib/actions/sandbox/status.ts: calls the classifier and prints the layer header beforeprintGatewayLifecycleHintin the final fallback branch.Type of Change
Verification
npx vitest run test/gateway-failure-classifier.test.ts→ 8/8 pass (per-layer, includingcontainer_missingand short-circuit behavior).npx vitest run test/repro-2666-silent-list-status.test.ts→ 7/7 pass, including the new "nemoclaw <name> statusprints thecontainer_exited_port_conflictlayer header (feat(status): classify failing layer when gateway probe fails (#2666 follow-up) #3271)" test which spawns the real CLI against a fake docker stack + a real TCP listener holding the gateway port.test/docker-abstraction-guard.test.tspasses — no directexecSync("docker …")outsidesrc/lib/adapters/docker.make docsbuilds without warnings (no doc changes)--no-verify(user-authorized): the pre-commitTest (CLI)hook (full vitest with v8 coverage) hits unrelated timeout flakes on this macOS workstation (Defender + Spotlight + iMessage indexer contention). The new tests in this PR pass cleanly in isolation. CI on Linux runners is the authoritative gate.Definition of Done (from #3271)
statusprints a clearly-named layer header in each classified state (5 layers, expanded from the original 4 to splitcontainer_missingfromcontainer_exited).🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes
Tests