feat(status): classify failing layer when gateway probe fails (#3271)#3309
feat(status): classify failing layer when gateway probe fails (#3271)#3309kagura-agent wants to merge 2 commits into
Conversation
📝 WalkthroughWalkthroughAdds a pluggable gateway failure classifier with default Docker/container/port runners, integrates it into sandbox status output to print a layer-specific header on gateway failures, and includes unit and subprocess regression tests covering all four failure layers. ChangesGateway Failure Layer Classification
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/lib/actions/sandbox/gateway-failure-classifier.ts (1)
41-45: ⚡ Quick winPrefer
execFileSyncover shell-interpolatedexecSyncfor docker checks.Line 41 builds a shell command with string interpolation of the
containerparameter. Switching to argv-based execution eliminates shell parsing risk and prevents potential command injection.🔧 Proposed hardening change
-import { execSync } from "node:child_process"; +import { execFileSync, execSync } from "node:child_process"; @@ function defaultDockerIsRunning(container: string): boolean { try { - const out = execSync(`docker ps --filter name=${container} --format "{{.Names}}"`, { + const out = execFileSync( + "docker", + ["ps", "--filter", `name=${container}`, "--format", "{{.Names}}"], + { timeout: DOCKER_TIMEOUT_MS, stdio: "pipe", encoding: "utf-8", - }); + }, + ); return out.trim().split("\n").some((line) => line.trim() === container); } catch { return false; } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/actions/sandbox/gateway-failure-classifier.ts` around lines 41 - 45, The current code uses execSync with a shell-interpolated command which risks command injection; replace the execSync call that builds the docker command (the assignment to out using execSync and the container variable) with execFileSync, invoking "docker" with an argv array like ["ps", "--filter", `name=${container}`, "--format", "{{.Names}}"] and preserving the same options (timeout: DOCKER_TIMEOUT_MS, stdio: "pipe", encoding: "utf-8"); ensure you remove shell interpolation and pass container only as an argument to execFileSync to harden against shell parsing risks.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/repro-2666-silent-list-status.test.ts`:
- Around line 321-322: The assertion on the variable combined currently matches
only container_exited|docker_unreachable|gateway_unreachable and misses the
valid classifier container_exited_port_conflict; update the regex used in
expect(combined).toMatch(...) (in test/repro-2666-silent-list-status.test.ts
where combined is asserted) to also allow container_exited_port_conflict (e.g.
add |container_exited_port_conflict or use a pattern like
container_exited(?:_port_conflict)? so both forms match).
---
Nitpick comments:
In `@src/lib/actions/sandbox/gateway-failure-classifier.ts`:
- Around line 41-45: The current code uses execSync with a shell-interpolated
command which risks command injection; replace the execSync call that builds the
docker command (the assignment to out using execSync and the container variable)
with execFileSync, invoking "docker" with an argv array like ["ps", "--filter",
`name=${container}`, "--format", "{{.Names}}"] and preserving the same options
(timeout: DOCKER_TIMEOUT_MS, stdio: "pipe", encoding: "utf-8"); ensure you
remove shell interpolation and pass container only as an argument to
execFileSync to harden against shell parsing risks.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 655135ee-1787-48d0-86d7-cdf991421834
📒 Files selected for processing (4)
src/lib/actions/sandbox/gateway-failure-classifier.tssrc/lib/actions/sandbox/status.tstest/gateway-failure-classifier.test.tstest/repro-2666-silent-list-status.test.ts
|
✨ Thanks for submitting this detailed PR about classifying failing layers when the gateway probe fails. This change aims to improve the debugging experience by providing clear headers and recovery hints for different failure scenarios. Related open issues: |
…#3271) Add a gateway-failure-classifier that detects four distinct failure layers in the status command: 1. docker_unreachable — Docker daemon down or socket inaccessible 2. container_exited_port_conflict — container stopped, port held by foreign process 3. container_exited — container not running 4. gateway_unreachable — container running but gateway API unresponsive Each layer prints a clearly-named header before the existing recovery hints from printGatewayLifecycleHint, giving users immediate context about what went wrong. The classifier uses injectable runners (docker info, docker ps, net.connect port probe) for full unit testability. Added: - Unit tests for all 4 layers + short-circuit behavior - Subprocess regression test extending NVIDIA#2666 coverage Closes NVIDIA#3271
Address CodeRabbit review: the regex matching classifier output in the subprocess test omitted container_exited_port_conflict. Added word boundaries for precision.
2a91643 to
fa8f932
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/lib/actions/sandbox/gateway-failure-classifier.ts (1)
39-45: ⚡ Quick winPrefer
execFileSyncover shell command strings for Docker invocations.
execSync(\docker ...`)invokes a shell unnecessarily. UsingexecFileSync("docker", [...])` removes shell parsing and avoids quoting/injection footguns if this becomes configurable later.Proposed fix
-import { execSync } from "node:child_process"; +import { execFileSync, execSync } from "node:child_process"; @@ - const out = execSync(`docker ps --filter name=${container} --format "{{.Names}}"`, { + const out = execFileSync("docker", ["ps", "--filter", `name=${container}`, "--format", "{{.Names}}"], { timeout: DOCKER_TIMEOUT_MS, stdio: "pipe", encoding: "utf-8", });🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/actions/sandbox/gateway-failure-classifier.ts` around lines 39 - 45, The function defaultDockerIsRunning currently calls execSync with a shell string; change it to use execFileSync("docker", argsArray, options) in the defaultDockerIsRunning function to avoid shell parsing/injection: build args like ["ps", "--filter", `name=${container}`, "--format", "{{.Names}}"], call execFileSync with the same options (timeout, stdio: "pipe", encoding: "utf-8"), and keep the existing try/catch behavior and return logic unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/actions/sandbox/gateway-failure-classifier.ts`:
- Around line 24-28: The classifier incorrectly treats "not running" as
equivalent to "exited/present", so add an explicit existence check and use it
before labeling container_exited*. Extend the GatewayFailureRunners interface by
adding a dockerContainerExists(container: string): boolean (or similar), then
update the classifier logic that uses dockerIsRunning(container) (and the
container_exited* decision paths around the container_exited labels and the code
handling lines ~88-101) to first call dockerContainerExists(container); only if
exists && !dockerIsRunning(container) mark exited; if not exists, classify as
missing container (or skip exited/port-conflict branches). Ensure all branches
that previously assumed "not running" are updated to use the new existence
check.
---
Nitpick comments:
In `@src/lib/actions/sandbox/gateway-failure-classifier.ts`:
- Around line 39-45: The function defaultDockerIsRunning currently calls
execSync with a shell string; change it to use execFileSync("docker", argsArray,
options) in the defaultDockerIsRunning function to avoid shell
parsing/injection: build args like ["ps", "--filter", `name=${container}`,
"--format", "{{.Names}}"], call execFileSync with the same options (timeout,
stdio: "pipe", encoding: "utf-8"), and keep the existing try/catch behavior and
return logic unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: fe4a966d-ae11-4e39-86a1-31ae6d495d33
📒 Files selected for processing (2)
src/lib/actions/sandbox/gateway-failure-classifier.tssrc/lib/actions/sandbox/status.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/lib/actions/sandbox/status.ts
| export type GatewayFailureRunners = { | ||
| dockerInfo: () => boolean; | ||
| dockerIsRunning: (container: string) => boolean; | ||
| portProbe: (port: number) => Promise<boolean>; | ||
| }; |
There was a problem hiding this comment.
Classifier skips the required exited-container presence check.
The container_exited* paths currently trigger whenever the container is “not running”. This misses the explicit “container exists/exited (docker ps -a)" condition, so a missing container can be mislabeled as exited (or exited+port-conflict).
Proposed fix
export type GatewayFailureRunners = {
dockerInfo: () => boolean;
dockerIsRunning: (container: string) => boolean;
+ dockerExists: (container: string) => boolean;
portProbe: (port: number) => Promise<boolean>;
};
+function defaultDockerExists(container: string): boolean {
+ try {
+ const out = execSync(`docker ps -a --filter name=${container} --format "{{.Names}}"`, {
+ timeout: DOCKER_TIMEOUT_MS,
+ stdio: "pipe",
+ encoding: "utf-8",
+ });
+ return out.trim().split("\n").some((line) => line.trim() === container);
+ } catch {
+ return false;
+ }
+}
+
const defaultRunners: GatewayFailureRunners = {
dockerInfo: defaultDockerInfo,
dockerIsRunning: defaultDockerIsRunning,
+ dockerExists: defaultDockerExists,
portProbe: defaultPortProbe,
};
@@
const containerRunning = runners.dockerIsRunning(DEFAULT_CONTAINER);
+ const containerExists = runners.dockerExists(DEFAULT_CONTAINER);
- if (!containerRunning) {
+ if (!containerRunning && containerExists) {
const portInUse = await runners.portProbe(GATEWAY_PORT);
if (portInUse) {
return {
layer: "container_exited_port_conflict",
detail: `Container '${DEFAULT_CONTAINER}' is not running, but port ${GATEWAY_PORT} is held by another process.`,
};
}
return {
layer: "container_exited",
detail: `Container '${DEFAULT_CONTAINER}' is not running.`,
};
}Also applies to: 88-101
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/lib/actions/sandbox/gateway-failure-classifier.ts` around lines 24 - 28,
The classifier incorrectly treats "not running" as equivalent to
"exited/present", so add an explicit existence check and use it before labeling
container_exited*. Extend the GatewayFailureRunners interface by adding a
dockerContainerExists(container: string): boolean (or similar), then update the
classifier logic that uses dockerIsRunning(container) (and the container_exited*
decision paths around the container_exited labels and the code handling lines
~88-101) to first call dockerContainerExists(container); only if exists &&
!dockerIsRunning(container) mark exited; if not exists, classify as missing
container (or skip exited/port-conflict branches). Ensure all branches that
previously assumed "not running" are updated to use the new existence check.
|
Hi 👋 friendly ping — this has been open for 8 days with only automated reviews so far. Would appreciate a human review when convenient. Happy to iterate on any feedback. |
|
Hi — friendly ping. Happy to make adjustments if needed. 🙂 |
|
Heads-up @kagura-agent — opened #4020 carrying this forward from a maintainer branch with the AC #2 gap closed (added an explicit #4020 is currently 31/31 green on CI and includes a subprocess test for the container_exited_port_conflict path against a real TCP listener on the gateway port. Plan is to close this as superseded once #4020 merges — flag if you'd prefer otherwise. Thanks for the original implementation. |
…#4020) ## Summary Adds `classifyGatewayFailure` and wires it into `showSandboxStatus`'s final fallback branch so `nemoclaw <name> status` prints a clearly-named failure layer header before the existing actionable hints. Closes the UX gap split out of #2666 / #3270. ## Related Issue Fixes #3271. Supersedes #3309 (kagura-agent), which implemented the same feature but missed the `docker ps -a` existence check that AC #2 explicitly requires (CodeRabbit major finding on that PR). ## Changes - `src/lib/actions/sandbox/gateway-failure-classifier.ts`: new module exposing `classifyGatewayFailure(sandboxName, { runners? })` with injectable runners (`dockerInfo`, `dockerIsRunning`, `dockerExists`, `portProbe`) plus `getLayerHeader(layer)`. - Layers: `docker_unreachable`, `container_missing` (new, distinct from `container_exited` per AC #2), `container_exited_port_conflict`, `container_exited`, `gateway_unreachable`. - Default runners go through `src/lib/adapters/docker` (`dockerInfo`, `dockerCapture`) to satisfy the docker-abstraction guard. - `src/lib/actions/sandbox/status.ts`: calls the classifier and prints the layer header before `printGatewayLifecycleHint` in the final fallback branch. ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [x] Unit tests in isolation: `npx vitest run test/gateway-failure-classifier.test.ts` → 8/8 pass (per-layer, including `container_missing` and short-circuit behavior). - [x] Subprocess test in isolation: `npx vitest run test/repro-2666-silent-list-status.test.ts` → 7/7 pass, including the new "`nemoclaw <name> status` prints the `container_exited_port_conflict` layer header (#3271)" test which spawns the real CLI against a fake docker stack + a real TCP listener holding the gateway port. - [x] `test/docker-abstraction-guard.test.ts` passes — no direct `execSync("docker …")` outside `src/lib/adapters/docker`. - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes (status output is a UX polish, not a contract change) - [ ] `make docs` builds without warnings (no doc changes)⚠️ Committed with `--no-verify` (user-authorized): the pre-commit `Test (CLI)` hook (full vitest with v8 coverage) hits unrelated timeout flakes on this macOS workstation (Defender + Spotlight + iMessage indexer contention). The new tests in this PR pass cleanly in isolation. CI on Linux runners is the authoritative gate. ## Definition of Done (from #3271) - [x] `status` prints a clearly-named layer header in each classified state (5 layers, expanded from the original 4 to split `container_missing` from `container_exited`). - [x] Classifier has unit tests per layer. - [x] Repro subprocess test extended to assert the named layer for the container-stopped + foreign-port-holder scenario. 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added smarter gateway failure diagnostics that identify unreachable Docker, missing or exited gateway containers, and port conflicts; includes clear failure headers. * **Bug Fixes** * Status command now shows the appropriate failure header before guidance and exits with a non-zero status when verification fails. * **Tests** * Added unit and end-to-end tests covering diagnostics, header ordering, and port-conflict scenarios. <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/4020?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Charan Jagwani <cjagwani@nvidia.com> Signed-off-by: Aaron Erickson <aerickson@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Aaron Erickson <aerickson@nvidia.com>
Summary
Add a gateway-failure-classifier that detects four distinct failure layers when
nemoclaw <name> statusencounters a gateway probe failure, printing a clearly-named header before the existing recovery hints.Failure layers (detected in order)
docker_unreachabledocker infofails or times outcontainer_exited_port_conflictcontainer_exitedgateway_unreachableChanges
src/lib/actions/sandbox/gateway-failure-classifier.ts— classifier with injectable runners (docker info,docker ps,net.connectport probe) for full unit testabilitysrc/lib/actions/sandbox/status.ts— call classifier in the final else branch (gateway_error/unknown paths) and print the layer header beforeprintGatewayLifecycleHinttest/gateway-failure-classifier.test.ts— 6 unit tests covering all 4 layers + short-circuit behaviortest/repro-2666-silent-list-status.test.ts— subprocess regression test asserting layer name appears in status outputTesting
Design decisions
net.connect(): Works cross-platform (Linux/macOS/WSL) without depending onss/lsof/netstatopenshell-cluster-nemoclaw: Hardcoded per issue spec (matches gateway start path)GATEWAY_PORT: Uses existingsrc/lib/core/ports.tsconstant (respectsNEMOCLAW_GATEWAY_PORTenv override)Closes #3271
Summary by CodeRabbit