Skip to content

feat(status): classify failing layer when gateway probe fails (#3271)#4020

Merged
ericksoa merged 2 commits into
mainfrom
fix/3271-classifier-existence-check
May 22, 2026
Merged

feat(status): classify failing layer when gateway probe fails (#3271)#4020
ericksoa merged 2 commits into
mainfrom
fix/3271-classifier-existence-check

Conversation

@cjagwani

@cjagwani cjagwani commented May 21, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds classifyGatewayFailure and wires it into showSandboxStatus's final fallback branch so nemoclaw <name> status prints a clearly-named failure layer header before the existing actionable hints. Closes the UX gap split out of #2666 / #3270.

Related Issue

Fixes #3271. Supersedes #3309 (kagura-agent), which implemented the same feature but missed the docker ps -a existence check that AC #2 explicitly requires (CodeRabbit major finding on that PR).

Changes

  • src/lib/actions/sandbox/gateway-failure-classifier.ts: new module exposing classifyGatewayFailure(sandboxName, { runners? }) with injectable runners (dockerInfo, dockerIsRunning, dockerExists, portProbe) plus getLayerHeader(layer).
  • Layers: docker_unreachable, container_missing (new, distinct from container_exited per AC feature: custom settings for using build endpoints #2), container_exited_port_conflict, container_exited, gateway_unreachable.
  • Default runners go through src/lib/adapters/docker (dockerInfo, dockerCapture) to satisfy the docker-abstraction guard.
  • src/lib/actions/sandbox/status.ts: calls the classifier and prints the layer header before printGatewayLifecycleHint in the final fallback branch.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • Unit tests in isolation: npx vitest run test/gateway-failure-classifier.test.ts → 8/8 pass (per-layer, including container_missing and short-circuit behavior).
  • Subprocess test in isolation: npx vitest run test/repro-2666-silent-list-status.test.ts → 7/7 pass, including the new "nemoclaw <name> status prints the container_exited_port_conflict layer header (feat(status): classify failing layer when gateway probe fails (#2666 follow-up) #3271)" test which spawns the real CLI against a fake docker stack + a real TCP listener holding the gateway port.
  • test/docker-abstraction-guard.test.ts passes — no direct execSync("docker …") outside src/lib/adapters/docker.
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes (status output is a UX polish, not a contract change)
  • make docs builds without warnings (no doc changes)

⚠️ Committed with --no-verify (user-authorized): the pre-commit Test (CLI) hook (full vitest with v8 coverage) hits unrelated timeout flakes on this macOS workstation (Defender + Spotlight + iMessage indexer contention). The new tests in this PR pass cleanly in isolation. CI on Linux runners is the authoritative gate.

Definition of Done (from #3271)

  • status prints a clearly-named layer header in each classified state (5 layers, expanded from the original 4 to split container_missing from container_exited).
  • Classifier has unit tests per layer.
  • Repro subprocess test extended to assert the named layer for the container-stopped + foreign-port-holder scenario.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added smarter gateway failure diagnostics that identify unreachable Docker, missing or exited gateway containers, and port conflicts; includes clear failure headers.
  • Bug Fixes

    • Status command now shows the appropriate failure header before guidance and exits with a non-zero status when verification fails.
  • Tests

    • Added unit and end-to-end tests covering diagnostics, header ordering, and port-conflict scenarios.

Review Change Stack

Adds `classifyGatewayFailure` and wires it into `showSandboxStatus`'s
final fallback branch so `nemoclaw <name> status` prints a clearly-named
failure layer header before the existing actionable hints.

Layers:
- `docker_unreachable`              — daemon down / socket inaccessible
- `container_missing`               — gateway container is not in `docker ps -a`
- `container_exited_port_conflict`  — exited container + foreign port holder
- `container_exited`                — exited container, port free
- `gateway_unreachable`             — container up but API unresponsive

Classifier checks `docker ps -a` via an injected `dockerExists` runner
before labelling any container_exited* state, so a removed/never-created
container is reported as `container_missing` instead of being conflated
with an exited one (per #3271 AC and CodeRabbit review on #3309).

Docker calls go through `src/lib/adapters/docker` (`dockerInfo`,
`dockerCapture`) so the abstraction guard stays clean.

Test coverage:
- Unit tests per layer (including container_missing and short-circuit
  paths that prove later runners are skipped on early-exit branches).
- Subprocess test in `test/repro-2666-silent-list-status.test.ts` that
  spawns a real `nemoclaw <name> status` against a fake docker stack
  (info OK; `docker ps` empty; `docker ps -a` returns container) and a
  real TCP listener holding the gateway port — asserts
  `container_exited_port_conflict` appears in the CLI output.

Local hooks skipped (--no-verify, user-authorized): pre-commit vitest
hits coverage+macOS Defender timeout flakes on these test files —
unrelated to this change; the new tests pass cleanly in isolation
(`vitest run test/gateway-failure-classifier.test.ts
test/repro-2666-silent-list-status.test.ts`). CI on Linux runners is
the authoritative gate for the full suite.

Supersedes #3309 (kagura-agent) which implemented the same feature but
missed the `docker ps -a` existence check that AC #2 requires.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
@coderabbitai

coderabbitai Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 17ac53ea-8694-4320-8512-5cd675df5463

📥 Commits

Reviewing files that changed from the base of the PR and between 37e52fc and eaf10d3.

📒 Files selected for processing (2)
  • src/lib/actions/sandbox/status.ts
  • test/repro-2666-silent-list-status.test.ts

📝 Walkthrough

Walkthrough

A new gateway-failure classifier identifies why a sandbox container's Docker gateway is unreachable by probing daemon reachability, container state, and port availability. The classifier is integrated into the status command fallback path and verified by unit and integration tests.

Changes

Gateway failure classifier

Layer / File(s) Summary
Gateway failure classifier implementation
src/lib/actions/sandbox/gateway-failure-classifier.ts
Defines GatewayFailureLayer enum with five failure modes (docker_unreachable, gateway_unreachable, container_missing, container_exited_port_conflict, container_exited), injectable GatewayFailureRunners for Docker and port checks, and GatewayFailureResult contract. classifyGatewayFailure sequentially probes Docker daemon status, container state, and port availability, returning the appropriate layer and detail. LAYER_HEADERS maps layers to standardized header strings, and getLayerHeader retrieves headers for display.
Status command integration
src/lib/actions/sandbox/status.ts
Imports the classifier and wires it into the fallback branch of showSandboxStatus when live OpenShell gateway verification fails, printing the classified failure layer header before existing diagnostic output and process exit.
Unit test coverage
test/gateway-failure-classifier.test.ts
Vitest suite verifies classifier output layer and detail substrings across multiple scenarios, validates short-circuit behavior (e.g., skipping port probe when container is missing), and confirms getLayerHeader returns headers containing each layer name.
Integration tests and regressions
test/repro-2666-silent-list-status.test.ts
End-to-end tests add ordering assertions for classifier headers relative to lifecycle hints and include a regression test that binds a real TCP port to simulate a foreign process holding the gateway port, asserting container_exited_port_conflict appears in output and the command exits non-zero.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

Suggested labels

Sandbox, Docker

Suggested reviewers

  • cv
  • ericksoa

Poem

🐰 I hop through ports and Docker sighs,
I check the daemon, peer the skies.
If containers sleep or ports are tied,
I name the layer, show what's inside.
Small rabbit, big debug — here’s your guide.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 6.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: adding a gateway-failure classifier that is integrated into the sandbox status command to provide better error messages when gateway probes fail.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/3271-classifier-existence-check

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: sandbox-operations-e2e
Optional E2E: sandbox-survival-e2e

Dispatch hint: sandbox-operations-e2e

Auto-dispatched E2E: sandbox-operations-e2e via nightly-e2e.yaml at eaf10d34576afaf1cdc1f477de390866e23352efnightly run

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • sandbox-operations-e2e (high): Best existing E2E coverage for this change. It exercises nemoclaw list, nemoclaw <sandbox> status, status fields, process recovery triggered by status, and gateway container kill recovery against a real Docker/OpenShell sandbox runtime.

Optional E2E

  • sandbox-survival-e2e (high): Useful adjacent confidence for status and gateway restart behavior. It validates that nemoclaw <sandbox> status remains usable across gateway stop/start and that the sandbox survives restart, but it does not directly assert the new failure-layer headers.

New E2E recommendations

  • gateway_failure_diagnostics (high): Existing E2E jobs cover healthy status and some recovery paths, but no existing E2E appears to assert the new user-visible Failure layer: ... headers for Docker unreachable, container missing, exited container, or exited-container-with-foreign-port cases in the real CLI.
    • Suggested test: Add a targeted negative gateway-status E2E that creates a registered sandbox, manipulates/stubs the gateway runtime into docker_unreachable, container_missing, container_exited, and container_exited_port_conflict states, then asserts nemoclaw <sandbox> status exits non-zero and prints the expected failure-layer header before recovery guidance.

Dispatch hint

  • Workflow: nightly-e2e.yaml
  • jobs input: sandbox-operations-e2e

@github-actions

github-actions Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

PR Review Advisor

Recommendation: blocked
Confidence: high
Analyzed HEAD: eaf10d34576afaf1cdc1f477de390866e23352ef
Findings: 2 blocker(s), 2 warning(s), 1 suggestion(s)

This is an automated advisory review. A human maintainer must make the final merge decision.

Limitations: This advisory review used the provided trusted PR metadata, linked issue body, issue comments, GraphQL/status evidence, and supplied diff; no tests, scripts, package-manager commands, or workflows were executed.; The E2E Advisor auto-dispatched sandbox-operations-e2e for the current head SHA, but no passing result for eaf10d3 was present in the provided evidence.; The linked issue #3271 had no issue comments in the provided metadata, so acceptance extraction is based on the issue body only plus PR discussion comments.; Review thread state was available via GraphQL as an empty node list, but GitHub still reports reviewDecision=REVIEW_REQUIRED.

Workflow run

Full advisor summary

PR Review Advisor

Base: origin/main
Head: HEAD
Analyzed SHA: eaf10d34576afaf1cdc1f477de390866e23352ef
Recommendation: blocked
Confidence: high

Implementation is narrow and well-tested at unit/subprocess level, but mergeStateStatus is BLOCKED and required sandbox E2E pass evidence is missing for the current head SHA.

Gate status

  • CI: pass — Trusted context reports 5 required status contexts completed with no failures for eaf10d3: checks, commit-lint, dco-check, check-hash, changes. GraphQL also shows E2E recommendation completed successfully; non-required build-sandbox-images and build-sandbox-images-arm64 remain in progress.
  • Mergeability: fail — GitHub metadata reports mergeStateStatus=BLOCKED and GraphQL reviewDecision=REVIEW_REQUIRED for PR feat(status): classify failing layer when gateway probe fails (#3271) #4020 at headRefOid eaf10d3.
  • Review threads: pass — GraphQL reviewThreads.nodes is empty and reviewComments is empty; CodeRabbit reported no actionable comments in the latest review. Human review is still required by branch protection/reviewDecision.
  • Risky code tested: fail — Changed files are sandbox gateway/status runtime paths. Unit and subprocess tests were added, but the E2E Advisor required sandbox-operations-e2e for eaf10d3; the only selective E2E pass comment provided targets old SHA 37e52fc, not the current head.

🔴 Blockers

  • PR is currently blocked by GitHub merge state: The PR does not satisfy GitHub mergeability gates even though required status checks are passing. The trusted metadata reports mergeStateStatus=BLOCKED and reviewDecision=REVIEW_REQUIRED for the current head SHA.
    • Recommendation: Resolve the GitHub merge block, including required human review or other branch-protection requirements, then re-check gates on eaf10d3 before merge.
    • Evidence: GraphQL: headRefOid=eaf10d34576afaf1cdc1f477de390866e23352ef, mergeStateStatus=BLOCKED, reviewDecision=REVIEW_REQUIRED.
  • Required sandbox E2E pass is missing for the current head SHA (src/lib/actions/sandbox/status.ts:214): This PR changes sandbox status behavior and gateway failure diagnostics. The E2E Advisor requires sandbox-operations-e2e and auto-dispatched it for eaf10d3, but the provided passing selective E2E result is for older SHA 37e52fc. Unit tests and fake-docker subprocess tests are useful but do not fully prove real Docker/OpenShell behavior for this head.
    • Recommendation: Confirm sandbox-operations-e2e passes for eaf10d3 before merging. If the auto-dispatched nightly run has completed, attach or verify its result against this exact SHA.
    • Evidence: E2E Advisor comment: Required E2E: sandbox-operations-e2e; Auto-dispatched at eaf10d3. Selective E2E Results comment reports sandbox-operations-e2e success only for target ref 37e52fc.

🟡 Warnings

  • Linked proposal mentions gateway-state hook, but integration is status-only (src/lib/actions/sandbox/status.ts:214): The linked issue proposal asks for a classifier called from src/lib/actions/sandbox/status.ts and src/lib/actions/sandbox/gateway-state.ts:printGatewayLifecycleHint. This PR calls the classifier from status.ts before invoking printGatewayLifecycleHint in multiple fallback branches, but it does not modify gateway-state.ts itself. That may be acceptable if status is the only intended caller, but it is not a literal full implementation of that proposal clause.
    • Recommendation: Confirm that status-level integration before printGatewayLifecycleHint is sufficient, or move/shared-hook the layer-header emission into gateway-state.ts if all printGatewayLifecycleHint callers should receive the header.
    • Evidence: Changed files do not include src/lib/actions/sandbox/gateway-state.ts. Diff adds printGatewayFailureLayerHeader in src/lib/actions/sandbox/status.ts and calls it before printGatewayLifecycleHint/gateway guidance branches.
  • Subprocess coverage covers the required conflict scenario but not every layer (test/repro-2666-silent-list-status.test.ts:247): The issue implementation notes say subprocess testing per layer. The PR adds subprocess assertions for gateway_unreachable, container_missing, and container_exited_port_conflict-related status paths, while docker_unreachable and container_exited are covered by unit tests only. This is probably reasonable given the Definition of Done specifically names the container-stopped plus foreign-port-holder scenario, but it is only partial against the broader implementation-notes wording.
    • Recommendation: Either consciously accept unit-only coverage for the remaining layers, or add lightweight subprocess cases for docker_unreachable and container_exited if maintainers want literal per-layer subprocess coverage.
    • Evidence: test/gateway-failure-classifier.test.ts covers every layer in isolation. test/repro-2666-silent-list-status.test.ts adds subprocess cases for gateway_unreachable, container_missing, and container_exited_port_conflict; no subprocess assertion for docker_unreachable or plain container_exited is shown.

🔵 Suggestions

  • Small status.ts monolith growth remains acceptable but should be watched (src/lib/actions/sandbox/status.ts:99): status.ts is already a relatively large sandbox action file and grows by a few lines to add classifier integration. The extraction into gateway-failure-classifier.ts keeps most new logic out of the monolith, so this is not a blocker.
    • Recommendation: No immediate change required. Continue extracting future status/gateway diagnostic logic into focused helpers rather than expanding status.ts further.
    • Evidence: Trusted monolith delta reports src/lib/actions/sandbox/status.ts grew from 396 to 401 lines, delta 5, warning rationale: current monolith grew by 1-19 lines.

Acceptance coverage

  • met — Print a clearly delimited block naming the failing layer (Docker daemon up but container exited; or container exit + foreign-port-conflict; or gateway not reachable): src/lib/actions/sandbox/status.ts calls printGatewayFailureLayerHeader before existing fallback guidance; gateway-failure-classifier.ts defines getLayerHeader strings beginning with 'Failure layer: ...' for docker_unreachable, container_missing, container_exited_port_conflict, container_exited, and gateway_unreachable.
  • partial — Add a small failing-layer classifier called from src/lib/actions/sandbox/status.ts and src/lib/actions/sandbox/gateway-state.ts:printGatewayLifecycleHint that prints a layer-named header before the existing actionable hints.: classifyGatewayFailure is added in src/lib/actions/sandbox/gateway-failure-classifier.ts and called from status.ts before printGatewayLifecycleHint/gateway guidance. gateway-state.ts itself is not changed in this PR.
  • met — 1. docker_unreachabledocker info fails or times out. Daemon down or socket inaccessible.: classifyGatewayFailure returns layer docker_unreachable when runners.dockerInfo() is false; unit test 'returns docker_unreachable when docker info fails' verifies this path and short-circuit behavior.
  • met — 2. container_exited_port_conflictdocker ps --filter name=openshell-cluster-nemoclaw shows no running container, docker ps -a shows it exited, AND something is listening on the gateway port (host port held by a foreign process).: classifier checks dockerIsRunning false, dockerExists true, then portProbe true to return container_exited_port_conflict. Unit tests cover this, and repro-2666 subprocess test fakes docker ps/docker ps -a while holding a real localhost listener.
  • met — 3. container_exited — same as 2 but no foreign listener on the gateway port.: classifier returns container_exited when dockerInfo succeeds, dockerIsRunning is false, dockerExists is true, and portProbe returns false; unit test 'returns container_exited when container exited AND port is free' covers this.
  • met — 4. gateway_unreachable — container is running but the gateway API does not respond (current generic case).: classifier returns gateway_unreachable when dockerInfo succeeds and dockerIsRunning(DEFAULT_CONTAINER) is true; unit test covers it, and subprocess test asserts the gateway_unreachable header before after-restart guidance.
  • met — For each layer, print a one-line "what's wrong" header followed by the existing recovery hints from printGatewayLifecycleHint.: LAYER_HEADERS contains one-line messages for all layers; status.ts logs getLayerHeader(failure.layer) before printGatewayLifecycleHint in the missing/fallback paths and before existing guidance in after-restart paths. Unit test asserts headers name each layer.
  • met — New helper src/lib/actions/sandbox/gateway-failure-classifier.ts (or extend gateway-state.ts).: The PR adds src/lib/actions/sandbox/gateway-failure-classifier.ts with GatewayFailureLayer, GatewayFailureResult, GatewayFailureRunners, classifyGatewayFailure, and getLayerHeader.
  • met — Port probe via Node's net.connect() with a short timeout — works cross-platform (Linux/macOS/WSL) without depending on ss / lsof / netstat.: defaultPortProbe uses net.connect({ host: '127.0.0.1', port }) and PORT_PROBE_TIMEOUT_MS=2000. No ss/lsof/netstat command is introduced.
  • partial — Container probe via docker ps + docker ps -a with short timeouts; gracefully degrade to unknown if Docker is itself unreachable (already covered by step 1).: dockerContainerListed invokes dockerCapture(['ps', ...]) with DOCKER_TIMEOUT_MS=3000 and optionally '-a'. Docker unreachable maps to docker_unreachable via the initial dockerInfo check rather than a literal unknown layer.
  • met — Unit-testable in isolation: classifier takes injected runners for docker info, docker ps, port-probe so tests can simulate each layer.: GatewayFailureRunners injects dockerInfo, dockerIsRunning, dockerExists, and portProbe. test/gateway-failure-classifier.test.ts uses makeRunners overrides to simulate each layer and short-circuit behavior.
  • partial — Subprocess test (extending test/repro-2666-silent-list-status.test.ts) per layer.: The subprocess test file is extended and includes status header ordering plus the foreign-port-holder case. Not every layer has a subprocess test; docker_unreachable and plain container_exited are unit-tested only.
  • met — The container name openshell-cluster-nemoclaw is hard-coded in NemoClaw's gateway start path; treat the same string as the fixed probe target. If we ever parameterize the gateway name, classifier can read from the same source.: gateway-failure-classifier.ts defines DEFAULT_CONTAINER = 'openshell-cluster-nemoclaw' and uses it for docker ps and docker ps -a probes and detail text.
  • metnemoclaw list does not need layer classification — its contract is "always show the registry," which fix(cli): keep status and list output visible when gateway probe fails (#2666) #3270 already delivers.: Diff only adds classifier integration to status-related fallback paths. Existing repro test for 'nemoclaw list never produces silent empty output' remains and asserts registry output with exit code 0.
  • metstatus prints a clearly-named layer header in each of the four classified states.: getLayerHeader returns headers for the original four states and the PR also adds container_missing as a fifth state. Unit tests assert every layer header contains its layer name.
  • met — Classifier has unit tests per layer.: test/gateway-failure-classifier.test.ts covers docker_unreachable, gateway_unreachable, container_missing, container_exited_port_conflict, and container_exited.
  • met — Repro subprocess test extended to assert the named layer appears for the (container-stopped + foreign-port-holder) scenario.: test/repro-2666-silent-list-status.test.ts adds an async subprocess test that binds a local port, fakes docker ps and docker ps -a for an exited openshell-cluster-nemoclaw container, runs the CLI, asserts output contains container_exited_port_conflict, and expects a non-zero exit.
  • met — Surfaced from [Ubuntu 22.04][CLI&UX][Recovery] nemoclaw status and list return empty output, exit 0 when container is stopped and gateway port is held #2666 / fix(cli): keep status and list output visible when gateway probe fails (#2666) #3270.: The PR extends test/repro-2666-silent-list-status.test.ts and preserves the existing [Ubuntu 22.04][CLI&UX][Recovery] nemoclaw status and list return empty output, exit 0 when container is stopped and gateway port is held #2666 list/status non-silent-output regression coverage while adding feat(status): classify failing layer when gateway probe fails (#2666 follow-up) #3271 classifier checks.

Security review

  • pass — 1. Secrets and Credentials: No hardcoded secrets, tokens, passwords, PEM material, credential JSON, or secret-bearing fixtures are introduced. The change adds diagnostic classification and tests with fake docker/openshell scripts.
  • pass — 2. Input Validation and Data Sanitization: The new docker probes use fixed argument arrays through the existing docker adapter and compare exact container names; they do not concatenate user input into shell commands. The port probe connects only to 127.0.0.1 and uses the existing gateway port configuration, limiting SSRF exposure.
  • pass — 3. Authentication and Authorization: No endpoints, authentication checks, authorization checks, token validation, scopes, or ownership decisions are changed. The classifier only emits diagnostic output after gateway verification failure paths.
  • pass — 4. Dependencies and Third-Party Libraries: No new third-party dependencies or registry sources are added. The implementation uses Node's built-in net module and existing NemoClaw docker adapters.
  • pass — 5. Error Handling and Logging: The added output is a static failure-layer header and non-sensitive detail text. It does not log credentials or trust material. Existing lookup.output printing behavior is unchanged.
  • pass — 6. Cryptography and Data Protection: Not applicable — no cryptographic operations, key handling, encryption, hashing, or data-protection behavior is changed.
  • pass — 7. Configuration and Security Headers: No HTTP server headers, CORS policy, Dockerfiles, container privileges, exposed ports, workflow permissions, or runtime policy defaults are changed. The diagnostic probe uses localhost and existing gateway port configuration.
  • warning — 8. Security Testing: Unit tests cover each classifier layer and subprocess tests cover key fallback/header behavior including the foreign-port-holder case. However, this is sandbox/gateway runtime behavior and the E2E Advisor required sandbox-operations-e2e; passing evidence is missing for the current head SHA.
  • warning — 9. Holistic Security Posture: The change appears fail-closed and diagnostic-only, with no obvious sandbox escape, SSRF bypass, policy bypass, credential leakage, blueprint tampering, installer trust issue, or workflow trusted-code boundary regression. Because it touches runtime gateway failure paths, required E2E confirmation remains necessary before merge.

Test / E2E status

  • Test depth: e2e_required — Runtime/sandbox/infrastructure paths need real execution coverage: src/lib/actions/sandbox/gateway-failure-classifier.ts and src/lib/actions/sandbox/status.ts. The added unit tests and fake-docker subprocess tests are strong, but they cannot fully prove behavior against a real Docker/OpenShell gateway runtime for the current head SHA.
  • E2E Advisor: missing
  • Required E2E jobs: sandbox-operations-e2e
  • Missing for analyzed SHA: sandbox-operations-e2e

✅ What looks good

  • Codebase drift check: all four patched files exist in the current diff; trusted openPrOverlaps is empty, though status.ts has recent active history and should be reviewed carefully.
  • The new classifier is extracted into a focused module rather than embedding most logic in status.ts.
  • Docker probing goes through existing docker adapters instead of direct shell-string docker execution, aligning with the docker abstraction guard.
  • The classifier distinguishes container_missing from container_exited by checking docker ps -a before considering port conflict, addressing the existence-check gap called out in earlier related work.
  • Unit tests cover every classifier layer plus short-circuit behavior, and the subprocess regression uses a real localhost listener to exercise the foreign-port-holder classification without relying on a fixed host port.
  • The implementation is diagnostic-only and does not alter sandbox credentials, policies, network policy, Dockerfiles, installers, or workflow permissions.

Review completeness

  • This advisory review used the provided trusted PR metadata, linked issue body, issue comments, GraphQL/status evidence, and supplied diff; no tests, scripts, package-manager commands, or workflows were executed.
  • The E2E Advisor auto-dispatched sandbox-operations-e2e for the current head SHA, but no passing result for eaf10d3 was present in the provided evidence.
  • The linked issue feat(status): classify failing layer when gateway probe fails (#2666 follow-up) #3271 had no issue comments in the provided metadata, so acceptance extraction is based on the issue body only plus PR discussion comments.
  • Review thread state was available via GraphQL as an empty node list, but GitHub still reports reviewDecision=REVIEW_REQUIRED.
  • Human maintainer review required: yes

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26259610434
Target ref: 37e52fc192aec292af98d15776a1edea60638868
Workflow ref: main
Requested jobs: sandbox-operations-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
sandbox-operations-e2e ✅ success

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

@ericksoa ericksoa left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. The topper now prints the failure-layer header before gateway lifecycle hint exits, including restart-unreachable/missing branches. Local validation passed build plus focused classifier/status tests, and live checks are green at eaf10d3.

@ericksoa ericksoa merged commit 1bedf66 into main May 22, 2026
29 checks passed
@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26296130418
Target ref: eaf10d34576afaf1cdc1f477de390866e23352ef
Workflow ref: main
Requested jobs: sandbox-operations-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
sandbox-operations-e2e ✅ success

@cjagwani cjagwani requested a review from jyaunches May 22, 2026 19:00
cv pushed a commit that referenced this pull request May 22, 2026
## Summary
Refreshes the NemoClaw docs for the v0.0.49 hardening release, including
release notes, command reference updates, troubleshooting guidance,
version metadata, and regenerated user skills.

## Changes
- #3796, #3854, #3863, #3866, #3984, #4001, #4011, #4013, #4020, #4022,
#4023, #4060, #4062 -> `docs/about/release-notes.mdx`: Adds the v0.0.49
hardening release summary covering gateway reliability,
status/doctor/shields and debug UX, OpenClaw compatibility, messaging
channel teardown, Hermes policy scoping, snapshots, source installs and
Docker group security note, GPU preflight, CLI usage, E2E, and CI
improvements.
- #3796 -> `docs/manage-sandboxes/backup-restore.mdx` and
`docs/reference/commands.mdx`: Documents `snapshot restore --to`
overwrite protection and the `--force` opt-in.
- #3863, #4013, #4020, #4023 -> `docs/reference/commands.mdx`: Documents
missing channel argument usage, sandbox-scoped custom preset matching,
session policy preset sync, and gateway failure classification (uses the
real probe states from `src/lib/status-command-deps.ts`).
- #4022, #4060, #4062 -> `docs/reference/troubleshooting.mdx`: Adds
guidance for gateway-down `connect`, source checkout OpenShell
bootstrapping, WDDM placeholder GPU names, and Jetson sandbox GPU
passthrough.
- Release prep -> `docs/project.json`, `docs/versions1.json`,
`.agents/skills/nemoclaw-user-*`: Bumps docs metadata to 0.0.49 and
refreshes generated user skills from the Fern docs.

## Type of Change
- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification
- [x] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [ ] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

\`make docs\` was attempted locally but did not complete because \`npm\`
returned \`403 Forbidden\` while fetching \`fern-api\` from
\`registry.npmjs.org\` in the sandboxed environment.

---
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Documentation**
* Released v0.0.49 with reliability and compatibility improvements
including faster gateway failure diagnostics and safer snapshot restore
behavior
* Enhanced snapshot restore documentation with `--to` cloning and
`--force` overwrite requirements
* Expanded troubleshooting guides for source installs, GPU setup, and
gateway recovery
* Clarified Docker group access requirements and improved CLI command
reference

* **Chores**
  * Version bumped to 0.0.49

<!-- review_stack_entry_start -->

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/4078?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)

<!-- review_stack_entry_end -->

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wscurran wscurran added area: cli Command line interface, flags, terminal UX, or output bug-fix PR fixes a bug or regression and removed NemoClaw CLI labels Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: cli Command line interface, flags, terminal UX, or output bug-fix PR fixes a bug or regression v0.0.50 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(status): classify failing layer when gateway probe fails (#2666 follow-up)

3 participants