Skip to content

fix(host-aliases): probe legacy gateway container before kubectl exec#4772

Merged
cv merged 3 commits into
mainfrom
fix/4317-probe-legacy-gateway-container
Jun 4, 2026
Merged

fix(host-aliases): probe legacy gateway container before kubectl exec#4772
cv merged 3 commits into
mainfrom
fix/4317-probe-legacy-gateway-container

Conversation

@laitingsheng

@laitingsheng laitingsheng commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Summary

Host alias commands depend on the legacy openshell-cluster-nemoclaw k3s gateway container, but the existing driver-based gate at src/lib/actions/sandbox/host-aliases.ts:83 only fires when the registry records openshellDriver as docker or vm. Sandboxes onboarded by older NemoClaw releases (registry pre-dates the field) and kubernetes-driver sandboxes whose gateway never came up both slip past the gate and fall through to docker exec openshell-cluster-nemoclaw kubectl ..., surfacing an opaque Error response from daemon: No such container: openshell-cluster-nemoclaw to the user. Add a runtime probe that runs docker ps --format '{{.Names}}' and exact-matches the legacy container, with a discriminated present | absent | unknown state so docker-down / timeout / spawn failures get a distinct error.

Related Issue

Fixes #4317.

Changes

  • src/lib/actions/sandbox/host-aliases.ts: add probeLegacyGatewayContainer() (unfiltered docker ps --format '{{.Names}}' via dockerSpawnSync, exact post-match on container name) returning present | absent | unknown. assertLegacyGatewayHostAliasSupport consumes the probe after the registry-driver fast path: absent raises the host-aliases-not-supported message; unknown raises a docker-probe-failed message that points at docker info. The driver branch keeps its short-circuit so no docker call runs when the registry already records a direct-container driver.
  • src/lib/actions/sandbox/host-aliases.ts: extract validateAddOptions / validateRemoveOptions and call them before the gateway probe in addSandboxHostAlias / removeSandboxHostAlias, so invalid hostname / IP / missing-arg errors are never masked by a missing legacy gateway.
  • test/cli.test.ts: writeHostAliasDockerStub now answers docker ps for the legacy gateway name (default running, opt-out via gatewayRunning: false); the inline stubs for hosts-add / hosts-list / retry gain the same ps branch. The three log[0] === "exec" regression asserts switch to a kubectl-index sequence check that survives the new leading probe and additionally pins the unfiltered probe argv shape (ps, --format, {{.Names}}, no --filter). New tests cover: legacy gateway missing on a pre-feature registry entry; argument validation running before the gateway probe; docker spawn ENOENT classified distinctly from missing gateway; docker probe timeout classified distinctly; docker probe non-zero exit classified distinctly.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • npm run docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai tinsonl@nvidia.com

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Adds a bounded Docker probe that classifies legacy gateway container presence (present/absent/unknown), centralizes host-alias add/remove validation to run before probing, and updates tests/stubs to assert probe ordering and probe-failure/absent behaviors.

Changes

Legacy Gateway Container Availability Check & Host-alias Validation

Layer / File(s) Summary
Gateway probe implementation
src/lib/actions/sandbox/host-aliases.ts
Import dockerSpawnSync, add HOST_ALIAS_DOCKER_PROBE_TIMEOUT_MS and LegacyGatewayProbe types; implement probeLegacyGatewayContainer() and extend assertLegacyGatewayHostAliasSupport() to handle present/absent/unknown probe outcomes.
Host-alias validation and callsite refactor
src/lib/actions/sandbox/host-aliases.ts
Add validateAddOptions() and validateRemoveOptions(); refactor addSandboxHostAlias() and removeSandboxHostAlias() to validate/normalize inputs before asserting legacy-gateway support.
Test stubs and failure-path coverage
test/cli.test.ts
Extend writeHostAliasDockerStub with gatewayRunning flag and a docker ps preflight; update fixtures and assertions for new docker argv ordering; add tests for legacy-gateway-absent, docker-probe-failure, and validation-before-probe scenarios.

Sequence Diagram

sequenceDiagram
  participant NemoclawCLI
  participant DockerDaemon as Docker
  participant Kubectl
  NemoclawCLI->>Docker: docker ps --format {{.Names}} (probe openshell-cluster-nemoclaw)
  Docker-->>NemoclawCLI: present / absent / error
  alt present
    NemoclawCLI->>Docker: docker exec openshell-cluster-nemoclaw kubectl ... (host-alias flow)
    Docker->>Kubectl: run kubectl
    Kubectl-->>NemoclawCLI: response
  else absent
    NemoclawCLI-->>NemoclawCLI: throw legacy gateway missing error
  else error
    NemoclawCLI-->>NemoclawCLI: throw docker probe failed error with docker info guidance
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#4608: Also updates legacy gateway gating for hosts-* flows; related changes to the same assertion path.

Suggested labels

NemoClaw CLI, area: cli, bug-fix, Docker

Suggested reviewers

  • cv
  • cjagwani
  • prekshivyas

Poem

"A rabbit peeks where Docker hums below,
A little ps probe says whether lights still glow.
If the gateway vanishes, I thump with clear intent,
If Docker's sleepy, I leave a helpful hint.
Hop, patch, and test — I guard the hosts, content." 🐰

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title accurately describes the main change: adding a docker ps probe to detect the legacy gateway container before attempting kubectl exec operations.
Linked Issues check ✅ Passed The PR implements the core objective from #4317: adding a runtime docker ps probe to check if the legacy gateway container is running, providing clear error messages instead of opaque Docker failures.
Out of Scope Changes check ✅ Passed All changes are directly related to the linked issue objective: implementing the legacy gateway container probe, refactoring validation logic, and adding comprehensive test coverage for probe failure modes.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/4317-probe-legacy-gateway-container

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: sandbox-operations-e2e
Optional E2E: vm-driver-privileged-exec-routing-e2e, sandbox-survival-e2e

Dispatch hint: sandbox-operations-e2e

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: medium

Required E2E

  • sandbox-operations-e2e (high): Closest existing live sandbox lifecycle coverage. This PR changes a user-facing sandbox command that interacts with the OpenShell gateway container and Kubernetes sandbox resources, so run the live sandbox operations suite to validate the CLI still works against a real gateway-backed sandbox.

Optional E2E

  • vm-driver-privileged-exec-routing-e2e (low): Adjacent hermetic coverage for docker/vm direct-container routing and Docker ps --format {{.Names}}-style container discovery. Useful because host aliases are intentionally unsupported for these drivers and the PR changes related gateway-vs-direct-driver diagnostics.
  • sandbox-survival-e2e (medium): Optional confidence for gateway-container lifecycle behavior around real sandbox survival and gateway restarts. The PR adds a gateway-container presence probe, but this job does not directly exercise host-alias commands.

New E2E recommendations

  • host-alias live sandbox coverage (high): No existing E2E found that directly runs nemoclaw <sandbox> hosts-add, hosts-list, and hosts-remove against a real legacy gateway sandbox and verifies the resulting Sandbox CR / in-sandbox host resolution. Unit CLI tests cover stubs, but not real Docker/kubectl/OpenShell integration.
    • Suggested test: Add a host-aliases-e2e job/script that onboards a legacy gateway-backed sandbox, adds a temporary hostname/IP, verifies hosts-list, verifies the sandbox resource or /etc/hosts effect after restart/rebuild if applicable, then removes the alias.
  • host-alias negative gateway diagnostics (medium): The PR introduces important user-facing failure classification for missing gateway containers, Docker daemon failures, Docker spawn ENOENT, and probe timeouts. Existing E2E coverage appears limited to unit-style CLI stubs.
    • Suggested test: Add hermetic or regression E2E coverage that dispatches host-alias commands with fake Docker states for missing gateway, Docker down, ENOENT, and timeout, asserting actionable messages and no docker exec kubectl attempt.

Dispatch hint

  • Workflow: E2E / Nightly
  • jobs input: sandbox-operations-e2e

@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

E2E Scenario Advisor Recommendation

Required scenario E2E: None
Optional scenario E2E: None

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

  • None. No scenario E2E files or scenario workflows changed. The product change is limited to sandbox host-alias CLI behavior, and the current scenario suite catalog does not include host-alias/hosts-add/hosts-list/hosts-remove coverage. The other change is a non-scenario test file outside test/e2e-scenario/.

Optional scenario E2E

  • None.

Relevant changed files

  • None.

@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

PR Review Advisor

Findings: 1 needs attention, 1 worth checking, 0 nice ideas
Since last review: 0 prior items resolved, 2 still apply, 0 new items found

Review findings

🛠️ Needs attention

  • Extract the legacy-gateway probe to avoid growing the host-aliases monolith (src/lib/actions/sandbox/host-aliases.ts:11): This PR adds the legacy-gateway probe state type, Docker probing, probe timeout, error-message classification, and validation helpers directly into an already large host-alias action file. The deterministic drift signal shows the file grows from 326 to 402 lines (+76), exceeding the repo's monolith-growth threshold. The new probe behavior is separable from host-alias patch construction and would be easier to test and maintain as a focused gateway-probe helper.
    • Recommendation: Move the LegacyGatewayProbe type, docker ps probing, timeout/classification logic, and probe-result-to-error mapping into a focused helper/module, or otherwise offset the production-file growth with equivalent extraction in this PR.
    • Evidence: src/lib/actions/sandbox/host-aliases.ts now owns LegacyGatewayProbe, HOST_ALIAS_DOCKER_PROBE_TIMEOUT_MS, probeLegacyGatewayContainer(), and additional absent/unknown Docker probe branches inside assertLegacyGatewayHostAliasSupport(); deterministic monolith delta is +76 lines.

🔎 Worth checking

  • Real Docker/gateway runtime validation is still worth checking (src/lib/actions/sandbox/host-aliases.ts:134): The stubbed CLI tests cover argv ordering and error classification well, but the linked issue and follow-up comment are platform/runtime-specific: Ubuntu 24.04 with Docker 29 and real legacy gateway container visibility. The new behavior depends on real docker ps --format {{.Names}}, container-name visibility, spawn failure modes, and gateway-present versus gateway-absent runtime state.
    • Recommendation: Add or identify a targeted runtime/integration validation plan for gateway present, gateway absent, and Docker probe failure cases on the affected Linux/Docker environment. Keep the current stubbed unit coverage, but make the real Docker behavior explicit in validation.
    • Evidence: The deterministic test-depth signal marks src/lib/actions/sandbox/host-aliases.ts as runtime_validation_recommended. Added tests use PATH-local Docker stubs for docker ps, ENOENT, timeout, daemon failure, and kubectl argv checks rather than a real Docker daemon/container.

🌱 Nice ideas

  • None.
Since last review details

Current findings:

  • Extract the legacy-gateway probe to avoid growing the host-aliases monolith (src/lib/actions/sandbox/host-aliases.ts:11): This PR adds the legacy-gateway probe state type, Docker probing, probe timeout, error-message classification, and validation helpers directly into an already large host-alias action file. The deterministic drift signal shows the file grows from 326 to 402 lines (+76), exceeding the repo's monolith-growth threshold. The new probe behavior is separable from host-alias patch construction and would be easier to test and maintain as a focused gateway-probe helper.
    • Recommendation: Move the LegacyGatewayProbe type, docker ps probing, timeout/classification logic, and probe-result-to-error mapping into a focused helper/module, or otherwise offset the production-file growth with equivalent extraction in this PR.
    • Evidence: src/lib/actions/sandbox/host-aliases.ts now owns LegacyGatewayProbe, HOST_ALIAS_DOCKER_PROBE_TIMEOUT_MS, probeLegacyGatewayContainer(), and additional absent/unknown Docker probe branches inside assertLegacyGatewayHostAliasSupport(); deterministic monolith delta is +76 lines.
  • Real Docker/gateway runtime validation is still worth checking (src/lib/actions/sandbox/host-aliases.ts:134): The stubbed CLI tests cover argv ordering and error classification well, but the linked issue and follow-up comment are platform/runtime-specific: Ubuntu 24.04 with Docker 29 and real legacy gateway container visibility. The new behavior depends on real docker ps --format {{.Names}}, container-name visibility, spawn failure modes, and gateway-present versus gateway-absent runtime state.
    • Recommendation: Add or identify a targeted runtime/integration validation plan for gateway present, gateway absent, and Docker probe failure cases on the affected Linux/Docker environment. Keep the current stubbed unit coverage, but make the real Docker behavior explicit in validation.
    • Evidence: The deterministic test-depth signal marks src/lib/actions/sandbox/host-aliases.ts as runtime_validation_recommended. Added tests use PATH-local Docker stubs for docker ps, ENOENT, timeout, daemon failure, and kubectl argv checks rather than a real Docker daemon/container.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26959715788
Target ref: 090df9effebf65d38a80bd01b50b66ed0e233232
Workflow ref: main
Requested jobs: sandbox-operations-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
sandbox-operations-e2e ✅ success

…s, drop fragile filter

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26960995428
Target ref: 35521549c95e783a7da107fabf7c4d2b93616dbb
Workflow ref: main
Requested jobs: sandbox-operations-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
sandbox-operations-e2e ✅ success

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@laitingsheng laitingsheng added bug-fix PR fixes a bug or regression platform: container Affects Docker, containerd, Podman, or images v0.0.59 Release target labels Jun 4, 2026
@prekshivyas prekshivyas self-assigned this Jun 4, 2026

@prekshivyas prekshivyas left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — approving. Verified the logic against source:

  • probeLegacyGatewayContainer() correctly uses the unfiltered docker ps --format '{{.Names}}' + exact in-code match, sidestepping the fragile --filter name=^...$ anchor. dockerSpawnSync (src/lib/adapters/docker/exec.ts) is just spawnSync("docker", ...), so .error/.code, .status, .stdout/.stderr are all valid, and encoding: "utf-8" makes the streams strings. The present | absent | unknown discriminator cleanly separates a missing gateway from a docker outage/timeout/permission error — which is exactly why dockerSpawnSync is the right call here over dockerCapture (you need to classify, not just capture).
  • The driver fast-path still short-circuits before any docker call, so direct-container drivers don't pay for a probe. Kubernetes/pre-openshellDriver entries with a running gateway return present → no behavior change; the only regression surface is the opaque "No such container" error, which is what this fixes.
  • Reordering arg validation ahead of the gateway probe is a real UX win (bad hostname/IP no longer masked by a missing gateway), and the test pins it (docker.log never created when validation fails).

Test coverage is excellent: pre-feature registry + gateway-down, validate-before-probe, ENOENT, timeout, and non-zero-exit are each asserted distinctly, and the argv-shape checks pin the no---filter contract. CI is fully green including dco-check and the e2e suite.

Nice, careful fix.

@cv cv merged commit 244a99d into main Jun 4, 2026
34 checks passed
@cv cv deleted the fix/4317-probe-legacy-gateway-container branch June 4, 2026 17:45
cv pushed a commit that referenced this pull request Jun 5, 2026
## Summary
- Add the v0.0.59 release notes from the GitHub announcement discussion.
- Refresh local inference and credential-storage guidance for the
current release behavior.
- Regenerate the user skills from the updated Fern docs.
- Tighten release-prep and docs review guidance for generated skills, PR
labels, and shared `$$nemoclaw` command placeholders.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `rg "permissive mode|shields down|shields up|shields status|config
rotate-token|rotate-token" --glob '*.{md,mdx}'`
- `git diff --check`
- `npm run docs` (rerun outside sandbox after sandbox-only `tsx` IPC
permission failure)
- `npm run typecheck:cli`
- Pre-commit hooks during commit passed, including markdownlint,
docs-to-skills verification, gitleaks, commitlint, and skills YAML
tests.

## Source Summary
- #3679, #4437, #4681, #4766, #4772, #4775, #4786 ->
`docs/about/release-notes.mdx`, `docs/reference/commands.mdx`,
`docs/reference/troubleshooting.mdx`: Summarize OpenClaw 2026.5.27
compatibility, runtime path pinning, plugin registry recovery, live
gateway reconciliation, and clearer host-alias/startup diagnostics.
- #4332, #4402, #4769, #4776, #4779 -> `docs/about/release-notes.mdx`,
`docs/inference/inference-options.mdx`,
`docs/inference/use-local-inference.mdx`,
`docs/inference/switch-inference-providers.mdx`: Document the release
inference changes covering Local NIM waits, Hermes Anthropic routing,
Nemotron 3 Ultra, the current Ollama starter fallback, and Spark
managed-vLLM context length.
- #4628, #4652, #4733, #4745 -> `docs/about/release-notes.mdx`,
`docs/security/credential-storage.mdx`,
`docs/manage-sandboxes/messaging-channels.mdx`,
`docs/reference/troubleshooting.mdx`: Capture permission healing,
gateway-stored credential reuse, cross-sandbox messaging credential
conflict checks, and CDI preflight diagnostics.
- #4728, #4737, #4743, #4744, #4782 -> `.agents/skills/nemoclaw-user-*`:
Regenerate the user skill references from the updated source docs.
- Follow-up maintenance ->
`.agents/skills/nemoclaw-contributor-update-docs/SKILL.md`,
`.coderabbit.yaml`: Add release-prep area labels for docs and skills
PRs, and teach docs review guidance that `$$nemoclaw` is the correct
shared command placeholder for examples that work across agent aliases.

Note: the `documentation` label was not present in the repository, so
this PR is labeled with `v0.0.59` only.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Documentation**
  * Updated default model for local Ollama inference setup to qwen3.5:9b
  * Added Nemotron 3 Ultra 550B as an NVIDIA Endpoints model option
* Clarified credential storage and reuse behavior for post-deployment
(day-two) operations
* Added v0.0.59 release notes covering OpenClaw compatibility, inference
options, Hermes messaging sync, and troubleshooting
* Clarified CLI selection guidance and updated OpenClaw version example
in status output
* Revised release-prep instructions and docs review guidance for CLI
alias usage
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix PR fixes a bug or regression platform: container Affects Docker, containerd, Podman, or images v0.0.59 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Ubuntu 24.04][CLI] nemoclaw hosts-list fails with Docker unknown shorthand flag

4 participants