
[Nemoclaw][All Platforms] Gateway-unreachable drift behavior unclear: nemoclaw list continues to show onboarded drift after killing sandbox container #3822

@zNeill

Description

With the sandbox container for the default sandbox stopped via docker kill, nemoclaw list still shows the live gateway drift annotation (onboarded: model=…, provider=…). The fallback to the onboard-time snapshot applies only when the gateway itself is unreachable. In the documented test, step 19 specifies stopping the gateway container (e.g. openshell-cluster-nemoclaw) and expects list to hide the drift line while the gateway is down. In practice, killing an individual sandbox container (e.g. openshell-prachi-s-…) does not make the gateway unreachable, so nemoclaw list continues to show the drift line for the default sandbox, which is confusing when following the test as written.

Concretely, after configuring multiple sandboxes, changing the gateway’s live inference route with openshell inference set, and then killing a sandbox container rather than the gateway container, nemoclaw list still shows:

    p... (onboarded: model=moonshotai/kimi-k2.6)

even though the test case says “With the gateway unreachable, list falls back cleanly to the stored onboard-time values — the (onboarded: …) drift line is NOT shown.”

Environment

  • Platform: Linux (e.g. Ubuntu 22.04 / 24.04 / 26.04)
  • GPU: Any
  • Docker: Installed and running (supported NemoClaw/OpenShell runtime)
  • NemoClaw CLI: Installed and working (e.g. via curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash)
  • OpenShell gateway: Running on the host with Docker; gateway container name in docs is openshell-cluster-nemoclaw.
  • Sandboxes:
    • sandbox-a → NVIDIA Cloud API / nvidia/nemotron-3-super-120b-a12b (provider nvidia-prod)
    • sandbox-b → OpenAI / gpt-5.4 (provider openai-api)
    • sandbox-c → Anthropic / claude-sonnet-4-6 (provider anthropic-prod)
  • All OpenShell helper binaries preserved on PATH (e.g. with --keep-openshell):
    • which openshell returns a valid path
    • which openshell-gateway and which openshell-sandbox also return valid paths.

Steps to Reproduce

Preconditions

  1. Ensure NemoClaw CLI is installed and Docker is running.
  2. Ensure OpenShell binaries are present on PATH:
    which openshell
    which openshell-gateway
    which openshell-sandbox
  3. Onboard three sandboxes with distinct providers and models:
    • sandbox-a: NVIDIA Cloud API / nvidia/nemotron-3-super-120b-a12b (provider nvidia-prod)
    • sandbox-b: OpenAI / gpt-5.4 (provider openai-api)
    • sandbox-c: Anthropic / claude-sonnet-4-6 (provider anthropic-prod)

    so that nemoclaw list reflects these values.


  4. Ensure the OpenShell gateway is running in Docker (e.g. container similar to openshell-cluster-nemoclaw).
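
The PATH precondition can be checked in one pass with a small helper. This is a minimal sketch, not part of NemoClaw or OpenShell: the function name missing_tools is hypothetical, and the tool names in the usage comment are the ones listed in the preconditions above.

```shell
# Hypothetical helper (not part of NemoClaw): print every required tool
# that cannot be resolved on PATH, one name per line.
missing_tools() {
  for tool in "$@"; do
    # command -v is the POSIX way to resolve a command name
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
    fi
  done
}

# Usage (tool names taken from the preconditions in this issue):
#   missing_tools docker nemoclaw openshell openshell-gateway openshell-sandbox
# Empty output means every precondition binary was found.
```

If the helper prints any name, fix the installation before running Part A; otherwise later failures may come from a missing binary rather than from nemoclaw list itself.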

Repro steps — Part A (list + SSH indicator)

  1. Run: nemoclaw list
  2. Verify all three sandboxes are listed.
  3. Verify each sandbox shows the correct name.
  4. Verify each sandbox shows the correct provider.
  5. Verify each sandbox shows the correct model.
  6. Verify applied policy presets are shown per sandbox.
  7. Verify exactly one sandbox row is marked as default with *.
  8. Destroy sandbox-b (e.g. nemoclaw sandbox-b destroy).
  9. Run: nemoclaw list
  10. Verify sandbox-b is gone; sandbox-a and sandbox-c remain.
  11. Verify sandbox-a and sandbox-c still show the correct provider/model.
  12. In a separate terminal, run:
    nemoclaw sandbox-a connect
    and keep that SSH session open.
  13. In the original terminal, run: nemoclaw list
  14. Observe the SSH session indicator for sandbox-a (●) and confirm it disappears after closing the SSH session and re-running nemoclaw list (if implemented as per docs).

Repro steps — Part B (live gateway inference + drift annotation)


  15. Identify the default sandbox (row marked with *) and call it sandbox-default. Note its onboard-time model and provider from the earlier nemoclaw list output (i.e. from ~/.nemoclaw/sandboxes.json).

  16. On the host (not inside any sandbox), change the OpenShell gateway inference route:
    openshell inference set \
      --provider nvidia-prod \
      --model z-ai/glm-5.1

  17. Confirm the live gateway route via:
    openshell inference get
    and note that it now differs from the onboard-time model/provider of sandbox-default.

  18. Run:
    nemoclaw list

  19. In a separate terminal, list Docker containers and stop a sandbox, not the gateway:
    docker ps
    # Example output:
    # CONTAINER ID   IMAGE ...   NAMES
    # 3b802bb39a07   openshell/sandbox-from:1779212437   openshell-prachi-s-ee55cb2f-0136-4e39-afc3-74fd41230b6b
    # 7dea14913437   openshell/sandbox-from:1779135649   openshell-ollama-b82c5a0e-54a6-4618-bb7f-e2c8f1fc6e7a
    docker kill $(docker ps -q --filter name=openshell-prachi-s)

    (This kills the sandbox container for prachi-s but does not stop the OpenShell gateway container.)

  20. Back in the original terminal, run:
    nemoclaw list

    Expected Result

    15–18. With the gateway up and the host-side openshell inference set modifying the live route:

    • nemoclaw list shows the default sandbox row with the live gateway model/provider (e.g. z-ai/glm-5.1, nvidia-prod) and an indented drift annotation line:
      (onboarded: model=moonshotai/kimi-k2.6, provider=nvidia-prod)
      reflecting the difference between the live gateway route and the onboard-time stored config.
    • Non-default sandbox rows continue to show their onboard-time values until they are connected again.

    19–20. When the gateway is unreachable (per the original test case, by killing the gateway container, e.g. docker kill $(docker ps -q --filter name=openshell-cluster-nemoclaw)):

    • NemoClaw cannot fetch live gateway state, so nemoclaw list should fall back entirely to the onboard-time snapshot from ~/.nemoclaw/sandboxes.json.
    • The default sandbox row should show only the stored model/provider, and no (onboarded: …) drift line should be printed.
    • nemoclaw list should not crash or emit a stack trace; it should report sandboxes based on stored metadata only.
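
The expected behavior in steps 15–20 reduces to a small decision rule, sketched below. This is a hypothetical illustration, not NemoClaw's implementation: the function name drift_annotation and its argument order are assumptions, and so is the choice to print only the fields that differ, which is modeled on the single-field annotation in the Actual Result output.

```shell
# Hypothetical sketch of the drift rule described in this issue; not NemoClaw's code.
# Args: gateway_reachable(yes|no) live_model live_provider stored_model stored_provider
drift_annotation() {
  reachable=$1 live_model=$2 live_provider=$3 stored_model=$4 stored_provider=$5
  # Gateway unreachable: no live state to compare against, so print nothing.
  [ "$reachable" = "yes" ] || return 0
  parts=""
  if [ "$live_model" != "$stored_model" ]; then
    parts="model=$stored_model"
  fi
  if [ "$live_provider" != "$stored_provider" ]; then
    # Join with ", " only when a model part is already present.
    parts="${parts:+$parts, }provider=$stored_provider"
  fi
  # Print the annotation only when at least one field drifted.
  if [ -n "$parts" ]; then
    printf '(onboarded: %s)\n' "$parts"
  fi
  return 0
}
```

Under this rule, killing a sandbox container changes none of the inputs: the gateway is still reachable and the live route still differs from the stored one, so the annotation stays.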

    Actual Result

    With only the sandbox container killed, and the OpenShell gateway still running, the user sees:

    local-lynnh@2u1g-b650-1386:~/NemoClaw$ docker ps
    CONTAINER ID   IMAGE   COMMAND   CREATED   STATUS   PORTS   NAMES
    3b802bb39a07   openshell/sandbox-from:1779212437   "/opt/openshell/bin/…"   29 minutes ago   Up 29 minutes   openshell-prachi-s-ee55cb2f-0136-4e39-afc3-74fd41230b6b
    7dea14913437   openshell/sandbox-from:1779135649   "/opt/openshell/bin/…"   22 hours ago   Up 22 hours   openshell-ollama-b82c5a0e-54a6-4618-bb7f-e2c8f1fc6e7a
    local-lynnh@2u1g-b650-1386:~/NemoClaw$ docker kill $(docker ps -q --filter name=openshell-prachi-s)
    3b802bb39a07
    local-lynnh@2u1g-b650-1386:~/NemoClaw$ nemoclaw list
    Sandboxes:
      ollama
        agent: openclaw model: qwen2.5:7b provider: ollama-local CPU sandbox policies: none
        dashboard: http://127.0.0.1:18789/
      prachi-s *
        agent: openclaw model: z-ai/glm-5.1 provider: nvidia-prod CPU sandbox policies: npm, pypi, huggingface, brew, brave
        (onboarded: model=moonshotai/kimi-k2.6)
        dashboard: http://127.0.0.1:18790/
    * = default sandbox

    Key differences / confusion points:

    • The test script’s step 19 suggests “Stop/block the gateway (e.g. docker kill $(docker ps -q --filter name=openshell-cluster-nemoclaw)), then run nemoclaw list,” but the user instead kills a sandbox container and still sees the drift annotation.
    • From the user’s perspective, following the test “by killing a container” seems to satisfy the “gateway unreachable” condition, yet nemoclaw list continues to show the (onboarded: …) drift line, contrary to the Expected section that says it should disappear while the gateway is down.

    In other words:

    • nemoclaw list correctly continues to show drift when the gateway is still alive, but the test script as written can be misinterpreted; killing a sandbox container does not truly test the “gateway unreachable” path.
    • This creates a mismatch between the documented expectations (“drift line is NOT shown when gateway unreachable”) and the behavior seen when the user follows the steps using a sandbox container instead of the gateway container.

    A fix could be either:

    • Clarify the docs/test to explicitly require killing/restarting the gateway container (not any openshell- container) when validating step 19; and/or
    • Make nemoclaw list explicitly indicate when it is using gateway state vs offline snapshot state, to reduce confusion when containers are partially stopped.
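
For the first option, a tester can confirm which kind of container is about to be stopped before running docker kill. The sketch below is a hypothetical helper: the function name classify_container is an assumption, and the name prefixes are inferred only from the examples in this issue (openshell-cluster-nemoclaw for the gateway, openshell-prachi-s-… and openshell-ollama-… for sandboxes), so they may not hold for every deployment.

```shell
# Hypothetical helper; the name prefixes are inferred from the examples in this issue.
classify_container() {
  case "$1" in
    openshell-cluster-*) echo gateway ;;  # e.g. openshell-cluster-nemoclaw
    openshell-*)         echo sandbox ;;  # e.g. openshell-prachi-s-<uuid>
    *)                   echo other ;;
  esac
}

# Usage (requires a running Docker daemon):
#   docker ps --format '{{.Names}}' | while read -r name; do
#     printf '%s\t%s\n' "$(classify_container "$name")" "$name"
#   done
```

Only a container classified as gateway exercises the "gateway unreachable" path of step 19; stopping one classified as sandbox leaves the live route readable and the drift line visible.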

    Bug Details

    Field        Value
    Priority     Unprioritized
    Action       Dev - Open - To fix
    Disposition  Open issue
    Module       Machine Learning - NemoClaw
    Keyword      NemoClaw, NEMOCLAW_GH_SYNC_APPROVAL

    [NVB#6192607]

    Metadata

    Labels

    NV QA (Bugs found by the NVIDIA QA Team)
    area: cli (Command line interface, flags, terminal UX, or output)
