[All Platforms][Docs] reference/troubleshooting.md recovery sections reference three non-existent openshell gateway subcommands (start, trust, destroy) #3685

@zNeill

Description

docs/reference/troubleshooting.md (rendered at
https://docs.nvidia.com/nemoclaw/latest/reference/troubleshooting.html)
recommends three `openshell gateway` subcommands in its recovery flows that
do not exist on the shipped OpenShell 0.0.39 / NemoClaw v0.0.44 toolchain:

  1. `openshell gateway start --name nemoclaw`
     — in "Reconnect after a host reboot" (commands.md:518, troubleshooting
       page section "Runtime" → "Reconnect after a host reboot")

  2. `openshell gateway trust -g nemoclaw`
     — in "Sandbox creation reports a TLS certificate mismatch"
       (troubleshooting.md:606)

  3. `openshell gateway destroy` + `openshell gateway start`
     — in "k3s cannot find a freshly built image" under the DGX Spark
       section (troubleshooting.md:1042-1043)

The valid `openshell gateway` subcommands on OpenShell 0.0.39 are:

  add, remove, login, logout, select, info, list

Each of `start`, `trust`, and `destroy` fails with the same error, naming
the subcommand that was passed. For `start`:

  error: unrecognized subcommand 'start'
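
This kind of mismatch can be caught mechanically. The following is a minimal sketch, assuming the subcommand list reported in this issue for OpenShell 0.0.39; `find_unknown_gateway_subcommands` and `VALID_GATEWAY_SUBCOMMANDS` are hypothetical names used only for illustration, not part of any NemoClaw tooling.

```shell
#!/usr/bin/env bash
# Sketch: flag `openshell gateway <sub>` references in a docs file that are
# not in the subcommand list reported for OpenShell 0.0.39 in this issue.
# The helper and variable names are hypothetical.

VALID_GATEWAY_SUBCOMMANDS="add remove login logout select info list"

find_unknown_gateway_subcommands() {
  local file="$1" sub
  # Extract the word that follows "openshell gateway", de-duplicated.
  # Option-only invocations such as `openshell gateway --help` do not match.
  grep -oE 'openshell gateway [a-z][a-z-]*' "$file" \
    | awk '{ print $3 }' \
    | sort -u \
    | while read -r sub; do
        case " $VALID_GATEWAY_SUBCOMMANDS " in
          *" $sub "*) ;;               # known subcommand, nothing to report
          *) printf '%s\n' "$sub" ;;   # not in the 0.0.39 list
        esac
      done
}
```

Running `find_unknown_gateway_subcommands docs/reference/troubleshooting.md` would print one unknown subcommand per line. A hard-coded list goes stale as the CLI changes, so a docs check would probably want to refresh it per release.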

Impact is high because all three references live inside symptom-fix flows
that a user runs during a real failure (host reboot, TLS reset, k3s image
cache issue on DGX Spark). They hit the "unrecognized subcommand" error
immediately and are left stuck.

The rest of the page checks out. I tested 75 H2/H3 sections, 81 code blocks,
77 same-page anchors (all resolve), 48 internal links (all return HTTP 200),
and 16 external links (all return HTTP 200). I also verified the prose
command references for nemoclaw onboard / rebuild / list / status /
policy-add / channels / inference / debug / gc / uninstall / tunnel,
openshell sandbox list/delete, openshell forward start/list, and openshell
term against the live CLIs; those are correct. Drift is isolated to the
three `openshell gateway` recovery commands.

Environment
Device:        ipp2-1558 (10.176.178.100), x86_64 server, 32 vCPU / 125 GB RAM, NVIDIA A100 80GB PCIe
OS:            Ubuntu 24.04.4 LTS (Linux 6.17.0-23-generic)
Architecture:  x86_64
Node.js:       v22.x (installed via nvm by NemoClaw installer)
npm:           bundled
Docker:        29.5.0
OpenShell CLI: 0.0.39
NemoClaw:      v0.0.44
OpenClaw:      N/A (docs-only bug)

Steps to Reproduce
1. Open https://docs.nvidia.com/nemoclaw/latest/reference/troubleshooting.html
2. In "Runtime" → "Reconnect after a host reboot", read step 3:

     $ openshell gateway start --name nemoclaw

3. In "Runtime" → "Sandbox creation reports a TLS certificate mismatch",
   read the recovery snippet:

     $ openshell gateway trust -g nemoclaw
     $ nemoclaw onboard --resume

4. In "DGX Spark" → "k3s cannot find a freshly built image", read the
   recovery snippet:

     $ openshell gateway destroy
     $ openshell gateway start

5. Run each of the three subcommands against OpenShell 0.0.39:

     $ openshell gateway start --name nemoclaw
     $ openshell gateway trust -g nemoclaw
     $ openshell gateway destroy

6. List the real subcommands:

     $ openshell gateway --help

Expected Result
Every command typed in a troubleshooting recovery flow resolves to a real
`openshell` subcommand. The recovery flows produce a working result on
v0.0.44.

Actual Result
Step 5 output:

  $ openshell gateway start --name nemoclaw
    error: unrecognized subcommand 'start'
    Usage: openshell gateway [OPTIONS] [COMMAND]

  $ openshell gateway trust -g nemoclaw
    error: unrecognized subcommand 'trust'
    Usage: openshell gateway [OPTIONS] [COMMAND]

  $ openshell gateway destroy
    error: unrecognized subcommand 'destroy'
    Usage: openshell gateway [OPTIONS] [COMMAND]

Step 6 (real `openshell gateway --help` on 0.0.39):

  COMMANDS
    add     Add an existing gateway
    remove  Remove a local gateway registration
    login   Authenticate with an edge-authenticated or OIDC gateway
    logout  Clear stored authentication credentials for a gateway
    select  Select the active gateway
    info    Show gateway registration details
    list    List registered gateways

→ start / trust / destroy are all absent.
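
The list of real subcommands can be read from the help output rather than typed by hand. This is a rough sketch, assuming the layout quoted above (a `COMMANDS` header followed by indented "name, description" rows); the real help text may be formatted differently, and `list_gateway_subcommands` is a hypothetical helper name.

```shell
# Sketch: derive subcommand names from `openshell gateway --help` output.
# Assumes the COMMANDS layout quoted in this issue; adjust the header match
# if the actual help text differs.
list_gateway_subcommands() {
  awk '
    /^[[:space:]]*COMMANDS/ { in_commands = 1; next }  # section header
    in_commands && NF == 0  { in_commands = 0 }        # blank line ends it
    in_commands             { print $1 }               # first word is the name
  '
}
```

Usage would look like `openshell gateway --help | list_gateway_subcommands`, which prints one subcommand per line for comparison against the names used in the docs.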

Net effect: a user following the troubleshooting docs hits "unrecognized
subcommand" on the very first command of three different recovery flows.

Logs
$ openshell --version
openshell 0.0.39

$ openshell gateway list
  NAME      ENDPOINT               TYPE   AUTH
* nemoclaw  http://127.0.0.1:8080  local  plaintext

$ openshell gateway start --name nemoclaw
error: unrecognized subcommand 'start'

  tip: a similar subcommand exists: 'select'

Usage: openshell gateway [OPTIONS] [COMMAND]

For more information, try '--help'.

Suggested Fix
For each broken reference, pick the actual maintained recovery flow and
update troubleshooting.md to match. Likely correct replacements (subject to
confirmation from the OpenShell team):

(1) "Reconnect after a host reboot" step 3:
    Replace
        $ openshell gateway start --name nemoclaw
    with whatever the supported "bring the local gateway container back up"
    flow is. Candidates:
      • `docker start <gateway-container>` if the gateway runs as a
        long-lived docker container restored from disk state, OR
      • re-run `nemoclaw onboard --resume` to walk through gateway
        bring-up, OR
      • a sequence using `openshell gateway remove nemoclaw` followed by
        `openshell gateway add http://127.0.0.1:8080 --local --name nemoclaw`
        to re-register a still-running container.
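
For reviewers who want to try the third candidate, here is a dry-run sketch. Both commands and their flags are copied from the bullet above and are not confirmed by the OpenShell team; `DRY_RUN`, `run`, and `reregister_gateway` are hypothetical names introduced only for this example.

```shell
# Sketch of the re-register candidate from item (1). The openshell commands
# are quoted from this issue and are NOT confirmed as the supported flow.
DRY_RUN="${DRY_RUN:-1}"   # default: only print what would run

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"   # show the command instead of executing it
  else
    "$@"
  fi
}

reregister_gateway() {
  local name="$1" endpoint="$2"
  # Drop the stale registration, then add the still-running gateway back.
  run openshell gateway remove "$name" &&
    run openshell gateway add "$endpoint" --local --name "$name"
}
```

Calling `reregister_gateway nemoclaw http://127.0.0.1:8080` prints the two commands by default. Setting `DRY_RUN=0` executes them, which should only be done once the sequence has been confirmed.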

(2) "Sandbox creation reports a TLS certificate mismatch":
    Replace
        $ openshell gateway trust -g nemoclaw
    with the actual cert-refresh path on 0.0.39. Either
      • `openshell gateway login nemoclaw` for edge-authenticated gateways
        (re-runs the login flow and re-establishes trust), OR
      • `openshell gateway remove nemoclaw` then `openshell gateway add ...`
        to re-derive and re-store the gateway's TLS material.

(3) "k3s cannot find a freshly built image" (DGX Spark):
    Replace
        $ openshell gateway destroy
        $ openshell gateway start
    with the supported "tear down and re-create the local gateway"
    sequence. Likely:
      • `openshell gateway remove nemoclaw`, then re-run `nemoclaw onboard`
        (or a docker-level restart of the gateway container).

In all three cases, please also add a short sentence saying which OpenShell
CLI version is required, so a reader on a newer OpenShell can recognize
that they may have a different command set.
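
A version note in the docs could be paired with a small check like the one below. It is a sketch only, assuming `openshell --version` prints `openshell <version>` as in the Logs section and that GNU `sort -V` is available; the helper names are hypothetical.

```shell
# Sketch: compare the installed OpenShell CLI version with the version a
# docs page was verified against (0.0.39 in this issue).
openshell_version_from() {
  # Reads `openshell --version` output on stdin, prints the version field.
  awk '$1 == "openshell" { print $2; exit }'
}

version_newer_than() {
  # Succeeds (exit 0) when $1 is strictly newer than $2, using version sort.
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Only runs when the CLI is actually installed.
if command -v openshell >/dev/null 2>&1; then
  installed=$(openshell --version | openshell_version_from)
  if version_newer_than "$installed" "0.0.39"; then
    echo "OpenShell $installed is newer than 0.0.39; the command set may differ."
  fi
fi
```

The comparison relies on version sort rather than string comparison, so `0.0.9` is treated as older than `0.0.39`.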

Bug Details

Priority:     Unprioritized
Action:       Dev - Open - To fix
Disposition:  Open issue
Module:       Machine Learning - NemoClaw
Keyword:      NemoClaw, NemoClaw_Docs, NEMOCLAW_GH_SYNC_APPROVAL

[NVB#6186667]

Metadata

Assignees

Labels

NV QA (Bugs found by the NVIDIA QA Team)
area: docs (Documentation, examples, guides, or docs build)
v0.0.64 (Release target)

Type

No fields configured for Bug.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests