[All Platforms][Docs] reference/architecture.md "Deployment Topology" describes k3s-pod-in-container model that doesn't match v0.0.44 docker driver #3721

@zNeill

Description

The "Deployment Topology" section of docs/reference/architecture.md
(rendered at https://docs.nvidia.com/nemoclaw/latest/reference/architecture.html
under the "Deployment Topology" heading) shows a mermaid diagram and a 5-row
"Layering" table that describes the following architecture:

  Host CLI  →  Docker daemon  →  OpenShell gateway container  →
  Embedded k3s cluster  →  Sandbox pod (Landlock + seccomp + netns)

The intro text is unconditional ("NemoClaw uses a Docker daemon. The
OpenShell gateway runs as a container that embeds a k3s cluster. The sandbox
runs as a Kubernetes pod inside that embedded cluster.").

On v0.0.44 / OpenShell 0.0.39, the actual runtime is the `docker` driver and
the deployment looks like this:

  Host CLI  →  Docker daemon
                 │
                 ├── openshell-gateway  (host PROCESS, not container)
                 │     pid 142292, /localhome/.../openshell-gateway,
                 │     endpoint http://127.0.0.1:8080,
                 │     driver = "docker"
                 │
                 └── openshell--  (Docker CONTAINER,
                      image openshell/sandbox-from:, runs the agent
                      directly — no k3s, no pod, no kubernetes scheduling)

i.e.:
  • the OpenShell gateway is a HOST process, not a Docker container
  • there is NO embedded k3s cluster anywhere
  • the sandbox is a plain Docker container, not a Kubernetes pod
  • Landlock / seccomp / netns are applied to the container, not to a pod

A reader using the docs to reason about the security boundary, the host
attack surface, or the recovery flow will form an incorrect mental model of
where each component lives.

Probable root cause: the diagram describes a non-default driver (likely the
"vm" driver, which historically did embed k3s inside a container), whereas
the OpenShell version pinned by `nemoclaw-blueprint/blueprint.yaml`
(`min_openshell_version: "0.0.39"`, `max_openshell_version: "0.0.39"`) ships
with the docker driver active by default.

Environment
Device:        ipp2-1558 (10.176.178.100), x86_64 server, 32 vCPU / 125 GB RAM, NVIDIA A100 80GB PCIe
OS:            Ubuntu 24.04.4 LTS (Linux 6.17.0-23-generic)
Architecture:  x86_64
Node.js:       v22.x (installed via nvm by NemoClaw installer)
npm:           bundled
Docker:        29.5.0
OpenShell CLI: 0.0.39
NemoClaw:      v0.0.44
OpenClaw:      N/A (docs-only bug)

Steps to Reproduce
1. Open https://docs.nvidia.com/nemoclaw/latest/reference/architecture.html
   and scroll to "Deployment Topology". Read both the mermaid diagram and
   the 5-row layering table immediately below.
2. Confirm the actual runtime on a v0.0.44 host:

     cat ~/.local/state/nemoclaw/openshell-docker-gateway/runtime.json
     ps -p $(cat ~/.local/state/nemoclaw/openshell-docker-gateway/openshell-gateway.pid) -o pid,cmd
     docker ps --format "{{.Names}}\t{{.Image}}"

3. Look for any embedded k3s cluster:

     docker ps | grep -iE "k3s|gateway"      # nothing matches
     pgrep -af k3s                            # nothing

4. Confirm the OpenShell version pin:

     grep -E "min_openshell|max_openshell" \
       ~/.nemoclaw/source/nemoclaw-blueprint/blueprint.yaml

Expected Result
The "Deployment Topology" diagram and the layering table accurately
describe what runs where on a v0.0.44 host with the default driver. Either
the doc presents the `docker` driver as the default and labels the k3s
variant as an alternative, or it shows both side-by-side and explains when
each is used.

Actual Result
Step 2 on this host:

  runtime.json:
    "driver": "docker"
    "endpoint": "http://127.0.0.1:8080"
    "openshellVersion": "0.0.39"
    "gatewayBin": "/localhome/local-glennz/.local/bin/openshell-gateway"
    "pid": 142292

  ps:
    PID    CMD
    142292 /localhome/local-glennz/.local/bin/openshell-gateway

  docker ps:
    openshell-hermes-bug-  openshell/sandbox-from:1779088500  Up 2 hours

  →  gateway is a host process (pid 142292), not a Docker container.
  →  the only Docker container is the sandbox itself.
  →  there is no separate gateway container, no embedded k3s, no pod
     scheduler.

Step 3:
  → no k3s anywhere on the host.

Step 4:
  min_openshell_version: "0.0.39"
  max_openshell_version: "0.0.39"

  → the pinned-and-only-supported OpenShell version is 0.0.39, which is
    the version that uses the docker driver. The doc topology cannot be
    realised on the supported OpenShell version.
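
The interpretation of steps 2 to 4 can be sketched as a small script. This is
only an illustration under stated assumptions: the function names
(`summarize_topology`, `pinned_openshell_version`) are hypothetical and not
part of NemoClaw or OpenShell, the sample inputs are abridged copies of the
output shown in this report, and treating "a pid in runtime.json plus no
matching container" as "host process" is a heuristic, not an official check.

```python
import json
import re

# Sample inputs copied (abridged) from the Actual Result and Logs sections.
RUNTIME_JSON = """{
  "pid": 142292,
  "driver": "docker",
  "endpoint": "http://127.0.0.1:8080",
  "gatewayBin": "/localhome/local-glennz/.local/bin/openshell-gateway",
  "openshellVersion": "0.0.39"
}"""

DOCKER_PS_NAMES = [
    "openshell-hermes-bug-ccd4651e-ff25-430c-8665-413bec910f84",
]

BLUEPRINT_SNIPPET = """min_openshell_version: "0.0.39"
max_openshell_version: "0.0.39"
"""


def summarize_topology(runtime_text, container_names):
    """Summarize where the gateway runs (hypothetical helper, heuristic only)."""
    runtime = json.loads(runtime_text)
    # Same pattern as the `grep -iE "k3s|gateway"` check in step 3.
    suspects = [
        name for name in container_names
        if re.search(r"k3s|gateway", name, re.IGNORECASE)
    ]
    return {
        "driver": runtime.get("driver"),
        # Assumption: a recorded pid and no gateway/k3s container means the
        # gateway is a plain host process.
        "gateway_is_host_process": runtime.get("pid") is not None and not suspects,
        "k3s_or_gateway_containers": suspects,
    }


def pinned_openshell_version(blueprint_text):
    """Return the single supported version if the min and max pins match, else None."""
    pins = dict(re.findall(r'(min|max)_openshell_version:\s*"([^"]+)"', blueprint_text))
    if "min" in pins and pins.get("min") == pins.get("max"):
        return pins["min"]
    return None


if __name__ == "__main__":
    print(summarize_topology(RUNTIME_JSON, DOCKER_PS_NAMES))
    print(pinned_openshell_version(BLUEPRINT_SNIPPET))
```

On a real host the two text inputs would come from the `cat` and `docker ps`
commands in step 2 and the `grep` in step 4; the sketch deliberately does not
call Docker itself.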

Specific incorrect claims in the layering table:
  • "Docker daemon ... Runs the OpenShell gateway container."
    Wrong — Docker runs the sandbox container; the gateway is a host process.
  • "Gateway container ... Docker container ... Hosts the credential store,
     the L7 proxy, and the embedded k3s control plane."
    Wrong — there is no gateway container and no embedded k3s.
  • "k3s ... Process tree inside the gateway container ... Kubernetes
     control plane that schedules the sandbox pod."
    Wrong — k3s does not run on the docker driver.
  • "Sandbox pod ... Pod in the embedded k3s cluster ..."
    Wrong — sandbox is a Docker container, not a Kubernetes pod.

Logs
$ cat ~/.local/state/nemoclaw/openshell-docker-gateway/runtime.json
{
  "version": 1,
  "pid": 142292,
  "driver": "docker",
  "platform": "linux",
  "arch": "x64",
  "endpoint": "http://127.0.0.1:8080",
  "desiredEnvHash": "eaa3f1da31b0055ba16cb068e20b2ff87cceaf470f4325361ffc3fe1a06bab35",
  "gatewayBin": "/localhome/local-glennz/.local/bin/openshell-gateway",
  "openshellVersion": "0.0.39",
  "dockerHost": "unix:///run/docker.sock",
  "createdAt": "2026-05-18T07:14:30.387Z"
}

$ docker ps --format "{{.Names}}\t{{.Image}}\t{{.Status}}"
openshell-hermes-bug-ccd4651e-ff25-430c-8665-413bec910f84	openshell/sandbox-from:1779088500	Up 2 hours

Suggested Fix
Two options:

(1) Rewrite the "Deployment Topology" section to describe the docker driver
    (the default and the only driver supported by the pinned OpenShell
    0.0.39). The corrected layering table would look like:

       Host CLI       Host process (nemoclaw on Node.js)
       Docker daemon  Host service
       OpenShell      Host process (openshell-gateway), hosts the
       gateway        credential store + L7 proxy. Endpoint
                      http://127.0.0.1:8080.
       Sandbox        Docker container (openshell/sandbox-from:),
       container      runs the OpenClaw agent and NemoClaw plugin
                      under Landlock + seccomp + container netns.

    The mermaid diagram should drop the GWCON / K3S / POD nesting and
    instead show Docker daemon on the same level as the gateway process,
    with one or more sandbox containers as Docker peers of the gateway.

(2) If both drivers (`docker` and the legacy k3s-pod-in-container variant)
    are still meant to be documented, present them side-by-side as two
    deployment modes and call out which one the default OpenShell 0.0.39
    install uses. Make it clear that the diagram with embedded k3s does
    NOT apply to a default v0.0.44 install.

Either way the unconditional claim in the section intro ("NemoClaw uses a
Docker daemon. The OpenShell gateway runs as a container that embeds a k3s
cluster.") must be removed or made conditional, because it is false for
every v0.0.44 install on the pinned OpenShell 0.0.39.
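
To keep the unconditional wording from creeping back after either fix, a docs
check along the following lines might help. It is a rough sketch, not an
existing NemoClaw tool: `find_stale_claims` and the phrase list are
hypothetical, the phrases are taken from the claims quoted in this report,
and matching is per line, so a phrase wrapped across two lines is missed.

```python
import re

# Phrases quoted in this report as wrong for the docker driver.
# Illustrative only; a real check would need a reviewed list.
STALE_PATTERNS = [
    r"embeds a k3s cluster",
    r"embedded k3s",
    r"gateway container",
    r"Kubernetes pod",
]


def find_stale_claims(text):
    """Return (line_number, line) pairs that mention the k3s-era topology."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if any(re.search(pattern, line, re.IGNORECASE) for pattern in STALE_PATTERNS):
            hits.append((number, line.strip()))
    return hits


# The section intro quoted in this report, plus one line that should pass.
SAMPLE = """NemoClaw uses a Docker daemon. The
OpenShell gateway runs as a container that embeds a k3s cluster. The sandbox
runs as a Kubernetes pod inside that embedded cluster.
The gateway runs as a host process."""

if __name__ == "__main__":
    for number, line in find_stale_claims(SAMPLE):
        print(f"line {number}: {line}")
```

If option (2) is chosen, the k3s wording is legitimate inside the section
for the alternative mode, so such a check would need to be limited to the
text that describes the default install.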

Bug Details

Field        Value
Priority     Unprioritized
Action       Dev - Open - To fix
Disposition  Open issue
Module       Machine Learning - NemoClaw
Keyword      NemoClaw, NemoClaw_Docs, NEMOCLAW_GH_SYNC_APPROVAL

[NVB#6186777]

Labels: NV QA, area: docs
