
[Ubuntu 22.04][Onboard] Multi-sandbox onboard's gateway drift detection falsely judges containerized-compat gateway as stale, recreates and collides on port 8080 #4520

@cr7258

Description


On Linux hosts whose glibc is older than the openshell-gateway binary requires (e.g. Ubuntu 22.04 with glibc 2.35 vs. the gateway's requirement of 2.39+), NemoClaw auto-selects the containerized-compat gateway launch mode (`docker run --rm --name nemoclaw-openshell-gateway --network host ubuntu:24.04 /opt/nemoclaw/openshell-gateway`). The first onboard succeeds and the gateway runs in a Docker container with `/proc/<pid>/exe = /usr/bin/docker`.
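The launch-mode selection described above comes down to a version comparison. A minimal sketch in Python, assuming the 2.39 threshold quoted in this report; the names `parse_version` and `select_launch_mode` and the mode labels are hypothetical, not NemoClaw's actual code:

```python
import platform


def parse_version(text):
    """Turn '2.35' into (2, 35) so that versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def select_launch_mode(host_glibc, required_glibc="2.39"):
    """Pick 'host' when the host glibc is new enough, else the compat container."""
    if parse_version(host_glibc) >= parse_version(required_glibc):
        return "host"
    return "containerized-compat"


if __name__ == "__main__":
    # platform.libc_ver() returns (lib, version); version is '' on non-glibc systems.
    lib, version = platform.libc_ver()
    if lib == "glibc" and version:
        print(version, "->", select_launch_mode(version))
    else:
        print("glibc version could not be determined")
```

Comparing parsed tuples rather than raw strings matters here: as strings, "2.4" would sort after "2.39".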

When a SECOND `nemoclaw onboard` runs on the same host (multi-sandbox flow), the product's drift detection compares the running gateway's executable path against the *host-mode* binary path and incorrectly judges the live gateway as stale. It then triggers `recreate` — spawning a NEW gateway process before the OLD one has been stopped — and the new process cannot bind port 8080 because the old container still holds it. Onboard aborts non-zero.
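One way to avoid the false positive is to make the expected executable depend on the launch mode the gateway was started in. The following is a sketch under that assumption only; `RunningGateway`, `expected_exe`, `is_stale`, and `plan_gateway_action` are hypothetical names, and the real drift check may track different signals (for example the container name):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunningGateway:
    exe_path: str     # target of /proc/<pid>/exe for the tracked process
    launch_mode: str  # "host" or "containerized-compat" (hypothetical labels)


def expected_exe(launch_mode, host_binary, docker_cli="/usr/bin/docker"):
    """In compat mode the tracked process is the docker CLI, not the host binary."""
    if launch_mode == "containerized-compat":
        return docker_cli
    return host_binary


def is_stale(gateway, host_binary):
    """Stale only if the executable differs from what this launch mode implies."""
    return gateway.exe_path != expected_exe(gateway.launch_mode, host_binary)


def plan_gateway_action(gateway, host_binary, healthy):
    """Reuse a healthy, non-stale gateway; otherwise recreate it."""
    if healthy and not is_stale(gateway, host_binary):
        return "reuse"
    return "recreate"
```

With the values from this report (executable `/usr/bin/docker`, compat mode, expected host path `/home/gitlab-runner/.local/bin/openshell-gateway`), this check returns "reuse" instead of "recreate".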

Environment

OS:            Ubuntu 22.04.5 LTS
Architecture:  x86_64
Node.js:       v22.22.3
npm:           10.9.8
Docker:        Docker Engine 29.4.1
OpenShell CLI: openshell 0.0.44
NemoClaw:      v0.0.53
OpenClaw:      2026.5.22
glibc:         2.35 (Ubuntu)

Steps to Reproduce

Verified live on the cloud-service-qa nemoclaw-test ubuntu22 runner (10.6.11.114, glibc 2.35) on 2026-05-29.

1. Run `nemoclaw onboard --non-interactive --yes-i-accept-third-party-software` for the default sandbox (`my-assistant`). The first gateway starts in container-compat mode: `docker run … nemoclaw-openshell-gateway … ubuntu:24.04 /opt/nemoclaw/openshell-gateway`. Both gateway and sandbox become healthy.

2. Run a SECOND `nemoclaw onboard --non-interactive --yes-i-accept-third-party-software` for a different sandbox name (`my-assistant-beta-t5882265`).

3. Watch the [2/8] Starting OpenShell gateway step.

Expected Result

The second onboard reuses the existing healthy gateway: no recreate and no port 8080 collision. Both sandboxes reach the Ready state and inference is reachable from both concurrently. (Per spec 5.3.4-sandbox-lifecycle.md: "second sandbox onboard must reuse the existing gateway"; recreating with new TLS certs would invalidate the first sandbox.)

Actual Result

The second onboard aborts with a non-zero exit code. Full trace:

  [non-interactive] Agent: OpenClaw
    NemoClaw Onboarding (non-interactive mode)
    ===================
    [1/8] Preflight checks
    ✓ Docker is running / DNS / runtime ok / openshell 0.0.44 / port 8080 owned by healthy NemoClaw runtime
    [2/8] Starting OpenShell gateway
    Existing OpenShell Docker-driver gateway is stale
      (executable=/usr/bin/docker (expected /home/gitlab-runner/.local/bin/openshell-gateway));
      it will be recreated.
    ⚠ Gateway will be recreated when sandbox creation starts — this will affect running sandboxes.
    !! Port 8080 is not available.
       OpenShell gateway needs this port.
       Blocked by: openshell (PID 181769)
       Detail: sudo lsof reports openshell (PID 181769) listening on port 8080
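The collision in the trace above follows from starting the new gateway while the old one still holds the port. Where a recreate is genuinely needed, a stop-then-wait-then-start ordering avoids it. A minimal sketch, assuming hypothetical `stop_old` and `start_new` callables supplied by the caller; `port_is_free` and `recreate_gateway` are illustrative names, not part of NemoClaw or OpenShell:

```python
import socket
import time


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can currently bind (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True


def recreate_gateway(stop_old, start_new, port=8080, timeout=10.0, poll=0.2):
    """Stop the old gateway, wait for the port to be released, then start the new one."""
    stop_old()
    deadline = time.monotonic() + timeout
    while not port_is_free(port):
        if time.monotonic() >= deadline:
            raise TimeoutError(f"port {port} still in use after {timeout}s")
        time.sleep(poll)
    start_new()
```

If the old process cannot be stopped, as with PID 181769 in the teardown output, this version fails with a timeout before spawning anything rather than leaving a second gateway process behind.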

Teardown (`nemoclaw uninstall --yes`) sequence on the same runner:
    Stopped host openshell-gateway processes 181005  ← first onboard's gateway, stopped cleanly
    Failed to stop host openshell-gateway processes 181769  ← orphaned recreate-spawned process, cannot be killed

Test impact: T5882265 beforeAll fails ("second sandbox onboard must reuse the existing gateway"), 3 sub-tests get skipped via skipRestOfSuiteAfterFailure.

Bug Details

| Field | Value |
| --- | --- |
| Priority | Unprioritized |
| Action | Dev - Open - To fix |
| Disposition | Open issue |
| Module | Machine Learning - NemoClaw |
| Keyword | NemoClaw, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Onboard |

[NVB#6240888]

Labels

- NV QA: Bugs found by the NVIDIA QA Team
- area: install: Install, setup, prerequisites, or uninstall flow
- area: onboarding: Onboarding FSM, provider setup, sandbox launch, or first-run flow
- area: sandbox: OpenShell sandbox lifecycle, runtime, config, or recovery
- platform: ubuntu: Affects Ubuntu Linux environments
