fix: make Dockerfile replaceConfigFile patch robust for 2026.4.24 by BenediktSchackenberg · Pull Request #2696 · NVIDIA/NemoClaw

BenediktSchackenberg · 2026-04-29T19:15:01Z

Summary

update the Dockerfile patch for replaceConfigFile to use a whitespace-tolerant regex around the tryWriteSingleTopLevelIncludeMutation + writeConfigFile(params.nextConfig) block
keep the same EACCES sandbox handling behavior, but avoid brittle exact-string matching

Why

The bundled OpenClaw 2026.4.24 code path is correct, but minor formatting differences in compiled dist files can make strict literal matching fail with:

writeConfigFile(params.nextConfig) pattern not found

This change keeps the patch aligned with the 2026.4.24 API while making rebuilds resilient to formatting drift.

Summary by CodeRabbit

New Features
- Agent-branded CLI names; new list (JSON) and share (mount/unmount/status) commands; structured sandbox inventory shows agent and optional live inference.
Bug Fixes
- Stronger onboarding validations (agent binary/health, Dockerfile --from), better local/provider probing with retries/diagnostics, improved GPU reporting and TLS mismatch detection.
Chores
- Build/runtime adjustments: safer image patching, preload fix, startup gateway token, unified sandbox config layout, added timeouts and Hugging Face credential passthrough.

copy-pr-bot · 2026-04-29T19:15:05Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-04-29T19:15:14Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Refactors OpenClaw runtime/build handling (new rcf_patch helper, ws-proxy preload, token handling, legacy layout cleanup), consolidates agent config paths to a single dir, introduces branding/CLI renames, adds oclif list/share commands and runner, improves local inference/Ollama handling, and broad changes to shields/state, timeouts, validation, and many tests.

Changes

Cohort / File(s)	Summary
Docker + runtime image `Dockerfile`, `ws-proxy-fix.js`, `scripts/rcf_patch.py`	Replaced inline python patch with `scripts/rcf_patch.py`; include and chmod `ws-proxy-fix.js`; clear build-time gateway token in `openclaw.json`; flatten/remove legacy symlink layouts and adjust ownership/permissions for sandbox config files.
Installer / bootstrap script `scripts/install.sh`	Agent-aware branding and shim logic; vLLM install/start support; Hermes forward recovery watcher; generalized CLI shim/verification and clone/openshell selection changes.
Package metadata `package.json`	Added runtime dependency `@oclif/core` ^4.11.0.
Branding & CLI naming `src/lib/branding.ts`, `src/nemoclaw.ts`, `src/lib/build-context.ts`, `src/lib/command-registry.ts`, `src/lib/command-registry.test.ts`	New branding module and constants; replace hardcoded `nemoclaw` strings with branding-aware `CLI_NAME`/`CLI_DISPLAY_NAME`/`AGENT_PRODUCT_NAME`; add `brandedUsage()` and adjust help/recovery text.
Agent config shape migration `src/lib/agent-defs.ts`, `src/lib/agent-defs.test.ts`, `src/lib/agent-runtime.ts`, `src/lib/agent-runtime.test.ts`, `src/lib/agent-onboard.ts`, `src/lib/agent-onboard.test.ts`, `src/lib/skill-install.ts`, `src/lib/skill-install.test.ts`, `src/lib/sandbox-config.ts`, `src/lib/sandbox-state.ts`, `src/lib/sandbox-version.test.ts`	Consolidated `immutableDir`/`writableDir` into single `dir` across manifests, code, and tests; removed config mutation/rotate features; updated backup/restore and adjusted ownership/permission semantics.
Shields & locking logic `src/lib/shields.ts`, `src/lib/shields-timer.ts`	Reworked shields into three-mode model, removed permanent shields-down, enhanced unlock/lock verification to resolve agent config and sensitive files, and tightened post-op checks.
Onboarding & Ollama/model handling `src/lib/onboard.ts`, `src/lib/onboard-inference-probes.ts`, `src/lib/onboard-ollama-proxy.ts`, `src/lib/onboard-command.ts`	Branding-aware onboarding, removed `dangerously-skip-permissions`, added build-context validation/denylist, TLS cert-mismatch classification, agent drift handling, HTTP-streamed Ollama pulls, proxy health/unload, and probe retry wrapper for 429 throttling.
Local inference & diagnostics `src/lib/local-inference.ts`, `src/lib/local-inference.test.ts`	WSL-aware Ollama host discovery and caching, retries/backoff for Docker reachability with diagnostics, explicit curl timeouts, and exported helpers for host resolution and diagnostics.
Oclif commands & runner `src/lib/oclif-commands.ts`, `src/lib/oclif-runner.ts`, `src/lib/list-command.ts`, `src/lib/list-command-deps.ts`	Add programmatic oclif runner, `list` command, deps builder for list (live inference capture, active session counting), and command registry wiring.
Share command & deps `src/lib/share-command.ts`, `src/lib/share-command-deps.ts`	New `share` oclif command implementing `mount
Inventory / list refactor `src/lib/inventory-commands.ts`, `src/lib/inventory-commands.test.ts`	Add structured inventory types, `getSandboxInventory`, and `renderSandboxInventoryText`; refactor `list`/status to use these and include `agent` column and optional live inference.
Validation, timeouts & resolve helpers `src/lib/validation.ts`, `src/lib/validation.test.ts`, `src/lib/openshell-timeouts.ts`, `src/lib/openshell-timeouts.test.ts`, `src/lib/resolve-openshell.ts`, `src/lib/resolve-openshell.test.ts`	Add `tls_cert_mismatch` classification, centralized openshell timeout constants with tests, and env override for openshell binary resolution.
Credentials & deploy `src/lib/credentials.ts`, `src/lib/deploy.ts`, `src/lib/deploy.test.ts`	Allow `HF_TOKEN` and `HUGGING_FACE_HUB_TOKEN` as known credential keys and propagate them through DeployCredentials and build env lines.
NIM GPU detection `src/lib/nim.ts`, `src/lib/nim.test.ts`	Parse per-GPU memory and model names robustly; only set aggregated `name` when GPUs share the same model; add tests for edge cases.
Commands, registry & misc services `src/lib/command-registry.ts`, `src/lib/registry.ts`, `src/lib/services.ts`, `src/lib/policies.ts`	Add `share` subcommands; remove hidden sandbox config commands and `dangerouslySkipPermissions` persistence; unload Ollama models on stop/destroy; update banners to branding.
Build context & staged assets `src/lib/sandbox-build-context.ts`	Include `scripts/rcf_patch.py` and additional assets in staged build context; make `collectBuildContextStats` accept an include predicate.
Many tests updated multiple `src/lib/*.test.ts`	Widespread test updates to reflect branding, config-path shape, preload guards, NODE_OPTIONS checks, changed model IDs, and other behavioral changes.

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant CLI
  participant Oclif as Oclif Runner
  participant Deps as ListCommandDeps
  participant Openshell
  participant Renderer
  rect rgba(200,200,255,0.5)
  User->>CLI: run `cli list`
  CLI->>Oclif: runRegisteredOclifCommand("list", args)
  Oclif->>Deps: buildListCommandDeps()
  Deps->>Openshell: probe sandboxes / capture live inference
  Openshell-->>Deps: sandbox list + inference JSON
  Deps-->>Oclif: inventory result
  Oclif->>Renderer: renderSandboxInventoryText(inventory, liveInference)
  Renderer-->>User: formatted output
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I hopped into the build and stitched a patch,
A helper script to fix a tricky match.
I folded old paths into one neat nest,
Branded the CLI and gave commands their best.
Now sandboxes hum — time for carrot-snack! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 31.87% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title clearly summarizes the main change: making the Dockerfile's replaceConfigFile patch more robust by using a whitespace-tolerant regex approach for OpenClaw 2026.4.24, which directly addresses the core problem described in the objectives.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

Dockerfile (1)
152-217: Please gate this Dockerfile patch block with the recommended E2E suite.

Because this mutates bundled OpenClaw dist at image build time, run cloud-e2e, sandbox-survival-e2e, hermes-e2e, and rebuild-openclaw-e2e before merge.

As per coding guidelines, "Dockerfile: This file affects the sandbox container image. Layer ordering, permissions, and baked config changes are only testable with a real container build."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile` around lines 152 - 217, This Dockerfile introduces
runtime-altering patches to OpenClaw (see OC_DIST and patched symbols like
withStrictGuardedFetchMode → withTrustedEnvProxyGuardedFetchMode,
assertExplicitProxyAllowed, replaceConfigFile, and
DEFAULT_PREAUTH_HANDSHAKE_TIMEOUT_MS), so gate the PR with the E2E suites before
merge: add required CI status checks or a pre-merge pipeline that runs
cloud-e2e, sandbox-survival-e2e, hermes-e2e, and rebuild-openclaw-e2e and block
merging until they pass (ensure the check names match these suites and attach
them to this Dockerfile change/patch block).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@Dockerfile`:
- Around line 152-217: This Dockerfile introduces runtime-altering patches to
OpenClaw (see OC_DIST and patched symbols like withStrictGuardedFetchMode →
withTrustedEnvProxyGuardedFetchMode, assertExplicitProxyAllowed,
replaceConfigFile, and DEFAULT_PREAUTH_HANDSHAKE_TIMEOUT_MS), so gate the PR
with the E2E suites before merge: add required CI status checks or a pre-merge
pipeline that runs cloud-e2e, sandbox-survival-e2e, hermes-e2e, and
rebuild-openclaw-e2e and block merging until they pass (ensure the check names
match these suites and attach them to this Dockerfile change/patch block).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 331ef204-4746-4742-973c-b28b69576d11

📥 Commits

Reviewing files that changed from the base of the PR and between d753c18 and 06a58f1.

📒 Files selected for processing (1)

Dockerfile

Copilot

Pull request overview

Updates NemoClaw’s Docker build-time patching of bundled OpenClaw dist code so the replaceConfigFile patch is less sensitive to formatting differences in OpenClaw 2026.4.24, reducing rebuild failures when the compiled JS layout changes.

Changes:

Replace a brittle exact-string match for the tryWriteSingleTopLevelIncludeMutation → writeConfigFile(params.nextConfig) block with a regex-based substitution.
Preserve the existing sandbox behavior: catch EACCES when OPENSHELL_SANDBOX=1 and emit a warning instead of crashing.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-29T19:17:41Z

    rcf_file="$(grep -RIlE --include='*.js' 'async function replaceConfigFile\(params\)' "$OC_DIST" | head -n 1)"; \
    test -n "$rcf_file" || { echo "ERROR: replaceConfigFile function not found in OpenClaw dist" >&2; exit 1; }; \
-    python3 -c "import sys; p=sys.argv[1]; f=open(p); src=f.read(); f.close(); old='\tif (!await tryWriteSingleTopLevelIncludeMutation({\n\t\tsnapshot,\n\t\tnextConfig: params.nextConfig\n\t})) await writeConfigFile(params.nextConfig, {\n\t\tbaseSnapshot: snapshot,\n\t\t...writeOptions,\n\t\t...params.writeOptions\n\t});'; new='\ttry { if (!await tryWriteSingleTopLevelIncludeMutation({\n\t\tsnapshot,\n\t\tnextConfig: params.nextConfig\n\t})) await writeConfigFile(params.nextConfig, {\n\t\tbaseSnapshot: snapshot,\n\t\t...writeOptions,\n\t\t...params.writeOptions\n\t}); } catch(_rcfErr) { if (process.env.OPENSHELL_SANDBOX === \"1\" && _rcfErr.code === \"EACCES\") { console.error(\"[nemoclaw] Config is read-only in sandbox \\u2014 plugin metadata not persisted (plugins auto-load from extensions/)\"); } else { throw _rcfErr; } }'; assert old in src, 'tryWriteSingleTopLevelIncludeMutation/writeConfigFile pattern not found in replaceConfigFile'; f=open(p,'w'); f.write(src.replace(old,new,1)); f.close()" "$rcf_file"; \
+    python3 -c "import re,sys; p=sys.argv[1]; src=open(p).read(); pat=r'if \(!await tryWriteSingleTopLevelIncludeMutation\(\{\s*snapshot,\s*nextConfig: params\.nextConfig\s*\}\)\) await writeConfigFile\(params\.nextConfig, \{\s*baseSnapshot: snapshot,\s*\.\.\.writeOptions,\s*\.\.\.params\.writeOptions\s*\}\);'; repl='try { if (!await tryWriteSingleTopLevelIncludeMutation({\n\t\tsnapshot,\n\t\tnextConfig: params.nextConfig\n\t})) await writeConfigFile(params.nextConfig, {\n\t\tbaseSnapshot: snapshot,\n\t\t...writeOptions,\n\t\t...params.writeOptions\n\t}); } catch(_rcfErr) { if (process.env.OPENSHELL_SANDBOX === \"1\" && _rcfErr.code === \"EACCES\") { console.error(\"[nemoclaw] Config is read-only in sandbox \\u2014 plugin metadata not persisted (plugins auto-load from extensions/)\"); } else { throw _rcfErr; } }'; out,n=re.subn(pat,repl,src,count=1,flags=re.S); assert n==1, 'tryWriteSingleTopLevelIncludeMutation/writeConfigFile pattern not found in replaceConfigFile'; open(p,'w').write(out)" "$rcf_file"; \


The new regex pattern still hard-codes some literal spaces (e.g., if $ and , \{) and requires a trailing });, so it can still fail with minor formatting/minification drift (the same problem this PR is trying to solve). Make the pattern fully whitespace-tolerant around tokens like if, (, {, commas, and allow an optional semicolon (e.g., use if\s*\(, \(\s*\{, ,\s*\{, and \}$\s*;?).

Suggested change

python3 -c "import re,sys; p=sys.argv[1]; src=open(p).read(); pat=r'if $!await tryWriteSingleTopLevelIncludeMutation\(\{\s*snapshot,\s*nextConfig: params\.nextConfig\s*\}$\) await writeConfigFile$params\.nextConfig, \{\s*baseSnapshot: snapshot,\s*\.\.\.writeOptions,\s*\.\.\.params\.writeOptions\s*\}$;'; repl='try { if (!await tryWriteSingleTopLevelIncludeMutation({\n\t\tsnapshot,\n\t\tnextConfig: params.nextConfig\n\t})) await writeConfigFile(params.nextConfig, {\n\t\tbaseSnapshot: snapshot,\n\t\t...writeOptions,\n\t\t...params.writeOptions\n\t}); } catch(_rcfErr) { if (process.env.OPENSHELL_SANDBOX === \"1\" && _rcfErr.code === \"EACCES\") { console.error(\"[nemoclaw] Config is read-only in sandbox \\u2014 plugin metadata not persisted (plugins auto-load from extensions/)\"); } else { throw _rcfErr; } }'; out,n=re.subn(pat,repl,src,count=1,flags=re.S); assert n==1, 'tryWriteSingleTopLevelIncludeMutation/writeConfigFile pattern not found in replaceConfigFile'; open(p,'w').write(out)" "$rcf_file"; \

python3 -c "import re,sys; p=sys.argv[1]; src=open(p).read(); pat=r'if\s*$!await tryWriteSingleTopLevelIncludeMutation\(\s*\{\s*snapshot,\s*nextConfig: params\.nextConfig\s*\}\s*$\s*\)\s*await writeConfigFile$params\.nextConfig,\s*\{\s*baseSnapshot: snapshot,\s*\.\.\.writeOptions,\s*\.\.\.params\.writeOptions\s*\}$\s*;?'; repl='try { if (!await tryWriteSingleTopLevelIncludeMutation({\n\t\tsnapshot,\n\t\tnextConfig: params.nextConfig\n\t})) await writeConfigFile(params.nextConfig, {\n\t\tbaseSnapshot: snapshot,\n\t\t...writeOptions,\n\t\t...params.writeOptions\n\t}); } catch(_rcfErr) { if (process.env.OPENSHELL_SANDBOX === \"1\" && _rcfErr.code === \"EACCES\") { console.error(\"[nemoclaw] Config is read-only in sandbox \\u2014 plugin metadata not persisted (plugins auto-load from extensions/)\"); } else { throw _rcfErr; } }'; out,n=re.subn(pat,repl,src,count=1,flags=re.S); assert n==1, 'tryWriteSingleTopLevelIncludeMutation/writeConfigFile pattern not found in replaceConfigFile'; open(p,'w').write(out)" "$rcf_file"; \

robverbeek66 · 2026-04-29T19:33:12Z

Confirming this exact failure on a fresh nemoclaw onboard (Ubuntu 24.04, OpenShell 0.0.36, OpenClaw 2026.4.24 in base image). The build fails at Step 17/56 with the same writeConfigFile(params.nextConfig) pattern not found assertion. Reproduces with both OpenAI (gpt-5.4-mini) and local Ollama (qwen2.5:14b) provider selections — the failure is upstream of provider configuration. Looking forward to seeing this merged.

…ness Copilot review on NVIDIA#2696 flagged that even the regex-based inline python one-liner is still fragile and hard to maintain. Extract the patch logic to scripts/rcf_patch.py which uses a broad DOTALL regex anchored on the function-call names rather than exact whitespace, making it resilient to formatting changes across OpenClaw versions. Per Copilot review on PR NVIDIA#2696. Signed-off-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com>

BenediktSchackenberg · 2026-04-29T19:45:08Z

Per Copilot review — extracted the patch logic to scripts/rcf_patch.py instead of keeping it as an inline Python one-liner in the Dockerfile. The external script uses a broad DOTALL regex anchored on the function-call names (tryWriteSingleTopLevelIncludeMutation, writeConfigFile) rather than exact whitespace or argument ordering, so minor formatting drift across OpenClaw versions won't cause false misses.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/rcf_patch.py`:
- Around line 16-23: The regex is too broad and spacing-sensitive: change the
search to first extract only the replaceConfigFile function body (anchor on the
function declaration name replaceConfigFile) and then run the pattern against
that substring, and relax the "!await" match to allow whitespace (use
"!\\s*await" instead of "!await") so it matches variants with spaces; keep the
rest of the pattern targeting the tryWriteSingleTopLevelIncludeMutation and
writeConfigFile call sites to ensure only the intended block is patched.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c35d3a1a-de02-4523-ac54-a8d86bd88f81

📥 Commits

Reviewing files that changed from the base of the PR and between 06a58f1 and c59803c.

📒 Files selected for processing (2)

Dockerfile
scripts/rcf_patch.py

🚧 Files skipped from review as they are similar to previous changes (1)

Dockerfile

wscurran · 2026-04-30T14:57:17Z

✨ Thanks for submitting this PR that proposes an update to the Dockerfile patch for replaceConfigFile to use a whitespace-tolerant regex. This change aims to improve the robustness of the patch and make rebuilds more resilient to formatting drift.

wscurran · 2026-04-30T14:57:17Z

✨ Thanks for submitting this PR that proposes an update to the Dockerfile patch for replaceConfigFile to use a whitespace-tolerant regex. This change aims to improve the robustness of the patch and make rebuilds more resilient to formatting drift.

…aceConfigFile Two issues from review on PR NVIDIA#2696: 1. SCRIPT_DIR is not defined in the Docker build context so the previous call to python3 "${SCRIPT_DIR}/scripts/rcf_patch.py" would always fail. Fix: add COPY scripts/rcf_patch.py /usr/local/lib/nemoclaw/rcf_patch.py before the RUN block and call it via the absolute in-image path. 2. The regex in rcf_patch.py could still miss due to ! await spacing and was not scoped to the replaceConfigFile function body. Fix: scope the match to the function body (walk braces to find it), and make all token separators whitespace-tolerant (! \s* await, \(\s*\{, ,\s*\{, optional trailing semicolon). Per CodeRabbit/Copilot review on PR NVIDIA#2696. Signed-off-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com>

BenediktSchackenberg · 2026-05-01T07:19:18Z

Two fixes pushed:

Added COPY scripts/rcf_patch.py /usr/local/lib/nemoclaw/rcf_patch.py before the RUN block and updated the call to use the absolute in-image path — SCRIPT_DIR was never defined in the build context
Scoped the regex to the replaceConfigFile function body (walks braces to find it) and made all token separators whitespace-tolerant including ! await spacing and optional trailing semicolon

coderabbitai

🧹 Nitpick comments (1)

Dockerfile (1)
152-198: Please run the Dockerfile E2E matrix before merge.

Given this changes image-layer patching behavior, run: cloud-e2e, sandbox-survival-e2e, hermes-e2e, and rebuild-openclaw-e2e on this branch.

As per coding guidelines, "Dockerfile: This file affects the sandbox container image. Layer ordering, permissions, and baked config changes are only testable with a real container build."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile` around lines 152 - 198, This change alters Dockerfile image-layer
patching (patches touching OC_DIST, rcf_patch.py, replaceConfigFile,
OPENSHELL_SANDBOX behavior and install-safe-path/install-package-dir edits) and
must be validated by running the full Dockerfile E2E matrix before merging: run
cloud-e2e, sandbox-survival-e2e, hermes-e2e, and rebuild-openclaw-e2e against
this branch (build the container to exercise the sed/python patches and check
replaceConfigFile EACCES warning, install-safe-path/stat and install-package-dir
symlink behavior), capture logs/artifacts for any failures, and post the passing
results or failing test output to the PR for reviewer signoff.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@Dockerfile`:
- Around line 152-198: This change alters Dockerfile image-layer patching
(patches touching OC_DIST, rcf_patch.py, replaceConfigFile, OPENSHELL_SANDBOX
behavior and install-safe-path/install-package-dir edits) and must be validated
by running the full Dockerfile E2E matrix before merging: run cloud-e2e,
sandbox-survival-e2e, hermes-e2e, and rebuild-openclaw-e2e against this branch
(build the container to exercise the sed/python patches and check
replaceConfigFile EACCES warning, install-safe-path/stat and install-package-dir
symlink behavior), capture logs/artifacts for any failures, and post the passing
results or failing test output to the PR for reviewer signoff.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5779a2a5-88a1-490d-a48b-332e1515c7f5

📥 Commits

Reviewing files that changed from the base of the PR and between c59803c and b0edc0f.

📒 Files selected for processing (2)

Dockerfile
scripts/rcf_patch.py

🚧 Files skipped from review as they are similar to previous changes (1)

scripts/rcf_patch.py

coderabbitai

Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

src/lib/services.ts (1)

304-324: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Keep the banner within the box width.

The messaging line is longer than the current 52-character width once AGENT_PRODUCT_NAME is inserted, so the right border will drift and the ASCII box will break.

Either widen the box or compute the banner width from the longest rendered line.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lib/services.ts` around lines 304 - 324, The banner's fixed pad width
(padEnd(52)) in the services display causes the right border to misalign when
AGENT_PRODUCT_NAME makes the line longer; update the rendering to compute the
box width dynamically from the longest rendered line (use the strings built for
CLI_DISPLAY_NAME + " Services", the Public URL line when tunnelUrl exists, and
the Messaging line that concatenates AGENT_PRODUCT_NAME) and then pad each line
to that computed width (instead of hardcoded 52) before printing the top/bottom
borders and side bars; adjust the padEnd calls and border strings accordingly so
CLI_DISPLAY_NAME, AGENT_PRODUCT_NAME, and tunnelUrl are all referenced in the
width calculation and stay within the box.

src/lib/onboard.ts (2)

3658-3666: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard malformed CHAT_UI_URL in the second parse path.

If CHAT_UI_URL is malformed, the first parse is safely caught, but this second parse can still throw and abort sandbox creation unexpectedly.

Suggested patch

   if (process.env.CHAT_UI_URL && controlUiPort == null) {
-    const parsed = new URL(
-      process.env.CHAT_UI_URL.includes("://")
-        ? process.env.CHAT_UI_URL
-        : `http://${process.env.CHAT_UI_URL}`,
-    );
-    parsed.port = String(effectivePort);
-    chatUiUrl = parsed.toString().replace(/\/$/, "");
+    try {
+      const parsed = new URL(
+        process.env.CHAT_UI_URL.includes("://")
+          ? process.env.CHAT_UI_URL
+          : `http://${process.env.CHAT_UI_URL}`,
+      );
+      parsed.port = String(effectivePort);
+      chatUiUrl = parsed.toString().replace(/\/$/, "");
+    } catch {
+      console.warn("  Invalid CHAT_UI_URL; falling back to localhost dashboard URL.");
+      chatUiUrl = `http://127.0.0.1:${effectivePort}`;
+    }
   } else {
     chatUiUrl = `http://127.0.0.1:${effectivePort}`;
   }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 3658 - 3666, The second URL construction
using new URL(...) when process.env.CHAT_UI_URL lacks "://" can throw on
malformed values; update the block that sets parsed (the branch where
controlUiPort == null) to validate or wrap the parse in a try/catch (around new
URL and subsequent parsed.port assignment) and handle failures by falling back
to a safe default or skipping chatUiUrl assignment so sandbox creation doesn't
abort; reference the variables/idents involved: process.env.CHAT_UI_URL,
controlUiPort, parsed, effectivePort, and chatUiUrl.

1545-1545: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Broaden ARG detection to support valid Dockerfile syntax variants.

agentSupportsWebSearch() currently only matches ARG NEMOCLAW_WEB_SEARCH_ENABLED=.... A Dockerfile with ARG NEMOCLAW_WEB_SEARCH_ENABLED (no default) will be treated as unsupported even though the ARG is present.

Suggested patch

-      return /^\s*ARG\s+NEMOCLAW_WEB_SEARCH_ENABLED=/m.test(content);
+      return /^\s*ARG\s+NEMOCLAW_WEB_SEARCH_ENABLED(?:\s*=.*)?\s*$/m.test(content);

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` at line 1545, The function agentSupportsWebSearch()
currently only detects ARG lines that include an equals sign; update its regex
to also match the valid Dockerfile form without a default value. Replace the
current /^\s*ARG\s+NEMOCLAW_WEB_SEARCH_ENABLED=/m test with a pattern that
allows either an equals sign with an optional value or the end of the token (for
example matching /^\s*ARG\s+NEMOCLAW_WEB_SEARCH_ENABLED(?:\s*=.*)?/m or
/^\s*ARG\s+NEMOCLAW_WEB_SEARCH_ENABLED(?:\s*=|$)/m), ensuring whitespace is
tolerated; keep the change localized to agentSupportsWebSearch().

🧹 Nitpick comments (4)

src/lib/sandbox-state.ts (1)
559-566: Run the sandbox-state E2Es before merge.

Given these lifecycle-path changes, run snapshot-commands-e2e and rebuild-openclaw-e2e on this branch to validate backup/restore behavior end-to-end.

As per coding guidelines, src/lib/sandbox-state.ts: “E2E test recommendation: snapshot-commands-e2e — snapshot create/list/restore lifecycle; rebuild-openclaw-e2e — workspace state survives rebuild.”

Also applies to: 803-815
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/sandbox-state.ts` around lines 559 - 566, Run the end-to-end tests to
validate backup/restore behavior for the lifecycle-path changes: on this branch
execute the snapshot-commands-e2e (snapshot create/list/restore lifecycle) and
rebuild-openclaw-e2e (workspace state survives rebuild) suites and confirm they
pass; focus validation around the backup/restore flow exercised by
backupSandboxState (and the related code around lines ~803-815) to ensure
sandbox state backup and restore work correctly after your changes before
merging.
Dockerfile (1)
371-495: Please run the Dockerfile E2E matrix for this change set.

These image-layer and runtime-permission changes should be validated via cloud-e2e, sandbox-survival-e2e, hermes-e2e, and rebuild-openclaw-e2e before merge.

As per coding guidelines, Dockerfile: “E2E test recommendation: cloud-e2e, sandbox-survival-e2e, hermes-e2e, rebuild-openclaw-e2e.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile` around lines 371 - 495, The PR touches Dockerfile image-layer and
runtime-permission logic (the large RUN migration block and subsequent RUN
chown/chmod and sha256sum steps) but the reviewer requests you run the required
E2E matrices before merging; please execute cloud-e2e, sandbox-survival-e2e,
hermes-e2e, and rebuild-openclaw-e2e against this image change, capture and
attach the failing/passing job logs, and only merge if all pass (or document any
accepted failures with rationale), referencing the Dockerfile RUN migration
block, the chown/chmod lines, and the .config-hash creation so reviewers can
correlate test results to the exact changes.
src/nemoclaw.ts (1)
3835-3922: Run the sandbox lifecycle E2E suite before merge.

Given lifecycle-touching changes in snapshot/gateway probing and recovery paths, run the recommended targeted nightly jobs pre-merge to reduce regression risk:
sandbox-survival-e2e, sandbox-operations-e2e, skip-permissions-e2e.

As per coding guidelines, for src/nemoclaw.ts: “E2E test recommendation: sandbox-survival-e2e, sandbox-operations-e2e, skip-permissions-e2e.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/nemoclaw.ts` around lines 3835 - 3922, The PR touches snapshot/gateway
probing and recovery (functions probeGatewayRunning and sandboxSnapshot) and per
the repo guidelines you must run the targeted E2E suites before merging: run
sandbox-survival-e2e, sandbox-operations-e2e, and skip-permissions-e2e
(nightly/targeted) and attach their green CI artifacts; if any test fails,
reproduce locally against your branch and fix regressions in
probeGatewayRunning/sandboxSnapshot before re-running tests. Ensure the test run
names and results are included in the PR description or CI status so reviewers
can confirm the required E2E suites passed.
src/lib/onboard.ts (1)
7685-8359: Run the targeted E2E matrix for this onboarding change set before merge.

Given the breadth of behavior changes in core onboarding, run cloud-e2e, sandbox-operations-e2e, and rebuild-openclaw-e2e on this branch to reduce regression risk around resume/reuse/rebuild paths.

As per coding guidelines: src/lib/onboard.ts is core onboarding logic and recommends these specific E2E jobs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 7685 - 8359, This change touches core
onboarding logic in function onboard (src/lib/onboard.ts), so before merging run
the targeted E2E matrix to catch regressions: execute cloud-e2e,
sandbox-operations-e2e, and rebuild-openclaw-e2e on this branch and verify
resume/reuse/rebuild paths (resume, gateway/sandbox reuse, inference/provider
selection, sandbox create/rebuild, and agent/openclaw setup) all pass; if any
fail, reproduce locally against the failing step in onboard (e.g.,
provider_selection, inference, sandbox, gateway, openclaw) and fix the failing
logic before merging.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/install.sh`:
- Around line 320-362: The watcher is being launched and later treated as a
success even when curl is unavailable, causing false success and restart thrash;
guard watcher creation and the success check by ensuring curl is present: before
writing watcher_script and spawning the detached Node watcher (the block that
defines watcher_script and the node -e spawn), add a check that command_exists
curl is true and only then start the watcher, and remove or change the later
branch that returns 0 when curl is missing (the block that reads watcher_pid and
returns 0 if !command_exists curl && [[ -n "$watcher_pid" ]] ...) so we do not
treat a running watcher as healthy without curl. Use the symbols watcher_script,
tick/healthy (in the embedded script), the node -e spawn invocation, and
watcher_pid to locate and update the logic.

In `@src/lib/build-context.ts`:
- Around line 76-83: The TLS recovery hint prints a hard-coded gateway name
("nemoclaw") which mixes branding; in the failure.kind === "tls_cert_mismatch"
branch replace the literal "openshell gateway trust -g nemoclaw" string with one
that uses the CLI_NAME constant (e.g. use string interpolation so the command
reads `${CLI_NAME} gateway trust -g ${CLI_NAME}` or similar) so the trust hint
and the resume hint both use CLI_NAME consistently.

In `@src/lib/local-inference.ts`:
- Around line 79-80: getResolvedOllamaHost() currently returns
_resolvedOllamaHost or OLLAMA_LOCALHOST without attempting to resolve the actual
reachable host; update it to call findReachableOllamaHost() when
_resolvedOllamaHost is unset so WSL/docker cases (e.g., host.docker.internal)
are discovered before falling back to OLLAMA_LOCALHOST; specifically, inside
getResolvedOllamaHost() call findReachableOllamaHost(), await or handle its
result as appropriate, assign that value to _resolvedOllamaHost, and then return
_resolvedOllamaHost or OLLAMA_LOCALHOST as a final fallback.

In `@src/lib/nim.ts`:
- Around line 96-113: The perGpuMB field currently uses parsed[0].memoryMB which
misreports heterogeneous hosts; update the logic in the block that builds the
return object (using parsed: ParsedGpu[], allSameName, canRunNimWithMemory) to
only set perGpuMB when every GPU has the same memory (compute a new
allSameMemory check like parsed.every(p => p.memoryMB === parsed[0].memoryMB))
or else omit/leave perGpuMB undefined; leave totalMemoryMB and nimCapable as-is.

In `@src/lib/onboard-ollama-proxy.ts`:
- Around line 497-548: unloadOllamaModels currently fires the /api/ps GET and
spawns /api/generate POSTs but returns immediately; change unloadOllamaModels to
return a Promise that resolves after the /api/ps response is fully parsed and
all unload POSTs have completed (or rejected on fatal error). Implement by
wrapping the initial http.get call in a Promise, collecting per-unload request
Promises (resolve on 'finish' or 'end' and reject on 'error' or timeout) for
each unloadReq created in the loop, and resolve the main Promise only after
Promise.all of those unload Promises settles; keep the best-effort semantics by
handling individual unload errors but ensure the returned Promise only settles
when all unload attempts have finished so callers can await it.

In `@src/lib/onboard.ts`:
- Around line 4582-4589: The vllm-local ambiguity is being collapsed too early
by providerNameToOptionKey (or the name === "vllm-local" mapping) because it
maps "vllm-local" to "vllm" whenever opts.hasNimContainer is falsy, which
prevents the downstream ambiguity branch from running during non-interactive
recovery; update the mapping logic (providerNameToOptionKey or the initial name
normalization) to preserve the "vllm-local" sentinel when recovery is
non-interactive or when ambiguity needs to be resolved (e.g., check
opts.recoveryNonInteractive or equivalent) so the later code that inspects
opts.hasNimContainer and the ambiguity branch can run, and only collapse
"vllm-local" to "vllm" when it is safe (explicit hasNimContainer true/false in
interactive flows) — adjust callers referencing name === "vllm-local" and
opts.hasNimContainer accordingly to handle the preserved sentinel.

In `@src/lib/sandbox-build-context.ts`:
- Around line 78-80: The optimized build context is missing scripts/rcf_patch.py
caused by the whitelist used when creating the optimized root-scripts; update
the whitelist logic in sandbox-build-context.ts (the code that copies from
sourceBlueprintDir to stagedBlueprintDir using fs.cpSync and path.join) to
include "scripts/rcf_patch.py" so the optimized build context contains that
file; locate the whitelist/allowedFiles array or function that filters
root-scripts and add "scripts/rcf_patch.py" (or broaden the scripts/* inclusion)
so the Dockerfile COPY for scripts/rcf_patch.py will find the file during
optimized builds.

In `@src/nemoclaw.ts`:
- Around line 3852-3855: The error message printed when the Docker container
probe fails is misleadingly saying "Failed to query live sandbox state from
OpenShell"; update the messaging in the probe failure branches (the check that
calls probeGatewayRunning() around the if block and the similar branch at lines
handling the Docker probe near 3917-3920) to clearly indicate a Docker
probe/container health check failure (e.g., mention "Docker probe" or "container
probe" and the probe function name probeGatewayRunning) so operators know the
failure originated from the Docker/container probe rather than an OpenShell
query; modify both branches to use consistent, explicit text and include any
available probe error details if present.
- Around line 3838-3845: The gateway probe uses dockerInspect in
probeGatewayRunning and lacks a timeout; modify the dockerInspect invocation to
pass the shared OPENSHELL_PROBE_TIMEOUT_MS as the operation timeout (alongside
the existing ignoreError and suppressOutput options) so the call fails fast when
Docker is unresponsive; update the dockerInspect call in probeGatewayRunning to
include the timeout option and keep the existing return logic based on
result.status and result.stdout.

---

Outside diff comments:
In `@src/lib/onboard.ts`:
- Around line 3658-3666: The second URL construction using new URL(...) when
process.env.CHAT_UI_URL lacks "://" can throw on malformed values; update the
block that sets parsed (the branch where controlUiPort == null) to validate or
wrap the parse in a try/catch (around new URL and subsequent parsed.port
assignment) and handle failures by falling back to a safe default or skipping
chatUiUrl assignment so sandbox creation doesn't abort; reference the
variables/idents involved: process.env.CHAT_UI_URL, controlUiPort, parsed,
effectivePort, and chatUiUrl.
- Line 1545: The function agentSupportsWebSearch() currently only detects ARG
lines that include an equals sign; update its regex to also match the valid
Dockerfile form without a default value. Replace the current
/^\s*ARG\s+NEMOCLAW_WEB_SEARCH_ENABLED=/m test with a pattern that allows either
an equals sign with an optional value or the end of the token (for example
matching /^\s*ARG\s+NEMOCLAW_WEB_SEARCH_ENABLED(?:\s*=.*)?/m or
/^\s*ARG\s+NEMOCLAW_WEB_SEARCH_ENABLED(?:\s*=|$)/m), ensuring whitespace is
tolerated; keep the change localized to agentSupportsWebSearch().

In `@src/lib/services.ts`:
- Around line 304-324: The banner's fixed pad width (padEnd(52)) in the services
display causes the right border to misalign when AGENT_PRODUCT_NAME makes the
line longer; update the rendering to compute the box width dynamically from the
longest rendered line (use the strings built for CLI_DISPLAY_NAME + " Services",
the Public URL line when tunnelUrl exists, and the Messaging line that
concatenates AGENT_PRODUCT_NAME) and then pad each line to that computed width
(instead of hardcoded 52) before printing the top/bottom borders and side bars;
adjust the padEnd calls and border strings accordingly so CLI_DISPLAY_NAME,
AGENT_PRODUCT_NAME, and tunnelUrl are all referenced in the width calculation
and stay within the box.

---

Nitpick comments:
In `@Dockerfile`:
- Around line 371-495: The PR touches Dockerfile image-layer and
runtime-permission logic (the large RUN migration block and subsequent RUN
chown/chmod and sha256sum steps) but the reviewer requests you run the required
E2E matrices before merging; please execute cloud-e2e, sandbox-survival-e2e,
hermes-e2e, and rebuild-openclaw-e2e against this image change, capture and
attach the failing/passing job logs, and only merge if all pass (or document any
accepted failures with rationale), referencing the Dockerfile RUN migration
block, the chown/chmod lines, and the .config-hash creation so reviewers can
correlate test results to the exact changes.

In `@src/lib/onboard.ts`:
- Around line 7685-8359: This change touches core onboarding logic in function
onboard (src/lib/onboard.ts), so before merging run the targeted E2E matrix to
catch regressions: execute cloud-e2e, sandbox-operations-e2e, and
rebuild-openclaw-e2e on this branch and verify resume/reuse/rebuild paths
(resume, gateway/sandbox reuse, inference/provider selection, sandbox
create/rebuild, and agent/openclaw setup) all pass; if any fail, reproduce
locally against the failing step in onboard (e.g., provider_selection,
inference, sandbox, gateway, openclaw) and fix the failing logic before merging.

In `@src/lib/sandbox-state.ts`:
- Around line 559-566: Run the end-to-end tests to validate backup/restore
behavior for the lifecycle-path changes: on this branch execute the
snapshot-commands-e2e (snapshot create/list/restore lifecycle) and
rebuild-openclaw-e2e (workspace state survives rebuild) suites and confirm they
pass; focus validation around the backup/restore flow exercised by
backupSandboxState (and the related code around lines ~803-815) to ensure
sandbox state backup and restore work correctly after your changes before
merging.

In `@src/nemoclaw.ts`:
- Around line 3835-3922: The PR touches snapshot/gateway probing and recovery
(functions probeGatewayRunning and sandboxSnapshot) and per the repo guidelines
you must run the targeted E2E suites before merging: run sandbox-survival-e2e,
sandbox-operations-e2e, and skip-permissions-e2e (nightly/targeted) and attach
their green CI artifacts; if any test fails, reproduce locally against your
branch and fix regressions in probeGatewayRunning/sandboxSnapshot before
re-running tests. Ensure the test run names and results are included in the PR
description or CI status so reviewers can confirm the required E2E suites
passed.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8508b574-b2f5-4515-88b1-320ecf0cb880

📥 Commits

Reviewing files that changed from the base of the PR and between b0edc0f and 9f38794.

⛔ Files ignored due to path filters (1)

package-lock.json is excluded by !**/package-lock.json

📒 Files selected for processing (53)

Dockerfile
package.json
scripts/install.sh
src/lib/agent-defs.test.ts
src/lib/agent-defs.ts
src/lib/agent-onboard.test.ts
src/lib/agent-onboard.ts
src/lib/agent-runtime.test.ts
src/lib/agent-runtime.ts
src/lib/branding.ts
src/lib/build-context.ts
src/lib/command-registry.test.ts
src/lib/command-registry.ts
src/lib/credentials.ts
src/lib/deploy.test.ts
src/lib/deploy.ts
src/lib/inference-config.test.ts
src/lib/inference-config.ts
src/lib/inventory-commands.test.ts
src/lib/inventory-commands.ts
src/lib/list-command-deps.ts
src/lib/list-command.ts
src/lib/local-inference.test.ts
src/lib/local-inference.ts
src/lib/nim.test.ts
src/lib/nim.ts
src/lib/oclif-commands.ts
src/lib/oclif-runner.ts
src/lib/onboard-command.test.ts
src/lib/onboard-command.ts
src/lib/onboard-inference-probes.ts
src/lib/onboard-ollama-proxy.ts
src/lib/onboard.ts
src/lib/openshell-timeouts.test.ts
src/lib/openshell-timeouts.ts
src/lib/policies.ts
src/lib/registry.ts
src/lib/resolve-openshell.test.ts
src/lib/resolve-openshell.ts
src/lib/sandbox-build-context.ts
src/lib/sandbox-config.ts
src/lib/sandbox-state.ts
src/lib/sandbox-version.test.ts
src/lib/services.ts
src/lib/share-command-deps.ts
src/lib/share-command.ts
src/lib/shields-timer.ts
src/lib/shields.ts
src/lib/skill-install.test.ts
src/lib/skill-install.ts
src/lib/validation.test.ts
src/lib/validation.ts
src/nemoclaw.ts

💤 Files with no reviewable changes (3)

src/lib/inference-config.test.ts
src/lib/inference-config.ts
src/lib/registry.ts

✅ Files skipped from review due to trivial changes (5)

package.json
src/lib/deploy.test.ts
src/lib/openshell-timeouts.ts
src/lib/sandbox-version.test.ts
src/lib/deploy.ts

coderabbitai · 2026-05-01T09:03:07Z

+  if (failure.kind === "tls_cert_mismatch") {
+    console.error(
+      "  Hint: TLS certificate mismatch — the gateway certificate changed since the CLI last trusted it.",
+    );
+    console.error("  Fix:  openshell gateway trust -g nemoclaw");
+    console.error(`  Then: ${CLI_NAME} onboard --resume`);
+    return;
+  }


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Keep the TLS recovery hint branding-consistent.

This branch already uses CLI_NAME for the resume step, but the trust command still hard-codes nemoclaw. That produces a mixed recovery flow for branded runs.

Suggested fix

- console.error(" Fix: openshell gateway trust -g nemoclaw"); + console.error(` Fix: openshell gateway trust -g ${CLI_NAME}`);

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if (failure.kind === "tls_cert_mismatch") {

console.error(

" Hint: TLS certificate mismatch — the gateway certificate changed since the CLI last trusted it.",

);

console.error(" Fix: openshell gateway trust -g nemoclaw");

console.error(` Then: ${CLI_NAME} onboard --resume`);

return;

}

if (failure.kind === "tls_cert_mismatch") {

console.error(

" Hint: TLS certificate mismatch — the gateway certificate changed since the CLI last trusted it.",

);

console.error(` Fix: openshell gateway trust -g ${CLI_NAME}`);

console.error(` Then: ${CLI_NAME} onboard --resume`);

return;

}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/lib/build-context.ts` around lines 76 - 83, The TLS recovery hint prints a hard-coded gateway name ("nemoclaw") which mixes branding; in the failure.kind === "tls_cert_mismatch" branch replace the literal "openshell gateway trust -g nemoclaw" string with one that uses the CLI_NAME constant (e.g. use string interpolation so the command reads `${CLI_NAME} gateway trust -g ${CLI_NAME}` or similar) so the trust hint and the resume hint both use CLI_NAME consistently.

coderabbitai · 2026-05-01T09:03:07Z

+export function getResolvedOllamaHost(): string {
+  return _resolvedOllamaHost ?? OLLAMA_LOCALHOST;


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Resolve the Ollama host before falling back to localhost.

getResolvedOllamaHost() never calls findReachableOllamaHost(), so all of the new WSL-aware callers still default to 127.0.0.1 unless some unrelated path happened to seed the cache first. On WSL setups where Ollama is only reachable via host.docker.internal, validation, model discovery, warmup, and probe flows will still fail against localhost and the new fallback never engages.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/lib/local-inference.ts` around lines 79 - 80, getResolvedOllamaHost() currently returns _resolvedOllamaHost or OLLAMA_LOCALHOST without attempting to resolve the actual reachable host; update it to call findReachableOllamaHost() when _resolvedOllamaHost is unset so WSL/docker cases (e.g., host.docker.internal) are discovered before falling back to OLLAMA_LOCALHOST; specifically, inside getResolvedOllamaHost() call findReachableOllamaHost(), await or handle its result as appropriate, assign that value to _resolvedOllamaHost, and then return _resolvedOllamaHost or OLLAMA_LOCALHOST as a final fallback.

coderabbitai · 2026-05-01T09:03:07Z

+      if (parsed.length > 0) {
+        const totalMemoryMB = parsed.reduce(
+          (sum: number, p: ParsedGpu) => sum + p.memoryMB,
+          0,
+        );
+        const firstName = parsed[0].name;
+        // Only surface a single name when every GPU reports the same model;
+        // a mixed-GPU host would otherwise be misreported as `Nx <firstName>`.
+        const allSameName =
+          !!firstName && parsed.every((p: ParsedGpu) => p.name === firstName);
        return {
          type: "nvidia",
-          count: perGpuMB.length,
+          ...(allSameName ? { name: firstName } : {}),
+          count: parsed.length,
          totalMemoryMB,
-          perGpuMB: perGpuMB[0],
+          perGpuMB: parsed[0].memoryMB,
          nimCapable: canRunNimWithMemory(totalMemoryMB),
        };


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Avoid using the first GPU’s memory on heterogeneous hosts.

Omitting name for mixed-model machines prevents one kind of misreporting, but perGpuMB still comes from parsed[0], so mixed-memory hosts can surface the wrong per-GPU size.

Consider deriving perGpuMB from the parsed set or leaving it unset when the GPUs don’t match.

Suggested fix

- perGpuMB: parsed[0].memoryMB, + perGpuMB: allSameName ? parsed[0].memoryMB : Math.floor(totalMemoryMB / parsed.length),

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/lib/nim.ts` around lines 96 - 113, The perGpuMB field currently uses parsed[0].memoryMB which misreports heterogeneous hosts; update the logic in the block that builds the return object (using parsed: ParsedGpu[], allSameName, canRunNimWithMemory) to only set perGpuMB when every GPU has the same memory (compute a new allSameMemory check like parsed.every(p => p.memoryMB === parsed[0].memoryMB)) or else omit/leave perGpuMB undefined; leave totalMemoryMB and nimCapable as-is.

coderabbitai · 2026-05-01T09:03:07Z

+function unloadOllamaModels() {
+  try {
+    const req = http.get(
+      {
+        hostname: "localhost",
+        port: OLLAMA_PORT,
+        path: "/api/ps",
+        timeout: 3000,
+      },
+      (res) => {
+        let data = "";
+        res.on("data", (chunk) => {
+          data += chunk;
+        });
+        res.on("end", () => {
+          if (res.statusCode !== 200) return;
+          try {
+            const parsed = JSON.parse(data);
+            const models = parsed.models || [];
+            for (const entry of models) {
+              if (!entry.name) continue;
+              const unloadReq = http.request(
+                {
+                  hostname: "localhost",
+                  port: OLLAMA_PORT,
+                  path: "/api/generate",
+                  method: "POST",
+                  timeout: 3000,
+                  headers: { "Content-Type": "application/json" },
+                },
+                () => {
+                  /* ignore response */
+                },
+              );
+              unloadReq.on("error", () => {
+                /* best-effort */
+              });
+              unloadReq.write(JSON.stringify({ model: entry.name, keep_alive: 0 }));
+              unloadReq.end();
+            }
+          } catch {
+            /* best-effort */
+          }
+        });
+      },
+    );
+    req.on("error", () => {
+      /* best-effort */
+    });
+  } catch {
+    /* best-effort */
+  }


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Wait for the unload requests to finish before returning.

unloadOllamaModels() dispatches the /api/ps request and then returns immediately. Any caller that continues tearing services down can exit or stop Ollama before the /api/generate unload POSTs are sent, so this new GPU cleanup becomes race-prone and often a no-op. Please return a promise that settles after the list request completes and all unload requests finish, then await it from the stop path.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard-ollama-proxy.ts` around lines 497 - 548, unloadOllamaModels currently fires the /api/ps GET and spawns /api/generate POSTs but returns immediately; change unloadOllamaModels to return a Promise that resolves after the /api/ps response is fully parsed and all unload POSTs have completed (or rejected on fatal error). Implement by wrapping the initial http.get call in a Promise, collecting per-unload request Promises (resolve on 'finish' or 'end' and reject on 'error' or timeout) for each unloadReq created in the loop, and resolve the main Promise only after Promise.all of those unload Promises settles; keep the best-effort semantics by handling individual unload errors but ensure the returned Promise only settles when all unload attempts have finished so callers can await it.

coderabbitai · 2026-05-01T09:03:07Z

+function applyStateDirLockMode(sandboxName: string, configDir: string, owner: string): void {
+  for (const dirName of HIGH_RISK_STATE_DIRS) {
+    const dirPath = `${configDir}/${dirName}`;
+    try {
+      kubectlExec(sandboxName, ["chown", "-R", owner, dirPath]);
+    } catch {
+      // Directory may not exist for this agent — silently skip
+    }
+    try {
+      kubectlExec(sandboxName, ["chmod", "755", dirPath]);
+      kubectlExec(sandboxName, ["chmod", "-R", "go-w", dirPath]);
+    } catch {
+      // Silently skip
+    }
+  }
+
+  // Multi-agent OpenClaw workspaces are named workspace-<agent>. They are
+  // discovered dynamically because they are configured by openclaw.json.
+  try {
+    kubectlExec(sandboxName, [
+      "sh",
+      "-c",
+      `
+set -u
+config_dir="$1"
+owner="$2"
+for dir in "$config_dir"/workspace-*; do
+  [ -d "$dir" ] || continue
+  chown -R "$owner" "$dir" 2>/dev/null || true
+  chmod 755 "$dir" 2>/dev/null || true
+  chmod -R go-w "$dir" 2>/dev/null || true
+done
+`,
+      "sh",
+      configDir,
+      owner,
+    ]);
+  } catch {
+    // Best effort; verification below catches the primary config lock.
+  }
+}


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Verify the high-risk state directories instead of silently skipping failures.

applyStateDirLockMode() swallows every kubectlExec failure, and neither caller checks whether the directories that actually exist ended up with the requested owner/mode. If one of these operations fails, shields up can still persist a locked state while skills, hooks, workspace, or other NC-2227-05 paths remain writable. Please bubble up failures for existing directories, or add a post-check before saving state.

coderabbitai · 2026-05-01T09:03:07Z

+function probeGatewayRunning(): boolean {
+  const container = `openshell-cluster-${NEMOCLAW_GATEWAY_NAME}`;
+  const result = dockerInspect(
+    ["--type", "container", "--format", "{{.State.Running}}", container],
+    { ignoreError: true, suppressOutput: true },
+  );
+  return result.status === 0 && String(result.stdout || "").trim() === "true";
+}


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "Locate dockerInspect wrapper and confirm supported options (timeout, suppressOutput)." rg -n -C4 '\bdockerInspect\b' src lib echo echo "Show callsites using dockerInspect to compare option shapes." rg -n -C2 '\bdockerInspect\s*\(' src

Repository: NVIDIA/NemoClaw

Length of output: 4032

🏁 Script executed:

cat -n src/lib/docker/run.ts | head -100

Repository: NVIDIA/NemoClaw

Length of output: 871

🏁 Script executed:

rg -n 'OPENSHELL_PROBE_TIMEOUT|timeout.*MS|TIMEOUT_MS' src lib

Repository: NVIDIA/NemoClaw

Length of output: 4738

🏁 Script executed:

rg -n 'type DockerRunOptions|interface DockerRunOptions' src lib

Repository: NVIDIA/NemoClaw

Length of output: 797

🏁 Script executed:

rg -n 'export.*function run|export.*const run' src/lib/runner --max-count 5

Repository: NVIDIA/NemoClaw

Length of output: 114

🏁 Script executed:

cat -n src/lib/runner.ts | head -100

Repository: NVIDIA/NemoClaw

Length of output: 3782

🏁 Script executed:

cat -n src/lib/runner.ts | sed -n '85,200p'

Repository: NVIDIA/NemoClaw

Length of output: 5146

🏁 Script executed:

rg -n 'probeGatewayRunning' src/nemoclaw.ts

Repository: NVIDIA/NemoClaw

Length of output: 188

🏁 Script executed:

cat -n src/nemoclaw.ts | sed -n '3850,3925p'

Repository: NVIDIA/NemoClaw

Length of output: 3943

Add a timeout to the gateway probe.

Line 3840 introduces a blocking dockerInspect call without an explicit timeout. If Docker is unhealthy, snapshot create/restore commands can hang indefinitely instead of failing fast. The constant OPENSHELL_PROBE_TIMEOUT_MS is already imported and widely used in the codebase for similar probes.

Suggested patch

function probeGatewayRunning(): boolean { const container = `openshell-cluster-${NEMOCLAW_GATEWAY_NAME}`; const result = dockerInspect( ["--type", "container", "--format", "{{.State.Running}}", container], - { ignoreError: true, suppressOutput: true }, + { ignoreError: true, suppressOutput: true, timeout: OPENSHELL_PROBE_TIMEOUT_MS }, ); return result.status === 0 && String(result.stdout || "").trim() === "true"; }

Run recommended E2E suites before merge: sandbox-survival-e2e, sandbox-operations-e2e.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

function probeGatewayRunning(): boolean {

const container = `openshell-cluster-${NEMOCLAW_GATEWAY_NAME}`;

const result = dockerInspect(

["--type", "container", "--format", "{{.State.Running}}", container],

{ ignoreError: true, suppressOutput: true },

);

return result.status === 0 && String(result.stdout || "").trim() === "true";

}

function probeGatewayRunning(): boolean {

const container = `openshell-cluster-${NEMOCLAW_GATEWAY_NAME}`;

const result = dockerInspect(

["--type", "container", "--format", "{{.State.Running}}", container],

{ ignoreError: true, suppressOutput: true, timeout: OPENSHELL_PROBE_TIMEOUT_MS },

);

return result.status === 0 && String(result.stdout || "").trim() === "true";

}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/nemoclaw.ts` around lines 3838 - 3845, The gateway probe uses dockerInspect in probeGatewayRunning and lacks a timeout; modify the dockerInspect invocation to pass the shared OPENSHELL_PROBE_TIMEOUT_MS as the operation timeout (alongside the existing ignoreError and suppressOutput options) so the call fails fast when Docker is unresponsive; update the dockerInspect call in probeGatewayRunning to include the timeout option and keep the existing return logic based on result.status and result.stdout.

coderabbitai · 2026-05-01T09:03:08Z

+      if (!probeGatewayRunning()) {
        console.error("  Failed to query live sandbox state from OpenShell.");
        process.exit(1);
      }


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify snapshot failure messaging for Docker-probe failures.

Lines 3853 and 3918 say OpenShell query failed, but this branch is triggered by the Docker container probe. The current message can misdirect troubleshooting.

Also applies to: 3917-3920

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/nemoclaw.ts` around lines 3852 - 3855, The error message printed when the Docker container probe fails is misleadingly saying "Failed to query live sandbox state from OpenShell"; update the messaging in the probe failure branches (the check that calls probeGatewayRunning() around the if block and the similar branch at lines handling the Docker probe near 3917-3920) to clearly indicate a Docker probe/container health check failure (e.g., mention "Docker probe" or "container probe" and the probe function name probeGatewayRunning) so operators know the failure originated from the Docker/container probe rather than an OpenShell query; modify both branches to use consistent, explicit text and include any available probe error details if present.

CodeRabbit review on NVIDIA#2696 flagged that the optimized build context (sandbox-build-context.ts) whitelists individual scripts but was missing scripts/rcf_patch.py. Since the Dockerfile COPYs it before the patching RUN step, optimized nemoclaw <name> rebuild runs would fail with 'file not found in build context'. Signed-off-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com>

BenediktSchackenberg · 2026-05-01T19:32:37Z

Fixed the critical one: added scripts/rcf_patch.py to the optimized build context in sandbox-build-context.ts — without it, nemoclaw <name> rebuild would fail with 'file not found in build context' on optimized builds.

The other 9 comments are on files outside this PR's diff (upstream changes pulled in during rebase). Scoping this PR to only the Dockerfile/rcf_patch.py change.

…ness Copilot review on NVIDIA#2696 flagged that even the regex-based inline python one-liner is still fragile and hard to maintain. Extract the patch logic to scripts/rcf_patch.py which uses a broad DOTALL regex anchored on the function-call names rather than exact whitespace, making it resilient to formatting changes across OpenClaw versions. Per Copilot review on PR NVIDIA#2696. Signed-off-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com>

…aceConfigFile Two issues from review on PR NVIDIA#2696: 1. SCRIPT_DIR is not defined in the Docker build context so the previous call to python3 "${SCRIPT_DIR}/scripts/rcf_patch.py" would always fail. Fix: add COPY scripts/rcf_patch.py /usr/local/lib/nemoclaw/rcf_patch.py before the RUN block and call it via the absolute in-image path. 2. The regex in rcf_patch.py could still miss due to ! await spacing and was not scoped to the replaceConfigFile function body. Fix: scope the match to the function body (walk braces to find it), and make all token separators whitespace-tolerant (! \s* await, \(\s*\{, ,\s*\{, optional trailing semicolon). Per CodeRabbit/Copilot review on PR NVIDIA#2696. Signed-off-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com>

CodeRabbit review on NVIDIA#2696 flagged that the optimized build context (sandbox-build-context.ts) whitelists individual scripts but was missing scripts/rcf_patch.py. Since the Dockerfile COPYs it before the patching RUN step, optimized nemoclaw <name> rebuild runs would fail with 'file not found in build context'. Signed-off-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com>

ericksoa · 2026-05-02T01:14:35Z

Thanks for driving this. I checked the helper against the current openclaw@2026.4.24 dist shape, and this does address the reported failure path.

I found one follow-on robustness gap: the helper is whitespace-tolerant, but not actually property-order-tolerant for the tryWriteSingleTopLevelIncludeMutation({ snapshot, nextConfig }) object. I do not think that needs to block this PR once the current CI/DCO items are fixed, because the pinned 2026.4.24 package uses the order this helper matches.

I opened #2875 to track that as future hardening, and you are very welcome to pick it up if you agree with the feedback.

ericksoa · 2026-05-02T01:36:52Z

@BenediktSchackenberg I opened #2876 to replay this change onto a fresh branch and get it through the current gates. I kept the implementation based on your #2696 work, included you as co-author in the commit, and credited the original PR in the replacement PR body.

Thanks again for digging into the OpenClaw 2026.4.24 patch failure. This should preserve that fix while resolving the DCO/hadolint blocker and folding in the small property-order hardening follow-up.

Replay the intent of #2696 by moving the OpenClaw replaceConfigFile patch into a staged helper script, then keep the Dockerfile patch path available in optimized rebuild contexts. Harden the helper scanner so braces in quoted strings or comments cannot terminate replaceConfigFile body detection early, and add focused regression coverage for property-order drift, comment/string braces, and fail-close behavior. Fixes #2689. Closes #2875. Co-authored-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com> Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

…aceConfigFile Two issues from review on PR NVIDIA#2696: 1. SCRIPT_DIR is not defined in the Docker build context so the previous call to python3 "${SCRIPT_DIR}/scripts/rcf_patch.py" would always fail. Fix: add COPY scripts/rcf_patch.py /usr/local/lib/nemoclaw/rcf_patch.py before the RUN block and call it via the absolute in-image path. 2. The regex in rcf_patch.py could still miss due to ! await spacing and was not scoped to the replaceConfigFile function body. Fix: scope the match to the function body (walk braces to find it), and make all token separators whitespace-tolerant (! \s* await, \(\s*\{, ,\s*\{, optional trailing semicolon). Per CodeRabbit/Copilot review on PR NVIDIA#2696. Signed-off-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com>

Stress-test against 5 historical NemoClaw cases surfaced two real spec gaps that produced incorrect classifications: 1. Evidence-required filter was too strict for partial-fix detection. The LLM couldn't cite "PR diff line" when the gap is what the PR did NOT touch. Extended evidence to three citation shapes: - DIRECT (original): cite specific PR line + issue symptom - BY-OMISSION: cite PR scope + issue symptom showing same bug class in different instances PR did not touch (catches partial-fix) - FOLLOW-ON: cite the symbol/file PR introduced + issue's request to harden the same code (catches "PR introduced X, now harden X") 2. ADJACENT_FIX wording assumed the issue was already broken and the PR resolves it. Reframed to also include "PR opens a clear follow-on path on the same code the PR just touched" — covers the common pattern where an issue requests hardening of code a PR just introduced. Test cases that drove these patches: - #2700 → #2762 (partial-fix, was failing — now passes via by-omission) - #2696 → #2875 (follow-on hardening, was ambiguous — now explicit)

wscurran · 2026-05-08T18:34:31Z

Closing as superseded by #2876. That PR replayed this Dockerfile replaceConfigFile hardening onto a fresh branch, included @BenediktSchackenberg as co-author, resolved the gate blockers, and merged on 2026-05-02. Thanks again for digging into the OpenClaw 2026.4.24 patch failure.

Copilot AI review requested due to automatic review settings April 29, 2026 19:15

Copilot started reviewing on behalf of BenediktSchackenberg April 29, 2026 19:15 View session

coderabbitai Bot reviewed Apr 29, 2026

View reviewed changes

Copilot AI reviewed Apr 29, 2026

View reviewed changes

coderabbitai Bot reviewed Apr 29, 2026

View reviewed changes

Comment thread scripts/rcf_patch.py

wscurran added Docker dependencies Pull requests that update a dependency file labels Apr 30, 2026

BenediktSchackenberg force-pushed the fix/dockerfile-openclaw-version-2689 branch from 9413f24 to b0edc0f Compare May 1, 2026 07:18

coderabbitai Bot reviewed May 1, 2026

View reviewed changes

BenediktSchackenberg force-pushed the fix/dockerfile-openclaw-version-2689 branch from 9836c9d to 6427855 Compare May 1, 2026 20:28

ericksoa mentioned this pull request May 2, 2026

Harden rcf_patch.py against property-order drift in OpenClaw replaceConfigFile #2875

Closed

ericksoa mentioned this pull request May 2, 2026

fix: harden Dockerfile replaceConfigFile patch #2876

Merged

BenediktSchackenberg force-pushed the fix/dockerfile-openclaw-version-2689 branch from 2c65003 to df82a7c Compare May 2, 2026 14:21

wscurran closed this May 8, 2026

wscurran added area: packaging Packages, images, registries, installers, or distribution chore Build, CI, dependency, or tooling maintenance platform: container Affects Docker, containerd, Podman, or images and removed Docker labels Jun 3, 2026

		export function getResolvedOllamaHost(): string {
		return _resolvedOllamaHost ?? OLLAMA_LOCALHOST;

Conversation

BenediktSchackenberg commented Apr 29, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

Summary by CodeRabbit

Uh oh!

copy-pr-bot Bot commented Apr 29, 2026

Uh oh!

coderabbitai Bot commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

robverbeek66 commented Apr 29, 2026

Uh oh!

BenediktSchackenberg commented Apr 29, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

wscurran commented Apr 30, 2026

Uh oh!

wscurran commented Apr 30, 2026

Uh oh!

BenediktSchackenberg commented May 1, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot May 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 1, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot May 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 1, 2026

Choose a reason for hiding this comment

Uh oh!

BenediktSchackenberg commented May 1, 2026

Uh oh!

ericksoa commented May 2, 2026

Uh oh!

ericksoa commented May 2, 2026

Uh oh!

wscurran commented May 8, 2026

BenediktSchackenberg commented Apr 29, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 29, 2026 •

edited

Loading