Skip to content

fix: make Dockerfile replaceConfigFile patch robust for 2026.4.24#2696

Closed
BenediktSchackenberg wants to merge 1 commit into
NVIDIA:mainfrom
BenediktSchackenberg:fix/dockerfile-openclaw-version-2689
Closed

fix: make Dockerfile replaceConfigFile patch robust for 2026.4.24#2696
BenediktSchackenberg wants to merge 1 commit into
NVIDIA:mainfrom
BenediktSchackenberg:fix/dockerfile-openclaw-version-2689

Conversation

@BenediktSchackenberg

@BenediktSchackenberg BenediktSchackenberg commented Apr 29, 2026

Copy link
Copy Markdown
Contributor

Summary

  • update the Dockerfile patch for replaceConfigFile to use a whitespace-tolerant regex around the tryWriteSingleTopLevelIncludeMutation + writeConfigFile(params.nextConfig) block
  • keep the same EACCES sandbox handling behavior, but avoid brittle exact-string matching

Why

The bundled OpenClaw 2026.4.24 code path is correct, but minor formatting differences in compiled dist files can make strict literal matching fail with:

writeConfigFile(params.nextConfig) pattern not found

This change keeps the patch aligned with the 2026.4.24 API while making rebuilds resilient to formatting drift.

Summary by CodeRabbit

  • New Features

    • Agent-branded CLI names; new list (JSON) and share (mount/unmount/status) commands; structured sandbox inventory shows agent and optional live inference.
  • Bug Fixes

    • Stronger onboarding validations (agent binary/health, Dockerfile --from), better local/provider probing with retries/diagnostics, improved GPU reporting and TLS mismatch detection.
  • Chores

    • Build/runtime adjustments: safer image patching, preload fix, startup gateway token, unified sandbox config layout, added timeouts and Hugging Face credential passthrough.

Copilot AI review requested due to automatic review settings April 29, 2026 19:15
@copy-pr-bot

copy-pr-bot Bot commented Apr 29, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Apr 29, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Refactors OpenClaw runtime/build handling (new rcf_patch helper, ws-proxy preload, token handling, legacy layout cleanup), consolidates agent config paths to a single dir, introduces branding/CLI renames, adds oclif list/share commands and runner, improves local inference/Ollama handling, and broad changes to shields/state, timeouts, validation, and many tests.

Changes

Cohort / File(s) Summary
Docker + runtime image
Dockerfile, ws-proxy-fix.js, scripts/rcf_patch.py
Replaced inline python patch with scripts/rcf_patch.py; include and chmod ws-proxy-fix.js; clear build-time gateway token in openclaw.json; flatten/remove legacy symlink layouts and adjust ownership/permissions for sandbox config files.
Installer / bootstrap script
scripts/install.sh
Agent-aware branding and shim logic; vLLM install/start support; Hermes forward recovery watcher; generalized CLI shim/verification and clone/openshell selection changes.
Package metadata
package.json
Added runtime dependency @oclif/core ^4.11.0.
Branding & CLI naming
src/lib/branding.ts, src/nemoclaw.ts, src/lib/build-context.ts, src/lib/command-registry.ts, src/lib/command-registry.test.ts
New branding module and constants; replace hardcoded nemoclaw strings with branding-aware CLI_NAME/CLI_DISPLAY_NAME/AGENT_PRODUCT_NAME; add brandedUsage() and adjust help/recovery text.
Agent config shape migration
src/lib/agent-defs.ts, src/lib/agent-defs.test.ts, src/lib/agent-runtime.ts, src/lib/agent-runtime.test.ts, src/lib/agent-onboard.ts, src/lib/agent-onboard.test.ts, src/lib/skill-install.ts, src/lib/skill-install.test.ts, src/lib/sandbox-config.ts, src/lib/sandbox-state.ts, src/lib/sandbox-version.test.ts
Consolidated immutableDir/writableDir into single dir across manifests, code, and tests; removed config mutation/rotate features; updated backup/restore and adjusted ownership/permission semantics.
Shields & locking logic
src/lib/shields.ts, src/lib/shields-timer.ts
Reworked shields into three-mode model, removed permanent shields-down, enhanced unlock/lock verification to resolve agent config and sensitive files, and tightened post-op checks.
Onboarding & Ollama/model handling
src/lib/onboard.ts, src/lib/onboard-inference-probes.ts, src/lib/onboard-ollama-proxy.ts, src/lib/onboard-command.ts
Branding-aware onboarding, removed dangerously-skip-permissions, added build-context validation/denylist, TLS cert-mismatch classification, agent drift handling, HTTP-streamed Ollama pulls, proxy health/unload, and probe retry wrapper for 429 throttling.
Local inference & diagnostics
src/lib/local-inference.ts, src/lib/local-inference.test.ts
WSL-aware Ollama host discovery and caching, retries/backoff for Docker reachability with diagnostics, explicit curl timeouts, and exported helpers for host resolution and diagnostics.
Oclif commands & runner
src/lib/oclif-commands.ts, src/lib/oclif-runner.ts, src/lib/list-command.ts, src/lib/list-command-deps.ts
Add programmatic oclif runner, list command, deps builder for list (live inference capture, active session counting), and command registry wiring.
Share command & deps
src/lib/share-command.ts, src/lib/share-command-deps.ts
New share oclif command implementing `mount
Inventory / list refactor
src/lib/inventory-commands.ts, src/lib/inventory-commands.test.ts
Add structured inventory types, getSandboxInventory, and renderSandboxInventoryText; refactor list/status to use these and include agent column and optional live inference.
Validation, timeouts & resolve helpers
src/lib/validation.ts, src/lib/validation.test.ts, src/lib/openshell-timeouts.ts, src/lib/openshell-timeouts.test.ts, src/lib/resolve-openshell.ts, src/lib/resolve-openshell.test.ts
Add tls_cert_mismatch classification, centralized openshell timeout constants with tests, and env override for openshell binary resolution.
Credentials & deploy
src/lib/credentials.ts, src/lib/deploy.ts, src/lib/deploy.test.ts
Allow HF_TOKEN and HUGGING_FACE_HUB_TOKEN as known credential keys and propagate them through DeployCredentials and build env lines.
NIM GPU detection
src/lib/nim.ts, src/lib/nim.test.ts
Parse per-GPU memory and model names robustly; only set aggregated name when GPUs share the same model; add tests for edge cases.
Commands, registry & misc services
src/lib/command-registry.ts, src/lib/registry.ts, src/lib/services.ts, src/lib/policies.ts
Add share subcommands; remove hidden sandbox config commands and dangerouslySkipPermissions persistence; unload Ollama models on stop/destroy; update banners to branding.
Build context & staged assets
src/lib/sandbox-build-context.ts
Include scripts/rcf_patch.py and additional assets in staged build context; make collectBuildContextStats accept an include predicate.
Many tests updated
multiple src/lib/*.test.ts
Widespread test updates to reflect branding, config-path shape, preload guards, NODE_OPTIONS checks, changed model IDs, and other behavioral changes.

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant CLI
  participant Oclif as Oclif Runner
  participant Deps as ListCommandDeps
  participant Openshell
  participant Renderer
  rect rgba(200,200,255,0.5)
  User->>CLI: run `cli list`
  CLI->>Oclif: runRegisteredOclifCommand("list", args)
  Oclif->>Deps: buildListCommandDeps()
  Deps->>Openshell: probe sandboxes / capture live inference
  Openshell-->>Deps: sandbox list + inference JSON
  Deps-->>Oclif: inventory result
  Oclif->>Renderer: renderSandboxInventoryText(inventory, liveInference)
  Renderer-->>User: formatted output
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I hopped into the build and stitched a patch,
A helper script to fix a tricky match.
I folded old paths into one neat nest,
Branded the CLI and gave commands their best.
Now sandboxes hum — time for carrot-snack! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 31.87% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title clearly summarizes the main change: making the Dockerfile's replaceConfigFile patch more robust by using a whitespace-tolerant regex approach for OpenClaw 2026.4.24, which directly addresses the core problem described in the objectives.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
Dockerfile (1)

152-217: Please gate this Dockerfile patch block with the recommended E2E suite.

Because this mutates bundled OpenClaw dist at image build time, run cloud-e2e, sandbox-survival-e2e, hermes-e2e, and rebuild-openclaw-e2e before merge.

As per coding guidelines, "Dockerfile: This file affects the sandbox container image. Layer ordering, permissions, and baked config changes are only testable with a real container build."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile` around lines 152 - 217, This Dockerfile introduces
runtime-altering patches to OpenClaw (see OC_DIST and patched symbols like
withStrictGuardedFetchMode → withTrustedEnvProxyGuardedFetchMode,
assertExplicitProxyAllowed, replaceConfigFile, and
DEFAULT_PREAUTH_HANDSHAKE_TIMEOUT_MS), so gate the PR with the E2E suites before
merge: add required CI status checks or a pre-merge pipeline that runs
cloud-e2e, sandbox-survival-e2e, hermes-e2e, and rebuild-openclaw-e2e and block
merging until they pass (ensure the check names match these suites and attach
them to this Dockerfile change/patch block).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@Dockerfile`:
- Around line 152-217: This Dockerfile introduces runtime-altering patches to
OpenClaw (see OC_DIST and patched symbols like withStrictGuardedFetchMode →
withTrustedEnvProxyGuardedFetchMode, assertExplicitProxyAllowed,
replaceConfigFile, and DEFAULT_PREAUTH_HANDSHAKE_TIMEOUT_MS), so gate the PR
with the E2E suites before merge: add required CI status checks or a pre-merge
pipeline that runs cloud-e2e, sandbox-survival-e2e, hermes-e2e, and
rebuild-openclaw-e2e and block merging until they pass (ensure the check names
match these suites and attach them to this Dockerfile change/patch block).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 331ef204-4746-4742-973c-b28b69576d11

📥 Commits

Reviewing files that changed from the base of the PR and between d753c18 and 06a58f1.

📒 Files selected for processing (1)
  • Dockerfile

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates NemoClaw’s Docker build-time patching of bundled OpenClaw dist code so the replaceConfigFile patch is less sensitive to formatting differences in OpenClaw 2026.4.24, reducing rebuild failures when the compiled JS layout changes.

Changes:

  • Replace a brittle exact-string match for the tryWriteSingleTopLevelIncludeMutationwriteConfigFile(params.nextConfig) block with a regex-based substitution.
  • Preserve the existing sandbox behavior: catch EACCES when OPENSHELL_SANDBOX=1 and emit a warning instead of crashing.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread Dockerfile Outdated
rcf_file="$(grep -RIlE --include='*.js' 'async function replaceConfigFile\(params\)' "$OC_DIST" | head -n 1)"; \
test -n "$rcf_file" || { echo "ERROR: replaceConfigFile function not found in OpenClaw dist" >&2; exit 1; }; \
python3 -c "import sys; p=sys.argv[1]; f=open(p); src=f.read(); f.close(); old='\tif (!await tryWriteSingleTopLevelIncludeMutation({\n\t\tsnapshot,\n\t\tnextConfig: params.nextConfig\n\t})) await writeConfigFile(params.nextConfig, {\n\t\tbaseSnapshot: snapshot,\n\t\t...writeOptions,\n\t\t...params.writeOptions\n\t});'; new='\ttry { if (!await tryWriteSingleTopLevelIncludeMutation({\n\t\tsnapshot,\n\t\tnextConfig: params.nextConfig\n\t})) await writeConfigFile(params.nextConfig, {\n\t\tbaseSnapshot: snapshot,\n\t\t...writeOptions,\n\t\t...params.writeOptions\n\t}); } catch(_rcfErr) { if (process.env.OPENSHELL_SANDBOX === \"1\" && _rcfErr.code === \"EACCES\") { console.error(\"[nemoclaw] Config is read-only in sandbox \\u2014 plugin metadata not persisted (plugins auto-load from extensions/)\"); } else { throw _rcfErr; } }'; assert old in src, 'tryWriteSingleTopLevelIncludeMutation/writeConfigFile pattern not found in replaceConfigFile'; f=open(p,'w'); f.write(src.replace(old,new,1)); f.close()" "$rcf_file"; \
python3 -c "import re,sys; p=sys.argv[1]; src=open(p).read(); pat=r'if \(!await tryWriteSingleTopLevelIncludeMutation\(\{\s*snapshot,\s*nextConfig: params\.nextConfig\s*\}\)\) await writeConfigFile\(params\.nextConfig, \{\s*baseSnapshot: snapshot,\s*\.\.\.writeOptions,\s*\.\.\.params\.writeOptions\s*\}\);'; repl='try { if (!await tryWriteSingleTopLevelIncludeMutation({\n\t\tsnapshot,\n\t\tnextConfig: params.nextConfig\n\t})) await writeConfigFile(params.nextConfig, {\n\t\tbaseSnapshot: snapshot,\n\t\t...writeOptions,\n\t\t...params.writeOptions\n\t}); } catch(_rcfErr) { if (process.env.OPENSHELL_SANDBOX === \"1\" && _rcfErr.code === \"EACCES\") { console.error(\"[nemoclaw] Config is read-only in sandbox \\u2014 plugin metadata not persisted (plugins auto-load from extensions/)\"); } else { throw _rcfErr; } }'; out,n=re.subn(pat,repl,src,count=1,flags=re.S); assert n==1, 'tryWriteSingleTopLevelIncludeMutation/writeConfigFile pattern not found in replaceConfigFile'; open(p,'w').write(out)" "$rcf_file"; \

Copilot AI Apr 29, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new regex pattern still hard-codes some literal spaces (e.g., if \( and , \{) and requires a trailing });, so it can still fail with minor formatting/minification drift (the same problem this PR is trying to solve). Make the pattern fully whitespace-tolerant around tokens like if, (, {, commas, and allow an optional semicolon (e.g., use if\s*\(, \(\s*\{, ,\s*\{, and \}\)\s*;?).

Suggested change
python3 -c "import re,sys; p=sys.argv[1]; src=open(p).read(); pat=r'if \(!await tryWriteSingleTopLevelIncludeMutation\(\{\s*snapshot,\s*nextConfig: params\.nextConfig\s*\}\)\) await writeConfigFile\(params\.nextConfig, \{\s*baseSnapshot: snapshot,\s*\.\.\.writeOptions,\s*\.\.\.params\.writeOptions\s*\}\);'; repl='try { if (!await tryWriteSingleTopLevelIncludeMutation({\n\t\tsnapshot,\n\t\tnextConfig: params.nextConfig\n\t})) await writeConfigFile(params.nextConfig, {\n\t\tbaseSnapshot: snapshot,\n\t\t...writeOptions,\n\t\t...params.writeOptions\n\t}); } catch(_rcfErr) { if (process.env.OPENSHELL_SANDBOX === \"1\" && _rcfErr.code === \"EACCES\") { console.error(\"[nemoclaw] Config is read-only in sandbox \\u2014 plugin metadata not persisted (plugins auto-load from extensions/)\"); } else { throw _rcfErr; } }'; out,n=re.subn(pat,repl,src,count=1,flags=re.S); assert n==1, 'tryWriteSingleTopLevelIncludeMutation/writeConfigFile pattern not found in replaceConfigFile'; open(p,'w').write(out)" "$rcf_file"; \
python3 -c "import re,sys; p=sys.argv[1]; src=open(p).read(); pat=r'if\s*\(!await tryWriteSingleTopLevelIncludeMutation\(\s*\{\s*snapshot,\s*nextConfig: params\.nextConfig\s*\}\s*\)\s*\)\s*await writeConfigFile\(params\.nextConfig,\s*\{\s*baseSnapshot: snapshot,\s*\.\.\.writeOptions,\s*\.\.\.params\.writeOptions\s*\}\)\s*;?'; repl='try { if (!await tryWriteSingleTopLevelIncludeMutation({\n\t\tsnapshot,\n\t\tnextConfig: params.nextConfig\n\t})) await writeConfigFile(params.nextConfig, {\n\t\tbaseSnapshot: snapshot,\n\t\t...writeOptions,\n\t\t...params.writeOptions\n\t}); } catch(_rcfErr) { if (process.env.OPENSHELL_SANDBOX === \"1\" && _rcfErr.code === \"EACCES\") { console.error(\"[nemoclaw] Config is read-only in sandbox \\u2014 plugin metadata not persisted (plugins auto-load from extensions/)\"); } else { throw _rcfErr; } }'; out,n=re.subn(pat,repl,src,count=1,flags=re.S); assert n==1, 'tryWriteSingleTopLevelIncludeMutation/writeConfigFile pattern not found in replaceConfigFile'; open(p,'w').write(out)" "$rcf_file"; \

Copilot uses AI. Check for mistakes.
@robverbeek66

Copy link
Copy Markdown

Confirming this exact failure on a fresh nemoclaw onboard (Ubuntu 24.04, OpenShell 0.0.36, OpenClaw 2026.4.24 in base image). The build fails at Step 17/56 with the same writeConfigFile(params.nextConfig) pattern not found assertion. Reproduces with both OpenAI (gpt-5.4-mini) and local Ollama (qwen2.5:14b) provider selections — the failure is upstream of provider configuration. Looking forward to seeing this merged.

BenediktSchackenberg added a commit to BenediktSchackenberg/NemoClaw that referenced this pull request Apr 29, 2026
…ness

Copilot review on NVIDIA#2696 flagged that even the regex-based inline python
one-liner is still fragile and hard to maintain. Extract the patch logic
to scripts/rcf_patch.py which uses a broad DOTALL regex anchored on the
function-call names rather than exact whitespace, making it resilient to
formatting changes across OpenClaw versions.

Per Copilot review on PR NVIDIA#2696.

Signed-off-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com>
@BenediktSchackenberg

Copy link
Copy Markdown
Contributor Author

Per Copilot review — extracted the patch logic to scripts/rcf_patch.py instead of keeping it as an inline Python one-liner in the Dockerfile. The external script uses a broad DOTALL regex anchored on the function-call names (tryWriteSingleTopLevelIncludeMutation, writeConfigFile) rather than exact whitespace or argument ordering, so minor formatting drift across OpenClaw versions won't cause false misses.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/rcf_patch.py`:
- Around line 16-23: The regex is too broad and spacing-sensitive: change the
search to first extract only the replaceConfigFile function body (anchor on the
function declaration name replaceConfigFile) and then run the pattern against
that substring, and relax the "!await" match to allow whitespace (use
"!\\s*await" instead of "!await") so it matches variants with spaces; keep the
rest of the pattern targeting the tryWriteSingleTopLevelIncludeMutation and
writeConfigFile call sites to ensure only the intended block is patched.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c35d3a1a-de02-4523-ac54-a8d86bd88f81

📥 Commits

Reviewing files that changed from the base of the PR and between 06a58f1 and c59803c.

📒 Files selected for processing (2)
  • Dockerfile
  • scripts/rcf_patch.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • Dockerfile

Comment thread scripts/rcf_patch.py
@wscurran wscurran added Docker dependencies Pull requests that update a dependency file labels Apr 30, 2026
@wscurran

Copy link
Copy Markdown
Contributor

✨ Thanks for submitting this PR that proposes an update to the Dockerfile patch for replaceConfigFile to use a whitespace-tolerant regex. This change aims to improve the robustness of the patch and make rebuilds more resilient to formatting drift.

1 similar comment
@wscurran

Copy link
Copy Markdown
Contributor

✨ Thanks for submitting this PR that proposes an update to the Dockerfile patch for replaceConfigFile to use a whitespace-tolerant regex. This change aims to improve the robustness of the patch and make rebuilds more resilient to formatting drift.

BenediktSchackenberg added a commit to BenediktSchackenberg/NemoClaw that referenced this pull request May 1, 2026
…aceConfigFile

Two issues from review on PR NVIDIA#2696:

1. SCRIPT_DIR is not defined in the Docker build context so the previous
   call to python3 "${SCRIPT_DIR}/scripts/rcf_patch.py" would always fail.
   Fix: add COPY scripts/rcf_patch.py /usr/local/lib/nemoclaw/rcf_patch.py
   before the RUN block and call it via the absolute in-image path.

2. The regex in rcf_patch.py could still miss due to ! await spacing and
   was not scoped to the replaceConfigFile function body.
   Fix: scope the match to the function body (walk braces to find it),
   and make all token separators whitespace-tolerant (! \s* await, \(\s*\{,
   ,\s*\{, optional trailing semicolon).

Per CodeRabbit/Copilot review on PR NVIDIA#2696.

Signed-off-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com>
@BenediktSchackenberg BenediktSchackenberg force-pushed the fix/dockerfile-openclaw-version-2689 branch from 9413f24 to b0edc0f Compare May 1, 2026 07:18
@BenediktSchackenberg

Copy link
Copy Markdown
Contributor Author

Two fixes pushed:

  1. Added COPY scripts/rcf_patch.py /usr/local/lib/nemoclaw/rcf_patch.py before the RUN block and updated the call to use the absolute in-image path — SCRIPT_DIR was never defined in the build context
  2. Scoped the regex to the replaceConfigFile function body (walks braces to find it) and made all token separators whitespace-tolerant including ! await spacing and optional trailing semicolon

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
Dockerfile (1)

152-198: Please run the Dockerfile E2E matrix before merge.

Given this changes image-layer patching behavior, run: cloud-e2e, sandbox-survival-e2e, hermes-e2e, and rebuild-openclaw-e2e on this branch.

As per coding guidelines, "Dockerfile: This file affects the sandbox container image. Layer ordering, permissions, and baked config changes are only testable with a real container build."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile` around lines 152 - 198, This change alters Dockerfile image-layer
patching (patches touching OC_DIST, rcf_patch.py, replaceConfigFile,
OPENSHELL_SANDBOX behavior and install-safe-path/install-package-dir edits) and
must be validated by running the full Dockerfile E2E matrix before merging: run
cloud-e2e, sandbox-survival-e2e, hermes-e2e, and rebuild-openclaw-e2e against
this branch (build the container to exercise the sed/python patches and check
replaceConfigFile EACCES warning, install-safe-path/stat and install-package-dir
symlink behavior), capture logs/artifacts for any failures, and post the passing
results or failing test output to the PR for reviewer signoff.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@Dockerfile`:
- Around line 152-198: This change alters Dockerfile image-layer patching
(patches touching OC_DIST, rcf_patch.py, replaceConfigFile, OPENSHELL_SANDBOX
behavior and install-safe-path/install-package-dir edits) and must be validated
by running the full Dockerfile E2E matrix before merging: run cloud-e2e,
sandbox-survival-e2e, hermes-e2e, and rebuild-openclaw-e2e against this branch
(build the container to exercise the sed/python patches and check
replaceConfigFile EACCES warning, install-safe-path/stat and install-package-dir
symlink behavior), capture logs/artifacts for any failures, and post the passing
results or failing test output to the PR for reviewer signoff.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5779a2a5-88a1-490d-a48b-332e1515c7f5

📥 Commits

Reviewing files that changed from the base of the PR and between c59803c and b0edc0f.

📒 Files selected for processing (2)
  • Dockerfile
  • scripts/rcf_patch.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • scripts/rcf_patch.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/lib/services.ts (1)

304-324: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Keep the banner within the box width.

The messaging line is longer than the current 52-character width once AGENT_PRODUCT_NAME is inserted, so the right border will drift and the ASCII box will break.

Either widen the box or compute the banner width from the longest rendered line.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/services.ts` around lines 304 - 324, The banner's fixed pad width
(padEnd(52)) in the services display causes the right border to misalign when
AGENT_PRODUCT_NAME makes the line longer; update the rendering to compute the
box width dynamically from the longest rendered line (use the strings built for
CLI_DISPLAY_NAME + " Services", the Public URL line when tunnelUrl exists, and
the Messaging line that concatenates AGENT_PRODUCT_NAME) and then pad each line
to that computed width (instead of hardcoded 52) before printing the top/bottom
borders and side bars; adjust the padEnd calls and border strings accordingly so
CLI_DISPLAY_NAME, AGENT_PRODUCT_NAME, and tunnelUrl are all referenced in the
width calculation and stay within the box.
src/lib/onboard.ts (2)

3658-3666: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard malformed CHAT_UI_URL in the second parse path.

If CHAT_UI_URL is malformed, the first parse is safely caught, but this second parse can still throw and abort sandbox creation unexpectedly.

Suggested patch
   if (process.env.CHAT_UI_URL && controlUiPort == null) {
-    const parsed = new URL(
-      process.env.CHAT_UI_URL.includes("://")
-        ? process.env.CHAT_UI_URL
-        : `http://${process.env.CHAT_UI_URL}`,
-    );
-    parsed.port = String(effectivePort);
-    chatUiUrl = parsed.toString().replace(/\/$/, "");
+    try {
+      const parsed = new URL(
+        process.env.CHAT_UI_URL.includes("://")
+          ? process.env.CHAT_UI_URL
+          : `http://${process.env.CHAT_UI_URL}`,
+      );
+      parsed.port = String(effectivePort);
+      chatUiUrl = parsed.toString().replace(/\/$/, "");
+    } catch {
+      console.warn("  Invalid CHAT_UI_URL; falling back to localhost dashboard URL.");
+      chatUiUrl = `http://127.0.0.1:${effectivePort}`;
+    }
   } else {
     chatUiUrl = `http://127.0.0.1:${effectivePort}`;
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 3658 - 3666, The second URL construction
using new URL(...) when process.env.CHAT_UI_URL lacks "://" can throw on
malformed values; update the block that sets parsed (the branch where
controlUiPort == null) to validate or wrap the parse in a try/catch (around new
URL and subsequent parsed.port assignment) and handle failures by falling back
to a safe default or skipping chatUiUrl assignment so sandbox creation doesn't
abort; reference the variables/idents involved: process.env.CHAT_UI_URL,
controlUiPort, parsed, effectivePort, and chatUiUrl.

1545-1545: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Broaden ARG detection to support valid Dockerfile syntax variants.

agentSupportsWebSearch() currently only matches ARG NEMOCLAW_WEB_SEARCH_ENABLED=.... A Dockerfile with ARG NEMOCLAW_WEB_SEARCH_ENABLED (no default) will be treated as unsupported even though the ARG is present.

Suggested patch
-      return /^\s*ARG\s+NEMOCLAW_WEB_SEARCH_ENABLED=/m.test(content);
+      return /^\s*ARG\s+NEMOCLAW_WEB_SEARCH_ENABLED(?:\s*=.*)?\s*$/m.test(content);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` at line 1545, The function agentSupportsWebSearch()
currently only detects ARG lines that include an equals sign; update its regex
to also match the valid Dockerfile form without a default value. Replace the
current /^\s*ARG\s+NEMOCLAW_WEB_SEARCH_ENABLED=/m test with a pattern that
allows either an equals sign with an optional value or the end of the token (for
example matching /^\s*ARG\s+NEMOCLAW_WEB_SEARCH_ENABLED(?:\s*=.*)?/m or
/^\s*ARG\s+NEMOCLAW_WEB_SEARCH_ENABLED(?:\s*=|$)/m), ensuring whitespace is
tolerated; keep the change localized to agentSupportsWebSearch().
🧹 Nitpick comments (4)
src/lib/sandbox-state.ts (1)

559-566: Run the sandbox-state E2Es before merge.

Given these lifecycle-path changes, run snapshot-commands-e2e and rebuild-openclaw-e2e on this branch to validate backup/restore behavior end-to-end.

As per coding guidelines, src/lib/sandbox-state.ts: “E2E test recommendation: snapshot-commands-e2e — snapshot create/list/restore lifecycle; rebuild-openclaw-e2e — workspace state survives rebuild.”

Also applies to: 803-815

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/sandbox-state.ts` around lines 559 - 566, Run the end-to-end tests to
validate backup/restore behavior for the lifecycle-path changes: on this branch
execute the snapshot-commands-e2e (snapshot create/list/restore lifecycle) and
rebuild-openclaw-e2e (workspace state survives rebuild) suites and confirm they
pass; focus validation around the backup/restore flow exercised by
backupSandboxState (and the related code around lines ~803-815) to ensure
sandbox state backup and restore work correctly after your changes before
merging.
Dockerfile (1)

371-495: Please run the Dockerfile E2E matrix for this change set.

These image-layer and runtime-permission changes should be validated via cloud-e2e, sandbox-survival-e2e, hermes-e2e, and rebuild-openclaw-e2e before merge.

As per coding guidelines, Dockerfile: “E2E test recommendation: cloud-e2e, sandbox-survival-e2e, hermes-e2e, rebuild-openclaw-e2e.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile` around lines 371 - 495, The PR touches Dockerfile image-layer and
runtime-permission logic (the large RUN migration block and subsequent RUN
chown/chmod and sha256sum steps) but the reviewer requests you run the required
E2E matrices before merging; please execute cloud-e2e, sandbox-survival-e2e,
hermes-e2e, and rebuild-openclaw-e2e against this image change, capture and
attach the failing/passing job logs, and only merge if all pass (or document any
accepted failures with rationale), referencing the Dockerfile RUN migration
block, the chown/chmod lines, and the .config-hash creation so reviewers can
correlate test results to the exact changes.
src/nemoclaw.ts (1)

3835-3922: Run the sandbox lifecycle E2E suite before merge.

Given lifecycle-touching changes in snapshot/gateway probing and recovery paths, run the recommended targeted nightly jobs pre-merge to reduce regression risk:
sandbox-survival-e2e, sandbox-operations-e2e, skip-permissions-e2e.

As per coding guidelines, for src/nemoclaw.ts: “E2E test recommendation: sandbox-survival-e2e, sandbox-operations-e2e, skip-permissions-e2e.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/nemoclaw.ts` around lines 3835 - 3922, The PR touches snapshot/gateway
probing and recovery (functions probeGatewayRunning and sandboxSnapshot) and per
the repo guidelines you must run the targeted E2E suites before merging: run
sandbox-survival-e2e, sandbox-operations-e2e, and skip-permissions-e2e
(nightly/targeted) and attach their green CI artifacts; if any test fails,
reproduce locally against your branch and fix regressions in
probeGatewayRunning/sandboxSnapshot before re-running tests. Ensure the test run
names and results are included in the PR description or CI status so reviewers
can confirm the required E2E suites passed.
src/lib/onboard.ts (1)

7685-8359: Run the targeted E2E matrix for this onboarding change set before merge.

Given the breadth of behavior changes in core onboarding, run cloud-e2e, sandbox-operations-e2e, and rebuild-openclaw-e2e on this branch to reduce regression risk around resume/reuse/rebuild paths.

As per coding guidelines: src/lib/onboard.ts is core onboarding logic and recommends these specific E2E jobs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 7685 - 8359, This change touches core
onboarding logic in function onboard (src/lib/onboard.ts), so before merging run
the targeted E2E matrix to catch regressions: execute cloud-e2e,
sandbox-operations-e2e, and rebuild-openclaw-e2e on this branch and verify
resume/reuse/rebuild paths (resume, gateway/sandbox reuse, inference/provider
selection, sandbox create/rebuild, and agent/openclaw setup) all pass; if any
fail, reproduce locally against the failing step in onboard (e.g.,
provider_selection, inference, sandbox, gateway, openclaw) and fix the failing
logic before merging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/install.sh`:
- Around line 320-362: The watcher is being launched and later treated as a
success even when curl is unavailable, causing false success and restart thrash;
guard watcher creation and the success check by ensuring curl is present: before
writing watcher_script and spawning the detached Node watcher (the block that
defines watcher_script and the node -e spawn), add a check that command_exists
curl is true and only then start the watcher, and remove or change the later
branch that returns 0 when curl is missing (the block that reads watcher_pid and
returns 0 if !command_exists curl && [[ -n "$watcher_pid" ]] ...) so we do not
treat a running watcher as healthy without curl. Use the symbols watcher_script,
tick/healthy (in the embedded script), the node -e spawn invocation, and
watcher_pid to locate and update the logic.

In `@src/lib/build-context.ts`:
- Around line 76-83: The TLS recovery hint prints a hard-coded gateway name
("nemoclaw") which mixes branding; in the failure.kind === "tls_cert_mismatch"
branch replace the literal "openshell gateway trust -g nemoclaw" string with one
that uses the CLI_NAME constant (e.g. use string interpolation so the command
reads `${CLI_NAME} gateway trust -g ${CLI_NAME}` or similar) so the trust hint
and the resume hint both use CLI_NAME consistently.

In `@src/lib/local-inference.ts`:
- Around line 79-80: getResolvedOllamaHost() currently returns
_resolvedOllamaHost or OLLAMA_LOCALHOST without attempting to resolve the actual
reachable host; update it to call findReachableOllamaHost() when
_resolvedOllamaHost is unset so WSL/docker cases (e.g., host.docker.internal)
are discovered before falling back to OLLAMA_LOCALHOST; specifically, inside
getResolvedOllamaHost() call findReachableOllamaHost(), await or handle its
result as appropriate, assign that value to _resolvedOllamaHost, and then return
_resolvedOllamaHost or OLLAMA_LOCALHOST as a final fallback.

In `@src/lib/nim.ts`:
- Around line 96-113: The perGpuMB field currently uses parsed[0].memoryMB which
misreports heterogeneous hosts; update the logic in the block that builds the
return object (using parsed: ParsedGpu[], allSameName, canRunNimWithMemory) to
only set perGpuMB when every GPU has the same memory (compute a new
allSameMemory check like parsed.every(p => p.memoryMB === parsed[0].memoryMB))
or else omit/leave perGpuMB undefined; leave totalMemoryMB and nimCapable as-is.

In `@src/lib/onboard-ollama-proxy.ts`:
- Around line 497-548: unloadOllamaModels currently fires the /api/ps GET and
spawns /api/generate POSTs but returns immediately; change unloadOllamaModels to
return a Promise that resolves after the /api/ps response is fully parsed and
all unload POSTs have completed (or rejected on fatal error). Implement by
wrapping the initial http.get call in a Promise, collecting per-unload request
Promises (resolve on 'finish' or 'end' and reject on 'error' or timeout) for
each unloadReq created in the loop, and resolve the main Promise only after
Promise.all of those unload Promises settles; keep the best-effort semantics by
handling individual unload errors but ensure the returned Promise only settles
when all unload attempts have finished so callers can await it.

In `@src/lib/onboard.ts`:
- Around line 4582-4589: The vllm-local ambiguity is being collapsed too early
by providerNameToOptionKey (or the name === "vllm-local" mapping) because it
maps "vllm-local" to "vllm" whenever opts.hasNimContainer is falsy, which
prevents the downstream ambiguity branch from running during non-interactive
recovery; update the mapping logic (providerNameToOptionKey or the initial name
normalization) to preserve the "vllm-local" sentinel when recovery is
non-interactive or when ambiguity needs to be resolved (e.g., check
opts.recoveryNonInteractive or equivalent) so the later code that inspects
opts.hasNimContainer and the ambiguity branch can run, and only collapse
"vllm-local" to "vllm" when it is safe (explicit hasNimContainer true/false in
interactive flows) — adjust callers referencing name === "vllm-local" and
opts.hasNimContainer accordingly to handle the preserved sentinel.

In `@src/lib/sandbox-build-context.ts`:
- Around line 78-80: The optimized build context is missing scripts/rcf_patch.py
caused by the whitelist used when creating the optimized root-scripts; update
the whitelist logic in sandbox-build-context.ts (the code that copies from
sourceBlueprintDir to stagedBlueprintDir using fs.cpSync and path.join) to
include "scripts/rcf_patch.py" so the optimized build context contains that
file; locate the whitelist/allowedFiles array or function that filters
root-scripts and add "scripts/rcf_patch.py" (or broaden the scripts/* inclusion)
so the Dockerfile COPY for scripts/rcf_patch.py will find the file during
optimized builds.

In `@src/nemoclaw.ts`:
- Around line 3852-3855: The error message printed when the Docker container
probe fails is misleadingly saying "Failed to query live sandbox state from
OpenShell"; update the messaging in the probe failure branches (the check that
calls probeGatewayRunning() around the if block and the similar branch at lines
handling the Docker probe near 3917-3920) to clearly indicate a Docker
probe/container health check failure (e.g., mention "Docker probe" or "container
probe" and the probe function name probeGatewayRunning) so operators know the
failure originated from the Docker/container probe rather than an OpenShell
query; modify both branches to use consistent, explicit text and include any
available probe error details if present.
- Around line 3838-3845: The gateway probe uses dockerInspect in
probeGatewayRunning and lacks a timeout; modify the dockerInspect invocation to
pass the shared OPENSHELL_PROBE_TIMEOUT_MS as the operation timeout (alongside
the existing ignoreError and suppressOutput options) so the call fails fast when
Docker is unresponsive; update the dockerInspect call in probeGatewayRunning to
include the timeout option and keep the existing return logic based on
result.status and result.stdout.

---

Outside diff comments:
In `@src/lib/onboard.ts`:
- Around line 3658-3666: The second URL construction using new URL(...) when
process.env.CHAT_UI_URL lacks "://" can throw on malformed values; update the
block that sets parsed (the branch where controlUiPort == null) to validate or
wrap the parse in a try/catch (around new URL and subsequent parsed.port
assignment) and handle failures by falling back to a safe default or skipping
chatUiUrl assignment so sandbox creation doesn't abort; reference the
variables/idents involved: process.env.CHAT_UI_URL, controlUiPort, parsed,
effectivePort, and chatUiUrl.
- Line 1545: The function agentSupportsWebSearch() currently only detects ARG
lines that include an equals sign; update its regex to also match the valid
Dockerfile form without a default value. Replace the current
/^\s*ARG\s+NEMOCLAW_WEB_SEARCH_ENABLED=/m test with a pattern that allows either
an equals sign with an optional value or the end of the token (for example
matching /^\s*ARG\s+NEMOCLAW_WEB_SEARCH_ENABLED(?:\s*=.*)?/m or
/^\s*ARG\s+NEMOCLAW_WEB_SEARCH_ENABLED(?:\s*=|$)/m), ensuring whitespace is
tolerated; keep the change localized to agentSupportsWebSearch().

In `@src/lib/services.ts`:
- Around line 304-324: The banner's fixed pad width (padEnd(52)) in the services
display causes the right border to misalign when AGENT_PRODUCT_NAME makes the
line longer; update the rendering to compute the box width dynamically from the
longest rendered line (use the strings built for CLI_DISPLAY_NAME + " Services",
the Public URL line when tunnelUrl exists, and the Messaging line that
concatenates AGENT_PRODUCT_NAME) and then pad each line to that computed width
(instead of hardcoded 52) before printing the top/bottom borders and side bars;
adjust the padEnd calls and border strings accordingly so CLI_DISPLAY_NAME,
AGENT_PRODUCT_NAME, and tunnelUrl are all referenced in the width calculation
and stay within the box.

---

Nitpick comments:
In `@Dockerfile`:
- Around line 371-495: The PR touches Dockerfile image-layer and
runtime-permission logic (the large RUN migration block and subsequent RUN
chown/chmod and sha256sum steps) but the reviewer requests you run the required
E2E matrices before merging; please execute cloud-e2e, sandbox-survival-e2e,
hermes-e2e, and rebuild-openclaw-e2e against this image change, capture and
attach the failing/passing job logs, and only merge if all pass (or document any
accepted failures with rationale), referencing the Dockerfile RUN migration
block, the chown/chmod lines, and the .config-hash creation so reviewers can
correlate test results to the exact changes.

In `@src/lib/onboard.ts`:
- Around line 7685-8359: This change touches core onboarding logic in function
onboard (src/lib/onboard.ts), so before merging run the targeted E2E matrix to
catch regressions: execute cloud-e2e, sandbox-operations-e2e, and
rebuild-openclaw-e2e on this branch and verify resume/reuse/rebuild paths
(resume, gateway/sandbox reuse, inference/provider selection, sandbox
create/rebuild, and agent/openclaw setup) all pass; if any fail, reproduce
locally against the failing step in onboard (e.g., provider_selection,
inference, sandbox, gateway, openclaw) and fix the failing logic before merging.

In `@src/lib/sandbox-state.ts`:
- Around line 559-566: Run the end-to-end tests to validate backup/restore
behavior for the lifecycle-path changes: on this branch execute the
snapshot-commands-e2e (snapshot create/list/restore lifecycle) and
rebuild-openclaw-e2e (workspace state survives rebuild) suites and confirm they
pass; focus validation around the backup/restore flow exercised by
backupSandboxState (and the related code around lines ~803-815) to ensure
sandbox state backup and restore work correctly after your changes before
merging.

In `@src/nemoclaw.ts`:
- Around line 3835-3922: The PR touches snapshot/gateway probing and recovery
(functions probeGatewayRunning and sandboxSnapshot) and per the repo guidelines
you must run the targeted E2E suites before merging: run sandbox-survival-e2e,
sandbox-operations-e2e, and skip-permissions-e2e (nightly/targeted) and attach
their green CI artifacts; if any test fails, reproduce locally against your
branch and fix regressions in probeGatewayRunning/sandboxSnapshot before
re-running tests. Ensure the test run names and results are included in the PR
description or CI status so reviewers can confirm the required E2E suites
passed.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8508b574-b2f5-4515-88b1-320ecf0cb880

📥 Commits

Reviewing files that changed from the base of the PR and between b0edc0f and 9f38794.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (53)
  • Dockerfile
  • package.json
  • scripts/install.sh
  • src/lib/agent-defs.test.ts
  • src/lib/agent-defs.ts
  • src/lib/agent-onboard.test.ts
  • src/lib/agent-onboard.ts
  • src/lib/agent-runtime.test.ts
  • src/lib/agent-runtime.ts
  • src/lib/branding.ts
  • src/lib/build-context.ts
  • src/lib/command-registry.test.ts
  • src/lib/command-registry.ts
  • src/lib/credentials.ts
  • src/lib/deploy.test.ts
  • src/lib/deploy.ts
  • src/lib/inference-config.test.ts
  • src/lib/inference-config.ts
  • src/lib/inventory-commands.test.ts
  • src/lib/inventory-commands.ts
  • src/lib/list-command-deps.ts
  • src/lib/list-command.ts
  • src/lib/local-inference.test.ts
  • src/lib/local-inference.ts
  • src/lib/nim.test.ts
  • src/lib/nim.ts
  • src/lib/oclif-commands.ts
  • src/lib/oclif-runner.ts
  • src/lib/onboard-command.test.ts
  • src/lib/onboard-command.ts
  • src/lib/onboard-inference-probes.ts
  • src/lib/onboard-ollama-proxy.ts
  • src/lib/onboard.ts
  • src/lib/openshell-timeouts.test.ts
  • src/lib/openshell-timeouts.ts
  • src/lib/policies.ts
  • src/lib/registry.ts
  • src/lib/resolve-openshell.test.ts
  • src/lib/resolve-openshell.ts
  • src/lib/sandbox-build-context.ts
  • src/lib/sandbox-config.ts
  • src/lib/sandbox-state.ts
  • src/lib/sandbox-version.test.ts
  • src/lib/services.ts
  • src/lib/share-command-deps.ts
  • src/lib/share-command.ts
  • src/lib/shields-timer.ts
  • src/lib/shields.ts
  • src/lib/skill-install.test.ts
  • src/lib/skill-install.ts
  • src/lib/validation.test.ts
  • src/lib/validation.ts
  • src/nemoclaw.ts
💤 Files with no reviewable changes (3)
  • src/lib/inference-config.test.ts
  • src/lib/inference-config.ts
  • src/lib/registry.ts
✅ Files skipped from review due to trivial changes (5)
  • package.json
  • src/lib/deploy.test.ts
  • src/lib/openshell-timeouts.ts
  • src/lib/sandbox-version.test.ts
  • src/lib/deploy.ts

Comment thread scripts/install.sh
Comment thread src/lib/build-context.ts
Comment on lines +76 to +83
if (failure.kind === "tls_cert_mismatch") {
console.error(
" Hint: TLS certificate mismatch — the gateway certificate changed since the CLI last trusted it.",
);
console.error(" Fix: openshell gateway trust -g nemoclaw");
console.error(` Then: ${CLI_NAME} onboard --resume`);
return;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Keep the TLS recovery hint branding-consistent.

This branch already uses CLI_NAME for the resume step, but the trust command still hard-codes nemoclaw. That produces a mixed recovery flow for branded runs.

Suggested fix
-    console.error("  Fix:  openshell gateway trust -g nemoclaw");
+    console.error(`  Fix:  openshell gateway trust -g ${CLI_NAME}`);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if (failure.kind === "tls_cert_mismatch") {
console.error(
" Hint: TLS certificate mismatch — the gateway certificate changed since the CLI last trusted it.",
);
console.error(" Fix: openshell gateway trust -g nemoclaw");
console.error(` Then: ${CLI_NAME} onboard --resume`);
return;
}
if (failure.kind === "tls_cert_mismatch") {
console.error(
" Hint: TLS certificate mismatch — the gateway certificate changed since the CLI last trusted it.",
);
console.error(` Fix: openshell gateway trust -g ${CLI_NAME}`);
console.error(` Then: ${CLI_NAME} onboard --resume`);
return;
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/build-context.ts` around lines 76 - 83, The TLS recovery hint prints
a hard-coded gateway name ("nemoclaw") which mixes branding; in the failure.kind
=== "tls_cert_mismatch" branch replace the literal "openshell gateway trust -g
nemoclaw" string with one that uses the CLI_NAME constant (e.g. use string
interpolation so the command reads `${CLI_NAME} gateway trust -g ${CLI_NAME}` or
similar) so the trust hint and the resume hint both use CLI_NAME consistently.

Comment on lines +79 to +80
export function getResolvedOllamaHost(): string {
return _resolvedOllamaHost ?? OLLAMA_LOCALHOST;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Resolve the Ollama host before falling back to localhost.

getResolvedOllamaHost() never calls findReachableOllamaHost(), so all of the new WSL-aware callers still default to 127.0.0.1 unless some unrelated path happened to seed the cache first. On WSL setups where Ollama is only reachable via host.docker.internal, validation, model discovery, warmup, and probe flows will still fail against localhost and the new fallback never engages.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/local-inference.ts` around lines 79 - 80, getResolvedOllamaHost()
currently returns _resolvedOllamaHost or OLLAMA_LOCALHOST without attempting to
resolve the actual reachable host; update it to call findReachableOllamaHost()
when _resolvedOllamaHost is unset so WSL/docker cases (e.g.,
host.docker.internal) are discovered before falling back to OLLAMA_LOCALHOST;
specifically, inside getResolvedOllamaHost() call findReachableOllamaHost(),
await or handle its result as appropriate, assign that value to
_resolvedOllamaHost, and then return _resolvedOllamaHost or OLLAMA_LOCALHOST as
a final fallback.

Comment thread src/lib/nim.ts
Comment on lines +96 to 113
if (parsed.length > 0) {
const totalMemoryMB = parsed.reduce(
(sum: number, p: ParsedGpu) => sum + p.memoryMB,
0,
);
const firstName = parsed[0].name;
// Only surface a single name when every GPU reports the same model;
// a mixed-GPU host would otherwise be misreported as `Nx <firstName>`.
const allSameName =
!!firstName && parsed.every((p: ParsedGpu) => p.name === firstName);
return {
type: "nvidia",
count: perGpuMB.length,
...(allSameName ? { name: firstName } : {}),
count: parsed.length,
totalMemoryMB,
perGpuMB: perGpuMB[0],
perGpuMB: parsed[0].memoryMB,
nimCapable: canRunNimWithMemory(totalMemoryMB),
};

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Avoid using the first GPU’s memory on heterogeneous hosts.

Omitting name for mixed-model machines prevents one kind of misreporting, but perGpuMB still comes from parsed[0], so mixed-memory hosts can surface the wrong per-GPU size.

Consider deriving perGpuMB from the parsed set or leaving it unset when the GPUs don’t match.

Suggested fix
-          perGpuMB: parsed[0].memoryMB,
+          perGpuMB: allSameName ? parsed[0].memoryMB : Math.floor(totalMemoryMB / parsed.length),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/nim.ts` around lines 96 - 113, The perGpuMB field currently uses
parsed[0].memoryMB which misreports heterogeneous hosts; update the logic in the
block that builds the return object (using parsed: ParsedGpu[], allSameName,
canRunNimWithMemory) to only set perGpuMB when every GPU has the same memory
(compute a new allSameMemory check like parsed.every(p => p.memoryMB ===
parsed[0].memoryMB)) or else omit/leave perGpuMB undefined; leave totalMemoryMB
and nimCapable as-is.

Comment on lines +497 to +548
function unloadOllamaModels() {
try {
const req = http.get(
{
hostname: "localhost",
port: OLLAMA_PORT,
path: "/api/ps",
timeout: 3000,
},
(res) => {
let data = "";
res.on("data", (chunk) => {
data += chunk;
});
res.on("end", () => {
if (res.statusCode !== 200) return;
try {
const parsed = JSON.parse(data);
const models = parsed.models || [];
for (const entry of models) {
if (!entry.name) continue;
const unloadReq = http.request(
{
hostname: "localhost",
port: OLLAMA_PORT,
path: "/api/generate",
method: "POST",
timeout: 3000,
headers: { "Content-Type": "application/json" },
},
() => {
/* ignore response */
},
);
unloadReq.on("error", () => {
/* best-effort */
});
unloadReq.write(JSON.stringify({ model: entry.name, keep_alive: 0 }));
unloadReq.end();
}
} catch {
/* best-effort */
}
});
},
);
req.on("error", () => {
/* best-effort */
});
} catch {
/* best-effort */
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Wait for the unload requests to finish before returning.

unloadOllamaModels() dispatches the /api/ps request and then returns immediately. Any caller that continues tearing services down can exit or stop Ollama before the /api/generate unload POSTs are sent, so this new GPU cleanup becomes race-prone and often a no-op. Please return a promise that settles after the list request completes and all unload requests finish, then await it from the stop path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard-ollama-proxy.ts` around lines 497 - 548, unloadOllamaModels
currently fires the /api/ps GET and spawns /api/generate POSTs but returns
immediately; change unloadOllamaModels to return a Promise that resolves after
the /api/ps response is fully parsed and all unload POSTs have completed (or
rejected on fatal error). Implement by wrapping the initial http.get call in a
Promise, collecting per-unload request Promises (resolve on 'finish' or 'end'
and reject on 'error' or timeout) for each unloadReq created in the loop, and
resolve the main Promise only after Promise.all of those unload Promises
settles; keep the best-effort semantics by handling individual unload errors but
ensure the returned Promise only settles when all unload attempts have finished
so callers can await it.

Comment thread src/lib/onboard.ts
Comment thread src/lib/sandbox-build-context.ts
Comment thread src/lib/shields.ts
Comment on lines +259 to +299
function applyStateDirLockMode(sandboxName: string, configDir: string, owner: string): void {
for (const dirName of HIGH_RISK_STATE_DIRS) {
const dirPath = `${configDir}/${dirName}`;
try {
kubectlExec(sandboxName, ["chown", "-R", owner, dirPath]);
} catch {
// Directory may not exist for this agent — silently skip
}
try {
kubectlExec(sandboxName, ["chmod", "755", dirPath]);
kubectlExec(sandboxName, ["chmod", "-R", "go-w", dirPath]);
} catch {
// Silently skip
}
}

// Multi-agent OpenClaw workspaces are named workspace-<agent>. They are
// discovered dynamically because they are configured by openclaw.json.
try {
kubectlExec(sandboxName, [
"sh",
"-c",
`
set -u
config_dir="$1"
owner="$2"
for dir in "$config_dir"/workspace-*; do
[ -d "$dir" ] || continue
chown -R "$owner" "$dir" 2>/dev/null || true
chmod 755 "$dir" 2>/dev/null || true
chmod -R go-w "$dir" 2>/dev/null || true
done
`,
"sh",
configDir,
owner,
]);
} catch {
// Best effort; verification below catches the primary config lock.
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Verify the high-risk state directories instead of silently skipping failures.

applyStateDirLockMode() swallows every kubectlExec failure, and neither caller checks whether the directories that actually exist ended up with the requested owner/mode. If one of these operations fails, shields up can still persist a locked state while skills, hooks, workspace, or other NC-2227-05 paths remain writable. Please bubble up failures for existing directories, or add a post-check before saving state.

Comment thread src/nemoclaw.ts
Comment on lines +3838 to +3845
function probeGatewayRunning(): boolean {
const container = `openshell-cluster-${NEMOCLAW_GATEWAY_NAME}`;
const result = dockerInspect(
["--type", "container", "--format", "{{.State.Running}}", container],
{ ignoreError: true, suppressOutput: true },
);
return result.status === 0 && String(result.stdout || "").trim() === "true";
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Locate dockerInspect wrapper and confirm supported options (timeout, suppressOutput)."
rg -n -C4 '\bdockerInspect\b' src lib

echo
echo "Show callsites using dockerInspect to compare option shapes."
rg -n -C2 '\bdockerInspect\s*\(' src

Repository: NVIDIA/NemoClaw

Length of output: 4032


🏁 Script executed:

cat -n src/lib/docker/run.ts | head -100

Repository: NVIDIA/NemoClaw

Length of output: 871


🏁 Script executed:

rg -n 'OPENSHELL_PROBE_TIMEOUT|timeout.*MS|TIMEOUT_MS' src lib

Repository: NVIDIA/NemoClaw

Length of output: 4738


🏁 Script executed:

rg -n 'type DockerRunOptions|interface DockerRunOptions' src lib

Repository: NVIDIA/NemoClaw

Length of output: 797


🏁 Script executed:

rg -n 'export.*function run|export.*const run' src/lib/runner --max-count 5

Repository: NVIDIA/NemoClaw

Length of output: 114


🏁 Script executed:

cat -n src/lib/runner.ts | head -100

Repository: NVIDIA/NemoClaw

Length of output: 3782


🏁 Script executed:

cat -n src/lib/runner.ts | sed -n '85,200p'

Repository: NVIDIA/NemoClaw

Length of output: 5146


🏁 Script executed:

rg -n 'probeGatewayRunning' src/nemoclaw.ts

Repository: NVIDIA/NemoClaw

Length of output: 188


🏁 Script executed:

cat -n src/nemoclaw.ts | sed -n '3850,3925p'

Repository: NVIDIA/NemoClaw

Length of output: 3943


Add a timeout to the gateway probe.

Line 3840 introduces a blocking dockerInspect call without an explicit timeout. If Docker is unhealthy, snapshot create/restore commands can hang indefinitely instead of failing fast. The constant OPENSHELL_PROBE_TIMEOUT_MS is already imported and widely used in the codebase for similar probes.

Suggested patch
 function probeGatewayRunning(): boolean {
   const container = `openshell-cluster-${NEMOCLAW_GATEWAY_NAME}`;
   const result = dockerInspect(
     ["--type", "container", "--format", "{{.State.Running}}", container],
-    { ignoreError: true, suppressOutput: true },
+    { ignoreError: true, suppressOutput: true, timeout: OPENSHELL_PROBE_TIMEOUT_MS },
   );
   return result.status === 0 && String(result.stdout || "").trim() === "true";
 }

Run recommended E2E suites before merge: sandbox-survival-e2e, sandbox-operations-e2e.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
function probeGatewayRunning(): boolean {
const container = `openshell-cluster-${NEMOCLAW_GATEWAY_NAME}`;
const result = dockerInspect(
["--type", "container", "--format", "{{.State.Running}}", container],
{ ignoreError: true, suppressOutput: true },
);
return result.status === 0 && String(result.stdout || "").trim() === "true";
}
function probeGatewayRunning(): boolean {
const container = `openshell-cluster-${NEMOCLAW_GATEWAY_NAME}`;
const result = dockerInspect(
["--type", "container", "--format", "{{.State.Running}}", container],
{ ignoreError: true, suppressOutput: true, timeout: OPENSHELL_PROBE_TIMEOUT_MS },
);
return result.status === 0 && String(result.stdout || "").trim() === "true";
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/nemoclaw.ts` around lines 3838 - 3845, The gateway probe uses
dockerInspect in probeGatewayRunning and lacks a timeout; modify the
dockerInspect invocation to pass the shared OPENSHELL_PROBE_TIMEOUT_MS as the
operation timeout (alongside the existing ignoreError and suppressOutput
options) so the call fails fast when Docker is unresponsive; update the
dockerInspect call in probeGatewayRunning to include the timeout option and keep
the existing return logic based on result.status and result.stdout.

Comment thread src/nemoclaw.ts
Comment on lines +3852 to 3855
if (!probeGatewayRunning()) {
console.error(" Failed to query live sandbox state from OpenShell.");
process.exit(1);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify snapshot failure messaging for Docker-probe failures.

Lines 3853 and 3918 say OpenShell query failed, but this branch is triggered by the Docker container probe. The current message can misdirect troubleshooting.

Also applies to: 3917-3920

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/nemoclaw.ts` around lines 3852 - 3855, The error message printed when the
Docker container probe fails is misleadingly saying "Failed to query live
sandbox state from OpenShell"; update the messaging in the probe failure
branches (the check that calls probeGatewayRunning() around the if block and the
similar branch at lines handling the Docker probe near 3917-3920) to clearly
indicate a Docker probe/container health check failure (e.g., mention "Docker
probe" or "container probe" and the probe function name probeGatewayRunning) so
operators know the failure originated from the Docker/container probe rather
than an OpenShell query; modify both branches to use consistent, explicit text
and include any available probe error details if present.

BenediktSchackenberg added a commit to BenediktSchackenberg/NemoClaw that referenced this pull request May 1, 2026
CodeRabbit review on NVIDIA#2696 flagged that the optimized build context
(sandbox-build-context.ts) whitelists individual scripts but was missing
scripts/rcf_patch.py. Since the Dockerfile COPYs it before the patching
RUN step, optimized nemoclaw <name> rebuild runs would fail with
'file not found in build context'.

Signed-off-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com>
@BenediktSchackenberg

Copy link
Copy Markdown
Contributor Author

Fixed the critical one: added scripts/rcf_patch.py to the optimized build context in sandbox-build-context.ts — without it, nemoclaw <name> rebuild would fail with 'file not found in build context' on optimized builds.

The other 9 comments are on files outside this PR's diff (upstream changes pulled in during rebase). Scoping this PR to only the Dockerfile/rcf_patch.py change.

@BenediktSchackenberg BenediktSchackenberg force-pushed the fix/dockerfile-openclaw-version-2689 branch from 9836c9d to 6427855 Compare May 1, 2026 20:28
BenediktSchackenberg added a commit to BenediktSchackenberg/NemoClaw that referenced this pull request May 1, 2026
…ness

Copilot review on NVIDIA#2696 flagged that even the regex-based inline python
one-liner is still fragile and hard to maintain. Extract the patch logic
to scripts/rcf_patch.py which uses a broad DOTALL regex anchored on the
function-call names rather than exact whitespace, making it resilient to
formatting changes across OpenClaw versions.

Per Copilot review on PR NVIDIA#2696.

Signed-off-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com>
BenediktSchackenberg added a commit to BenediktSchackenberg/NemoClaw that referenced this pull request May 1, 2026
…aceConfigFile

Two issues from review on PR NVIDIA#2696:

1. SCRIPT_DIR is not defined in the Docker build context so the previous
   call to python3 "${SCRIPT_DIR}/scripts/rcf_patch.py" would always fail.
   Fix: add COPY scripts/rcf_patch.py /usr/local/lib/nemoclaw/rcf_patch.py
   before the RUN block and call it via the absolute in-image path.

2. The regex in rcf_patch.py could still miss due to ! await spacing and
   was not scoped to the replaceConfigFile function body.
   Fix: scope the match to the function body (walk braces to find it),
   and make all token separators whitespace-tolerant (! \s* await, \(\s*\{,
   ,\s*\{, optional trailing semicolon).

Per CodeRabbit/Copilot review on PR NVIDIA#2696.

Signed-off-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com>
BenediktSchackenberg added a commit to BenediktSchackenberg/NemoClaw that referenced this pull request May 1, 2026
CodeRabbit review on NVIDIA#2696 flagged that the optimized build context
(sandbox-build-context.ts) whitelists individual scripts but was missing
scripts/rcf_patch.py. Since the Dockerfile COPYs it before the patching
RUN step, optimized nemoclaw <name> rebuild runs would fail with
'file not found in build context'.

Signed-off-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com>
@ericksoa

ericksoa commented May 2, 2026

Copy link
Copy Markdown
Contributor

Thanks for driving this. I checked the helper against the current openclaw@2026.4.24 dist shape, and this does address the reported failure path.

I found one follow-on robustness gap: the helper is whitespace-tolerant, but not actually property-order-tolerant for the tryWriteSingleTopLevelIncludeMutation({ snapshot, nextConfig }) object. I do not think that needs to block this PR once the current CI/DCO items are fixed, because the pinned 2026.4.24 package uses the order this helper matches.

I opened #2875 to track that as future hardening, and you are very welcome to pick it up if you agree with the feedback.

@ericksoa

ericksoa commented May 2, 2026

Copy link
Copy Markdown
Contributor

@BenediktSchackenberg I opened #2876 to replay this change onto a fresh branch and get it through the current gates. I kept the implementation based on your #2696 work, included you as co-author in the commit, and credited the original PR in the replacement PR body.

Thanks again for digging into the OpenClaw 2026.4.24 patch failure. This should preserve that fix while resolving the DCO/hadolint blocker and folding in the small property-order hardening follow-up.

ericksoa added a commit that referenced this pull request May 2, 2026
Replay the intent of #2696 by moving the OpenClaw replaceConfigFile patch into a staged helper script, then keep the Dockerfile patch path available in optimized rebuild contexts.

Harden the helper scanner so braces in quoted strings or comments cannot terminate replaceConfigFile body detection early, and add focused regression coverage for property-order drift, comment/string braces, and fail-close behavior.

Fixes #2689.
Closes #2875.

Co-authored-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
…aceConfigFile

Two issues from review on PR NVIDIA#2696:

1. SCRIPT_DIR is not defined in the Docker build context so the previous
   call to python3 "${SCRIPT_DIR}/scripts/rcf_patch.py" would always fail.
   Fix: add COPY scripts/rcf_patch.py /usr/local/lib/nemoclaw/rcf_patch.py
   before the RUN block and call it via the absolute in-image path.

2. The regex in rcf_patch.py could still miss due to ! await spacing and
   was not scoped to the replaceConfigFile function body.
   Fix: scope the match to the function body (walk braces to find it),
   and make all token separators whitespace-tolerant (! \s* await, \(\s*\{,
   ,\s*\{, optional trailing semicolon).

Per CodeRabbit/Copilot review on PR NVIDIA#2696.

Signed-off-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com>
@BenediktSchackenberg BenediktSchackenberg force-pushed the fix/dockerfile-openclaw-version-2689 branch from 2c65003 to df82a7c Compare May 2, 2026 14:21
cjagwani added a commit that referenced this pull request May 5, 2026
Stress-test against 5 historical NemoClaw cases surfaced two real spec
gaps that produced incorrect classifications:

1. Evidence-required filter was too strict for partial-fix detection.
   The LLM couldn't cite "PR diff line" when the gap is what the PR
   did NOT touch. Extended evidence to three citation shapes:
   - DIRECT (original): cite specific PR line + issue symptom
   - BY-OMISSION: cite PR scope + issue symptom showing same bug class
     in different instances PR did not touch (catches partial-fix)
   - FOLLOW-ON: cite the symbol/file PR introduced + issue's request
     to harden the same code (catches "PR introduced X, now harden X")

2. ADJACENT_FIX wording assumed the issue was already broken and the
   PR resolves it. Reframed to also include "PR opens a clear follow-on
   path on the same code the PR just touched" — covers the common
   pattern where an issue requests hardening of code a PR just
   introduced.

Test cases that drove these patches:
- #2700#2762 (partial-fix, was failing — now passes via by-omission)
- #2696#2875 (follow-on hardening, was ambiguous — now explicit)
@wscurran

wscurran commented May 8, 2026

Copy link
Copy Markdown
Contributor

Closing as superseded by #2876. That PR replayed this Dockerfile replaceConfigFile hardening onto a fresh branch, included @BenediktSchackenberg as co-author, resolved the gate blockers, and merged on 2026-05-02. Thanks again for digging into the OpenClaw 2026.4.24 patch failure.

@wscurran wscurran closed this May 8, 2026
@wscurran wscurran added area: packaging Packages, images, registries, installers, or distribution chore Build, CI, dependency, or tooling maintenance platform: container Affects Docker, containerd, Podman, or images and removed Docker labels Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: packaging Packages, images, registries, installers, or distribution chore Build, CI, dependency, or tooling maintenance dependencies Pull requests that update a dependency file platform: container Affects Docker, containerd, Podman, or images

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants