fix(sandbox): refresh plugin registry after gateway start to recover non-bundled plugins (#2021)#4681
Conversation
…non-bundled plugins Under GPU sandbox onboard, OpenClaw's policy-change registry regen rebuilds plugins[] from bundled extensions only, dropping path-origin (nemoclaw) and npm-origin (openclaw-weixin) entries from the runtime registry view. Their installRecords survive on disk, but the runtime view forgets them — so the /nemoclaw slash command is unreachable in the TUI and `openclaw plugins inspect nemoclaw` returns "Plugin not found". Reproduced today byte-for-byte on a fresh Brev T4 GPU instance, matching wangericnv's evidence on a 0690 Spark (#2021). Run `openclaw plugins registry --refresh` as the sandbox user after the gateway has started. Backgrounded so the gateway-wait loop is not blocked; PID is tracked in SANDBOX_CHILD_PIDS so SIGTERM still reaps it. Failure is non-fatal so the gateway can still serve other plugins. Runs once per cold start, so it also covers later policy mutations that re-trigger the regen. This is a temporary workaround. The permanent fix is upstream in OpenClaw's regen logic (openclaw/openclaw#89606) — once that lands, this block should be removed. Fixes #2021 Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
Extracts the workaround block from scripts/nemoclaw-start.sh and drives it under bash with a stub openclaw binary. Verifies: 1. The refresh fires once `openclaw gateway status` reports ready. 2. The refresh runs with HOME=/sandbox even when the parent shell has HOME=/root — protects against the root-mode install bug where the plugin lands in /root/.openclaw/extensions/ instead of /sandbox/.openclaw/extensions/ and silently fails to repopulate the runtime plugins[]. 3. PLUGIN_REFRESH_PID is captured and appended to SANDBOX_CHILD_PIDS so SIGTERM cleanup reaps the backgrounded subshell. 4. The loop waits across multiple `gateway status` failures before refreshing — simulates cold-start where the gateway needs a few seconds to start serving. Behavior-shaped (not source-text); source-shape budget stays at 0. Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a background, gateway-readiness-gated plugin registry refresh that logs to /tmp/nemoclaw-plugin-refresh.log, hardens /tmp log validation (rejects symlinks/non-regular files and tightens modes/ownership), and adds extraction/harness unit tests plus an e2e SSH plugin presence check. ChangesPlugin registry refresh workaround
Sequence Diagram(s)sequenceDiagram
participant StartScript as start_plugin_registry_refresh
participant OpenClaw as openclaw
participant LogFile as /tmp/nemoclaw-plugin-refresh.log
StartScript->>OpenClaw: gateway status probe
OpenClaw-->>StartScript: status response
alt gateway becomes ready
StartScript->>OpenClaw: HOME=/sandbox plugins registry --refresh
OpenClaw->>LogFile: append stdout/stderr
StartScript->>StartScript: write PLUGIN_REFRESH_PID, append to SANDBOX_CHILD_PIDS
else gateway never ready (after probes)
StartScript-->>StartScript: skip refresh (non-fatal)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Auto-dispatched E2E: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
E2E Scenario Advisor RecommendationRequired scenario E2E: Dispatch required scenario E2E:
Full scenario advisor summaryE2E Scenario AdvisorBase: Required scenario E2E
Optional scenario E2E
Relevant changed files
|
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/nemoclaw-start-plugin-refresh.test.ts`:
- Line 154: The TypeScript error occurs because result.stdout is typed as string
| NonSharedBuffer; in the regex match assignment for tracked you should cast
result.stdout to string before calling .match (e.g., use (result.stdout as
string)). Update the expression that defines tracked (the line using
result.stdout.match(/^SANDBOX_CHILD_PIDS=(.+)$/m) in the test) to cast
result.stdout to string so the regex call is properly typed.
- Line 151: The TypeScript error comes from using result.stdout.match(...) where
result.stdout is typed string | NonSharedBuffer; fix by normalizing stdout to a
string before calling match: create a local stdout variable (e.g., const stdout
= typeof result.stdout === 'string' ? result.stdout :
result.stdout.toString('utf8')) and then use
stdout.match(/^PLUGIN_REFRESH_PID=(\d+)$/m)?.[1] to assign pid; reference the
pid assignment and result.stdout in the test.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c850cf94-983b-477c-8d2a-657a7cece7de
📒 Files selected for processing (2)
scripts/nemoclaw-start.shtest/nemoclaw-start-plugin-refresh.test.ts
PR Review AdvisorFindings: 1 needs attention, 4 worth checking, 0 nice ideas Review findings🛠️ Needs attention
🔎 Worth checking
🌱 Nice ideas
Consider writing more tests for
Since last review detailsCurrent findings:
This is an automated advisory review. A human maintainer must make the final merge decision. |
Selective E2E Results — ✅ All requested jobs passedRun: 26847979075
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
scripts/nemoclaw-start.sh (1)
2796-2821: 💤 Low valueSubshell uses
returninstead ofexit.Inside the backgrounded subshell at line 2808,
return 0works but is semantically unusual — in a subshell context,exit 0is the idiomatic choice since there's no calling function to return to. The behavior is equivalent here (both exit the subshell with status 0), but usingexitmakes the intent clearer.Suggested fix
if [ "$ready" -ne 1 ]; then echo "[plugin-refresh] gateway did not become ready; skipping registry refresh" >&2 - return 0 + exit 0 fi🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/nemoclaw-start.sh` around lines 2796 - 2821, The subshell in start_plugin_registry_refresh uses "return 0" to terminate the backgrounded subshell; change that to "exit 0" so the intent to exit the subshell is clear and idiomatic — locate the start_plugin_registry_refresh function and replace the "return 0" inside the subshell (the block that checks ready and prints "[plugin-refresh] gateway did not become ready; skipping registry refresh") with "exit 0".test/e2e/test-full-e2e.sh (1)
274-286: ⚡ Quick winUse an argv array for the optional timeout wrapper.
Storing
timeout 90/gtimeout 90in a scalar and expanding it unquoted relies on word-splitting and is likely to trip ShellCheck. An array keeps this shellcheck-clean and avoids accidental argument splitting.Suggested refactor
-PLUGIN_CHECK_TIMEOUT_CMD="" -command -v timeout >/dev/null 2>&1 && PLUGIN_CHECK_TIMEOUT_CMD="timeout 90" -command -v gtimeout >/dev/null 2>&1 && PLUGIN_CHECK_TIMEOUT_CMD="gtimeout 90" +plugin_check_timeout_cmd=() +command -v timeout >/dev/null 2>&1 && plugin_check_timeout_cmd=(timeout 90) +command -v gtimeout >/dev/null 2>&1 && plugin_check_timeout_cmd=(gtimeout 90) if openshell sandbox ssh-config "$SANDBOX_NAME" >"$ssh_config" 2>/dev/null; then for plugin_attempt in 1 2 3 4 5; do - plugin_check_output=$($PLUGIN_CHECK_TIMEOUT_CMD ssh -F "$ssh_config" \ + plugin_check_output=$("${plugin_check_timeout_cmd[@]}" ssh -F "$ssh_config" \ -o StrictHostKeyChecking=no \ -o UserKnownHostsFile=/dev/null \ -o ConnectTimeout=10 \ -o LogLevel=ERROR \ "openshell-${SANDBOX_NAME}" \ "HOME=/sandbox openclaw plugins inspect nemoclaw >/tmp/nemoclaw-e2e-plugin-inspect.log 2>&1 && HOME=/sandbox openclaw nemoclaw --help >/tmp/nemoclaw-e2e-plugin-help.log 2>&1 && printf 'plugin-ok'" \ 2>&1) || trueAs per coding guidelines:
**/*.sh: All shell scripts must have shebangs and be executable, with ShellCheck enforcement via.shellcheckrcand formatting viashfmt.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/test-full-e2e.sh` around lines 274 - 286, The current PLUGIN_CHECK_TIMEOUT_CMD scalar relies on unquoted word-splitting; change it to an argv-style array (e.g. PLUGIN_CHECK_TIMEOUT_CMD=() or PLUGIN_CHECK_TIMEOUT_CMD=(timeout 90) / PLUGIN_CHECK_TIMEOUT_CMD=(gtimeout 90)) and invoke it using array expansion ("${PLUGIN_CHECK_TIMEOUT_CMD[@]}") when building the ssh command that assigns plugin_check_output; update the logic that sets PLUGIN_CHECK_TIMEOUT_CMD (the command -v checks) and the command substitution that runs ssh -F "$ssh_config" ... "openshell-${SANDBOX_NAME}" so they use the array form to avoid accidental argument splitting and satisfy ShellCheck.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/nemoclaw-start.test.ts`:
- Line 3207: The test defines pluginRefreshLog (const pluginRefreshLog) and
injects prepare_plugin_refresh_log but never asserts that the refresh log path
is actually prepared; update the test to explicitly verify pluginRefreshLog by
either asserting the test harness stdout/stderr contains the pluginRefreshLog
path or by checking the file exists on-disk (e.g.,
fs.existsSync(pluginRefreshLog)) after the run; apply the same explicit
assertion at the other occurrence noted around lines 3246-3247 where
prepare_plugin_refresh_log is used.
---
Nitpick comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 2796-2821: The subshell in start_plugin_registry_refresh uses
"return 0" to terminate the backgrounded subshell; change that to "exit 0" so
the intent to exit the subshell is clear and idiomatic — locate the
start_plugin_registry_refresh function and replace the "return 0" inside the
subshell (the block that checks ready and prints "[plugin-refresh] gateway did
not become ready; skipping registry refresh") with "exit 0".
In `@test/e2e/test-full-e2e.sh`:
- Around line 274-286: The current PLUGIN_CHECK_TIMEOUT_CMD scalar relies on
unquoted word-splitting; change it to an argv-style array (e.g.
PLUGIN_CHECK_TIMEOUT_CMD=() or PLUGIN_CHECK_TIMEOUT_CMD=(timeout 90) /
PLUGIN_CHECK_TIMEOUT_CMD=(gtimeout 90)) and invoke it using array expansion
("${PLUGIN_CHECK_TIMEOUT_CMD[@]}") when building the ssh command that assigns
plugin_check_output; update the logic that sets PLUGIN_CHECK_TIMEOUT_CMD (the
command -v checks) and the command substitution that runs ssh -F "$ssh_config"
... "openshell-${SANDBOX_NAME}" so they use the array form to avoid accidental
argument splitting and satisfy ShellCheck.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: aca83091-8fe3-49a7-9b5e-fd1d7f9d6140
📒 Files selected for processing (6)
scripts/lib/sandbox-init.shscripts/nemoclaw-start.shtest/e2e/test-full-e2e.shtest/nemoclaw-start-plugin-refresh.test.tstest/nemoclaw-start.test.tstest/sandbox-init.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- test/nemoclaw-start-plugin-refresh.test.ts
Selective E2E Results — ❌ Some jobs failedRun: 26849857215
|
There was a problem hiding this comment.
🧹 Nitpick comments (1)
test/e2e-gateway-isolation.sh (1)
516-519: ⚡ Quick winMake the sed extraction more whitespace-tolerant (current pattern works as-is)
The new range
/^apply_model_override() {/,/^}/pmatches the currentapply_model_overridefunction inscripts/nemoclaw-start.sh(non-empty extraction: 138 lines; starts withapply_model_override() {and ends with a line containing}). For robustness, consider allowing leading whitespace / brace-style variations (e.g.,^[[:space:]]*apply_model_override\(\)[[:space:]]*{…^[[:space:]]*}) and/or guarding against an empty extraction beforesource <(...)(Lines 516-519, 549).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e-gateway-isolation.sh` around lines 516 - 519, Make the sed range extraction more robust for the apply_model_override function and avoid sourcing an empty heredoc: change the sed pattern used in the source <(sed -n "...") call to accept optional leading whitespace and spacing around the parentheses/brace (match something like ^[[:space:]]*apply_model_override[[:space:]]*\(\)[[:space:]]*{ ... ^[[:space:]]*}) so it tolerates different brace/indent styles, and instead of piping directly into source capture the sed output first (e.g., into a temp file or variable), test that the extraction is non-empty (and non-zero size) and only then source it; target the invocation that currently references apply_model_override and the source <(sed -n "/^apply_model_override() {/,/^}/p" /usr/local/bin/nemoclaw-start).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@test/e2e-gateway-isolation.sh`:
- Around line 516-519: Make the sed range extraction more robust for the
apply_model_override function and avoid sourcing an empty heredoc: change the
sed pattern used in the source <(sed -n "...") call to accept optional leading
whitespace and spacing around the parentheses/brace (match something like
^[[:space:]]*apply_model_override[[:space:]]*\(\)[[:space:]]*{ ...
^[[:space:]]*}) so it tolerates different brace/indent styles, and instead of
piping directly into source capture the sed output first (e.g., into a temp file
or variable), test that the extraction is non-empty (and non-zero size) and only
then source it; target the invocation that currently references
apply_model_override and the source <(sed -n "/^apply_model_override() {/,/^}/p"
/usr/local/bin/nemoclaw-start).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 94c96490-d84b-4ad6-ab87-96474d0c78ce
📒 Files selected for processing (2)
test/e2e-gateway-isolation.shtest/nemoclaw-start.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- test/nemoclaw-start.test.ts
Selective E2E Results — ❌ Some jobs failedRun: 26850629152
|
Selective E2E Results — ❌ Some jobs failedRun: 26855454078
|
|
✨ |
Selective E2E Results — ❌ Some jobs failedRun: 26909213317
|
Selective E2E Results — ❌ Some jobs failedRun: 26914147793
|
Selective E2E Results — ❌ Some jobs failedRun: 26914468544
|
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
Selective E2E Results — ❌ Some jobs failedRun: 26975910960
|
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 26976659667
|
Selective E2E Results — ✅ All requested jobs passedRun: 26976848310
|
Selective E2E Results — ✅ All requested jobs passedRun: 26977433313
|
## Summary
- Add the v0.0.59 release notes from the GitHub announcement discussion.
- Refresh local inference and credential-storage guidance for the
current release behavior.
- Regenerate the user skills from the updated Fern docs.
- Tighten release-prep and docs review guidance for generated skills, PR
labels, and shared `$$nemoclaw` command placeholders.
## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `rg "permissive mode|shields down|shields up|shields status|config
rotate-token|rotate-token" --glob '*.{md,mdx}'`
- `git diff --check`
- `npm run docs` (rerun outside sandbox after sandbox-only `tsx` IPC
permission failure)
- `npm run typecheck:cli`
- Pre-commit hooks during commit passed, including markdownlint,
docs-to-skills verification, gitleaks, commitlint, and skills YAML
tests.
## Source Summary
- #3679, #4437, #4681, #4766, #4772, #4775, #4786 ->
`docs/about/release-notes.mdx`, `docs/reference/commands.mdx`,
`docs/reference/troubleshooting.mdx`: Summarize OpenClaw 2026.5.27
compatibility, runtime path pinning, plugin registry recovery, live
gateway reconciliation, and clearer host-alias/startup diagnostics.
- #4332, #4402, #4769, #4776, #4779 -> `docs/about/release-notes.mdx`,
`docs/inference/inference-options.mdx`,
`docs/inference/use-local-inference.mdx`,
`docs/inference/switch-inference-providers.mdx`: Document the release
inference changes covering Local NIM waits, Hermes Anthropic routing,
Nemotron 3 Ultra, the current Ollama starter fallback, and Spark
managed-vLLM context length.
- #4628, #4652, #4733, #4745 -> `docs/about/release-notes.mdx`,
`docs/security/credential-storage.mdx`,
`docs/manage-sandboxes/messaging-channels.mdx`,
`docs/reference/troubleshooting.mdx`: Capture permission healing,
gateway-stored credential reuse, cross-sandbox messaging credential
conflict checks, and CDI preflight diagnostics.
- #4728, #4737, #4743, #4744, #4782 -> `.agents/skills/nemoclaw-user-*`:
Regenerate the user skill references from the updated source docs.
- Follow-up maintenance ->
`.agents/skills/nemoclaw-contributor-update-docs/SKILL.md`,
`.coderabbit.yaml`: Add release-prep area labels for docs and skills
PRs, and teach docs review guidance that `$$nemoclaw` is the correct
shared command placeholder for examples that work across agent aliases.
Note: the `documentation` label was not present in the repository, so
this PR is labeled with `v0.0.59` only.
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Documentation**
* Updated default model for local Ollama inference setup to qwen3.5:9b
* Added Nemotron 3 Ultra 550B as an NVIDIA Endpoints model option
* Clarified credential storage and reuse behavior for post-deployment
(day-two) operations
* Added v0.0.59 release notes covering OpenClaw compatibility, inference
options, Hermes messaging sync, and troubleshooting
* Clarified CLI selection guidance and updated OpenClaw version example
in status output
* Revised release-prep instructions and docs review guidance for CLI
alias usage
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
Temporary workaround for #2021. Under GPU sandbox onboard, OpenClaw's policy-change registry regen rebuilds
plugins[]from bundled extensions only, dropping path-origin (nemoclaw) and npm-origin (openclaw-weixin) entries from the runtime registry view. TheirinstallRecordssurvive on disk, but the runtime view forgets them — so the/nemoclawslash command is unreachable in the TUI andopenclaw plugins inspect nemoclawreturns "Plugin not found".Run
openclaw plugins registry --refreshas the sandbox user withHOME=/sandboxafter the gateway has started; the refresh repopulatesplugins[]frominstallRecords. Backgrounded so the gateway-wait loop is not blocked; PID is tracked inSANDBOX_CHILD_PIDSso SIGTERM still reaps it. Failure is non-fatal. Runs once per cold start and heals the initial post-start policy-regeneration regression. Later runtime policy mutations remain owned by the upstream OpenClaw fix tracked in openclaw/openclaw#89606.Related Issue
Fixes #2021.
Why this is temporary
The root cause is upstream in OpenClaw's
plugins registry --refresh(refreshReason: policy-changed) path: regen should rebuildplugins[]frominstallRecords ∪ bundled discovery, not bundled only. Tracking upstream as openclaw/openclaw#89606. Once that lands, this workaround block should be removed.Reproduction confirmed today
Byte-for-byte reproduction on a fresh Brev T4 GPU instance (n1-highcpu-4:nvidia-tesla-t4:1), matching @wangericnv's evidence on the 0690 Spark:
{ "refresh_reason": "policy-changed", "has_nemoclaw_install_record": true, "nemoclaw_in_plugins_array": 0, "has_weixin_install_record": true, "weixin_in_plugins_array": 0, "install_record_origins": [{"nemoclaw": "path"}, {"openclaw-weixin": "npm"}] }Verified
openclaw plugins registry --refreshas the sandbox user repopulates the runtime registry and restoresopenclaw nemoclaw --help+openclaw plugins listrecognition. Full evidence: #2021 (comment).Changes
scripts/nemoclaw-start.sh: post-start_auto_pairblock runs the refresh in a backgrounded subshell after the gateway becomes responsive; PID tracked inSANDBOX_CHILD_PIDSfor cleanup.test/nemoclaw-start-plugin-refresh.test.ts: behavior test that extracts the production block and drives it against a stubopenclawbinary — verifies the refresh fires, both the readiness probe and refresh run withHOME=/sandboxeven when the parent shell hasHOME=/root, the PID is captured + tracked, the gateway-status wait loop actually loops across multiple failures, and the installRecords-present/plugins-missing slash-router state is healed without enabling stale slash records.Type of Change
Verification
npx vitest run --project cli test/nemoclaw-start-plugin-refresh.test.ts test/sandbox-init.test.ts— 49 / 49 passnpm run source-shape:check— 0 source-shape cases (behavior-shaped test)bash -n scripts/nemoclaw-start.sh scripts/lib/sandbox-init.sh test/e2e/test-full-e2e.sh test/e2e-gateway-isolation.sh— syntax OKopenclaw nemoclaw --helpandopenclaw plugins listrecognition of nemoclaw + weixinSummary by CodeRabbit
Bug Fixes
Tests
Documentation
Latest E2E failure notes
The repeated
cloud-e2eandopenclaw-onboard-security-posture-e2efailures had two layers:HOME=/root. In root mode the probe could not see/sandbox/.openclawgateway state, skipped the refresh, and the plugin registry/slash-alias check failed with empty output because command logs were redirected inside the sandbox. The readiness probe now steps down withHOME=/sandbox.openclaw plugins inspect nemoclawpassed, proving the registry refresh healed the [All Platform] /nemoclaw slash command not working in sandbox OpenClaw TUI #2021 path, butopenclaw nemoclaw --helpexited non-zero due an unrelated OpenClaw config warning for a missingqqbotplugin entry. The E2E now treats only NemoClaw-specific missing-command/help failures as fatal while keeping inspect + runtime-slash manifest checks as the acceptance signal.