fix(sandbox): preserve mutable OpenClaw config perms after doctor --fix (#4538)#4610
Conversation
…ix (NVIDIA#4538) OpenClaw's `openclaw doctor --fix` enforces a single-user 700/600 state layout, which silently tightens NemoClaw's mutable config contract: the gateway UID shares the `sandbox` group, so `/sandbox/.openclaw` must stay setgid + group-writable (2770) and `openclaw.json` group-writable (660), or gateway-side config writes (control-UI toggles, etc.) EACCES. - Add `shields.inspectMutableConfigPerms` / `repairMutableConfigPerms` (pure logic in `shields/mutable-config-perms.ts`, sandbox-bound wrappers in `shields/index.ts`). Repair reuses `unlockAgentConfig` and refuses to weaken an active shields-up lock. - Re-normalize mutable perms in `rebuild` right after its post-upgrade `openclaw doctor --fix` step. - Add a `Config permissions` check to `nemoclaw <name> doctor` that detects the 700/600 drift; `nemoclaw <name> doctor --fix` repairs it without a rebuild. `--fix` is rejected with `--json` so JSON automation cannot trigger a silent permission mutation. - Note in the startup normalizer and docs that a restart self-heals the drift, and document the supported gateway write probe (gosu / setpriv --init-groups, not su). Adds unit coverage for the contract predicates, inspect/repair orchestration, and the doctor check, plus a shell-level regression that a tree tightened to 700/600 is restored to 2770/660 on startup. Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
|
This repository limits contributors to 10 open pull requests. Please close or merge existing PRs before opening new ones. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (5)
✅ Files skipped from review due to trivial changes (4)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds a mutually-exclusive ChangesMutable Config Permission Detection and Repair
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (1)
docs/reference/troubleshooting.mdx (1)
870-870: ⚡ Quick winReplace colon with a period or em dash.
This line uses a colon to introduce an explanation rather than a list, which violates the formatting rule that colons should only introduce lists.
📝 Suggested fix
-Do not probe with `su -s /bin/sh gateway ...`: `su` does not initialize the gateway's supplementary groups the same way, so a group-write probe can spuriously report `EACCES` even when the mutable contract is intact. +Do not probe with `su -s /bin/sh gateway ...`. +`su` does not initialize the gateway's supplementary groups the same way, so a group-write probe can spuriously report `EACCES` even when the mutable contract is intact.As per coding guidelines: "Colons should only introduce a list. Flag colons used as general punctuation between clauses."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/reference/troubleshooting.mdx` at line 870, Replace the colon after the clause "Do not probe with `su -s /bin/sh gateway ...`" with a period or em dash so the sentence reads as a standalone statement (e.g., "Do not probe with `su -s /bin/sh gateway ...`. `su` does not initialize..." or use an em dash), ensuring the colon is not used as general punctuation in the troubleshooting note.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/actions/sandbox/doctor-config-perms.ts`:
- Around line 84-97: The failure detail string currently ignores after.reason
when inspect(sandboxName) throws and after.reason is set; update the
construction of the detail (the return block using before, after, repairResult)
to include after.reason as a fallback before falling back to "unknown" — e.g.,
when fixed is false and after.applies is true with no repairResult.errors,
append or use after.reason (from MutableConfigPermsInspection) so the message
becomes `repair incomplete: ${after.applies ? (after.issues.join("; ") ||
repairResult.errors.join("; ") || after.reason) : repairResult.errors.join("; ")
|| after.reason || "unknown"}` (adjust formatting to match code style),
referencing the inspect call, the after variable, and repairResult to locate the
change.
- Around line 33-38: The inspect(sandboxName) call currently swallows exceptions
and returns null; change the catch to capture the thrown error (e) and return a
diagnostic check object indicating a warning/failure with the error message so
the probe is surfaced instead of skipped—e.g., in the catch for
inspect(sandboxName) create and return a Check/Inspection result (matching the
surrounding return type used by doctor-config-perms) with status "warn" or
"fail" and include e.message (and e.stack if available) in the details; keep
using MutableConfigPermsInspection/inspect names to locate the code path to
modify.
In `@src/lib/actions/sandbox/rebuild.ts`:
- Around line 853-861: The current handling of
shields.repairMutableConfigPerms(sandboxName) logs skipped or incomplete repairs
but still allows the rebuild to succeed; change the control flow in the
surrounding rebuild logic (where permRepair is computed) so that if
permRepair.applied === false OR permRepair.verified === false you treat the
rebuild as failed/incomplete: surface an explicit error to the user (using
log/console.error) and return or set the overall rebuild result to an error
state instead of continuing to success. Update any result/exit code or the
function that returns the rebuild outcome so callers see a non-success status
when repairMutableConfigPerms does not verify successfully (refer to
permRepair.applied, permRepair.verified, and permRepair.errors and the
surrounding rebuild function that currently prints success).
In `@src/lib/shields/mutable-config-perms.ts`:
- Around line 61-77: The predicates are too permissive; instead of testing
selected bits they must validate the full normalized mode equals the exact
contract (dir: 02770 / octal 0o2770, file: 00660 / octal 0o0660). Normalize the
input with mode.padStart(4,'0') (or parse to an int with
Number.parseInt(mode.padStart(4,'0'),8)) and then compare against the exact
expected value in dirSatisfiesMutableContract and fileSatisfiesMutableContract
(use '2770'/'0660' or numeric 0o2770/0o0660) so only exact matches pass.
---
Nitpick comments:
In `@docs/reference/troubleshooting.mdx`:
- Line 870: Replace the colon after the clause "Do not probe with `su -s /bin/sh
gateway ...`" with a period or em dash so the sentence reads as a standalone
statement (e.g., "Do not probe with `su -s /bin/sh gateway ...`. `su` does not
initialize..." or use an em dash), ensuring the colon is not used as general
punctuation in the troubleshooting note.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d85c2253-5f6a-4178-87f8-ce03299953ca
📒 Files selected for processing (14)
docs/reference/commands.mdxdocs/reference/troubleshooting.mdxscripts/lib/sandbox-init.shscripts/nemoclaw-start.shsrc/commands/sandbox/doctor.tssrc/lib/actions/sandbox/doctor-config-perms.test.tssrc/lib/actions/sandbox/doctor-config-perms.tssrc/lib/actions/sandbox/doctor.tssrc/lib/actions/sandbox/rebuild.tssrc/lib/cli/public-display-defaults.tssrc/lib/shields/index.tssrc/lib/shields/mutable-config-perms.test.tssrc/lib/shields/mutable-config-perms.tstest/repro-2681-group-writable.test.ts
…utable-openclaw-perms # Conflicts: # src/lib/shields/index.ts
…#4538) - Validate the exact 2770/660 contract in the mutable-config predicates so owner-tightened (2670) and world-widened (2777/666) modes are flagged, not just selected group bits (CodeRabbit). - doctor: surface an unexpected inspection-probe exception as a warn check instead of silently dropping the check; preserve the re-inspection failure reason in the repair-incomplete detail (CodeRabbit). - rebuild: run the mutable-permission repair LAST in the OpenClaw post-restore sequence — after the WeChat seed helper, which atomically rewrites openclaw.json 0600 and would otherwise undo the just-verified repair (Codex). - rebuild: when the final result is incomplete, report a failed state restore and an unverified permission repair independently so neither recovery hint is lost when both occur (Codex). Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
|
Pushed merge with upstream/main (resolved the
@coderabbitai full review |
|
(´• ω •`) ♡ ✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/actions/sandbox/rebuild.ts`:
- Around line 886-899: Wrap the call to
shields.repairMutableConfigPerms(sandboxName) in a try/catch so unexpected
throws from validateName or getShieldsPosture (which may run
recoverExpiredAutoRestoreInline) do not abort the rebuild; on catch set
mutablePermsRepairUnverified = true and emit a warning/error log that includes
the caught error message/stack, then continue (do not rethrow). Keep existing
success/verified/verified=false handling for the normal permRepair return value
(repairMutableConfigPermsCore / unlockAgentConfig remain unchanged).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 6e1277ae-e963-4508-a222-d429394bc960
📒 Files selected for processing (10)
docs/reference/commands.mdxdocs/reference/troubleshooting.mdxscripts/nemoclaw-start.shsrc/lib/actions/sandbox/doctor-config-perms.test.tssrc/lib/actions/sandbox/doctor-config-perms.tssrc/lib/actions/sandbox/rebuild.tssrc/lib/shields/index.tssrc/lib/shields/mutable-config-perms.test.tssrc/lib/shields/mutable-config-perms.tstest/repro-2681-group-writable.test.ts
💤 Files with no reviewable changes (1)
- test/repro-2681-group-writable.test.ts
✅ Files skipped from review due to trivial changes (2)
- scripts/nemoclaw-start.sh
- docs/reference/commands.mdx
🚧 Files skipped from review as they are similar to previous changes (4)
- src/lib/shields/index.ts
- docs/reference/troubleshooting.mdx
- src/lib/shields/mutable-config-perms.ts
- src/lib/shields/mutable-config-perms.test.ts
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/reference/commands.mdx`:
- Around line 484-491: Update the doctor command synopsis block and flag table
so the example is copyable and accurately shows the flags are mutually
exclusive: replace the console code fence (with `$` prompt) with a bash code
fence containing "nemoclaw my-assistant doctor [--json | --fix]" and update the
`--fix` flag description to append "Mutually exclusive with `--json`"; ensure
this change targets the synopsis and the table entries for the doctor command
(the doctor command example and the `--json`/`--fix` flag rows).
In `@docs/reference/troubleshooting.mdx`:
- Around line 855-868: Change the two "console" fenced blocks to copyable shell
blocks by switching the fences from ```console to ```bash (or ```sh) and remove
the leading "$" prompt characters; specifically update the "nemoclaw <sandbox>
doctor --fix" block and the verification block that uses setpriv/gosu so the
commands are plain, copyable lines (e.g., use "nemoclaw <sandbox> doctor --fix",
"setpriv --reuid=... -- sh -c 'echo ok >> /sandbox/.openclaw/openclaw.json'",
and the gosu variant) and ensure the file path /sandbox/.openclaw/openclaw.json
and the utilities setpriv/gosu are preserved as shown.
In `@src/lib/actions/sandbox/rebuild.ts`:
- Around line 888-891: The current branch treats every permRepair.applied ===
false as a benign skip; update the handling in the rebuild flow so that after
calling repairMutableConfigPerms() you check permRepair.reason and only
log-and-continue for the explicit "shields-up" (intentional locked) reason,
while any other non-applied reasons should mark the rebuild as incomplete (e.g.,
call the existing downgrade/fail path or set the rebuild status to failure).
Concretely, in the block that inspects permRepair from
repairMutableConfigPerms(), replace the unconditional no-op/log with a
conditional that compares permRepair.reason to the explicit shields-up sentinel
used by repairMutableConfigPerms(), logging and continuing only for that
sentinel and invoking the downgrade/error path for all other reasons so the
mutable OpenClaw contract restoration is treated as incomplete.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ec644f24-c385-45ac-b652-41c5055e0eaf
📒 Files selected for processing (14)
docs/reference/commands.mdxdocs/reference/troubleshooting.mdxscripts/lib/sandbox-init.shscripts/nemoclaw-start.shsrc/commands/sandbox/doctor.tssrc/lib/actions/sandbox/doctor-config-perms.test.tssrc/lib/actions/sandbox/doctor-config-perms.tssrc/lib/actions/sandbox/doctor.tssrc/lib/actions/sandbox/rebuild.tssrc/lib/cli/public-display-defaults.tssrc/lib/shields/index.tssrc/lib/shields/mutable-config-perms.test.tssrc/lib/shields/mutable-config-perms.tstest/repro-2681-group-writable.test.ts
- rebuild: guard `shields.repairMutableConfigPerms()` in try/catch so an unexpected throw from the wrapper (validateName / getShieldsPosture inline auto-restore) cannot abort the rebuild — treat it as unverified and continue. - rebuild: distinguish skip reasons — a "locked" (shields-up) skip stays benign, but an "unreadable" shields-state skip means the contract may still be broken, so it now downgrades the rebuild result. Add a `skipReason` discriminant to MutableConfigRepairResult to avoid brittle reason-string matching. - docs: make the `doctor` synopsis copyable (`bash` fence, no `$`) and show `--json | --fix` are mutually exclusive; switch the troubleshooting repair and gateway-probe blocks to copyable `bash` fences. Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
|
Addressed all four review threads in commit
@coderabbitai review |
|
Triggering the review now. ✅ Actions performedReview triggered.
|
|
@yimoj thanks for the upstream merge + review fixes here. Heads-up: it's still showing |
…utable-openclaw-perms # Conflicts: # docs/reference/commands.mdx # src/lib/actions/sandbox/doctor.ts
|
@prekshivyas resolved. |
…utable-openclaw-perms # Conflicts: # src/commands/sandbox/doctor.ts
…4628 review) Address @cv's PR NVIDIA#4628 review feedback: the 700/600 heal was too generic. If a future caller routes a mutable-sandbox OpenClaw config dir (which uses the 2770/660 contract per NVIDIA#4538 / PR NVIDIA#4610) through `ensureConfigDir`, the walker would silently normalize it to 700/600 — exactly the NVIDIA#4538 EACCES regression. Add three predicates: - `hostNemoclawDir()` — canonical `${HOME}/.nemoclaw` - `isHostNemoclawRoot(dirPath)` — gate for the sibling sweep - `isMutableSandboxConfigPath(targetPath)` — gate for the file-level heal Wire them through: - `ensureConfigDir`: skip dir-mode chmod when path is mutable-sandbox. Run `healRootLevelFiles` only when path IS the host nemoclaw root. - `readConfigFile`: skip per-file chmod when path is mutable-sandbox. Tests: - 2 new scope-boundary tests (negative: arbitrary dir, no heal; positive: host nemoclaw root, heal fires). - Existing heal tests (multi-sibling + symlink positive control) updated to use the new `withHome` helper so they place files under `<HOME>/.nemoclaw` instead of an arbitrary tmp dir — preserves the NVIDIA#4546 acceptance while respecting the new scope boundary. 18/18 tests pass. Co-authored-by: kagura-agent <kagura.agent.ai@gmail.com> Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
|
@coderabbitai resume |
✅ Action performedReviews resumed. |
Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
NVIDIA#4538) Two acceptance gaps from NVIDIA#4538 not closed by the original PR: - Troubleshooting: contrast the three relevant perm states (mutable default 2770/660, shields-up locked 444/755 root, the 700/600 drift) so issue ask NVIDIA#3, "docs cover both mutable and locked-down", is actually answered. Prior copy only documented mutable. - mutable-config-perms: explain why NemoClaw uses 2770/660 vs the issue's expected 2775/664 (gateway shares the sandbox group, so the "other" bit is intentionally dropped). The predicate test already rejects 664; the rationale belonged in code. Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
…VIDIA#4538) Third acceptance gap from NVIDIA#4538 audit: inspect/repair asymmetry. `unlockAgentConfig` (the repair callback) re-permissions both `openclaw.json` AND the sensitive-file set (`.config-hash`, etc.). `inspectMutableConfigPerms` only stat'd dir + main config, so the user-facing `nemoclaw <name> doctor` reported "Config permissions OK" even when `.config-hash` had drifted to `644 root:root`. Running `--fix` would then silently mutate a file the check never named. - Add optional `sensitiveFiles` to `MutableConfigTarget`. - Iterate sensitive files in inspect, applying the same file contract; tolerate missing files (e.g. `.config-hash` before the first shields-up cycle). - `resolveAgentConfig` already populates `sensitiveFiles`, so the wrapper in `shields/index.ts` needs no change — structural typing carries the field through. - Two new tests: drift detection + missing-file tolerance. Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
cjagwani
left a comment
There was a problem hiding this comment.
Approving. Pushed two follow-ups to close the audit gaps: a docs paragraph on the three perm states (mutable 2770/660, shields-up 444/755 root, the 700/600 drift) plus a comment on why 660 not 664, and a fix so inspect actually catches .config-hash drift since --fix was silently mutating it. Two new tests cover the inspect side.
Verified on a brev T4 box: 2770/660 fresh, drifts to 700/600 after openclaw doctor --fix, comes back to 2770/660 after nemoclaw doctor --fix.
## Summary - Adds the `v0.0.60` section to `docs/about/release-notes.mdx` using the dev announcement from discussion #4877. - Fills the source-doc gaps found during release-prep review across inference, policy tiers, command behavior, security boundaries, Hermes dashboard/tooling, runtime context, and troubleshooting. - Refreshes generated agent skills under `.agents/skills/` from the current Fern docs output and upgrades Fern from `5.44.3` to `5.45.0`. ## Source summary - #4037 -> `docs/reference/architecture.mdx`, `docs/about/how-it-works.mdx`, `docs/about/release-notes.mdx`: Documents system-only runtime context that stays out of visible chat. - #4875 -> `docs/reference/architecture.mdx`, `docs/about/how-it-works.mdx`, `docs/about/release-notes.mdx`: Documents try-first sandbox network/filesystem guidance and clearer failure classification. - #4788 -> `docs/security/best-practices.mdx`, `docs/about/release-notes.mdx`: Documents shared OpenClaw device-approval policy for startup and connect. - #4768 -> `docs/reference/network-policies.mdx`, `docs/network-policy/integration-policy-examples.mdx`, `docs/get-started/quickstart.mdx`, `docs/get-started/quickstart-hermes.mdx`, `docs/reference/commands.mdx`: Documents `weather`, `public-reference`, and Hermes managed-tool gateway preset behavior. - #3788 and #4864 -> `docs/reference/network-policies.mdx`, `docs/reference/commands.mdx`: Documents non-interactive policy-tier fail-fast behavior and interactive prompt fallback. - #4756 and #4866 -> `docs/reference/commands.mdx`: Documents env-aware default sandbox resolution for `list`, `status`, and `tunnel` commands. - #4320 -> `docs/reference/commands.mdx`: Documents `$$nemoclaw tunnel status` behavior. - #4328 -> `docs/reference/commands.mdx`: Documents line-scoped policy preset descriptions in `policy-list`. - #4580 and #4748 -> `docs/reference/architecture.mdx`: Documents package-managed OpenShell gateway service and Docker-driver gateway-marker behavior. - #4598 -> `docs/manage-sandboxes/lifecycle.mdx`: Documents concurrent gateway/dashboard cleanup isolation by sandbox name and port. - #4777 -> `docs/reference/troubleshooting.mdx`: Documents Docker GPU patch rollback behavior. - #4610 -> `docs/reference/troubleshooting.mdx`, `docs/reference/commands.mdx`: Keeps mutable OpenClaw config permission guidance aligned and removes skipped experimental wording. - #4868 -> `docs/reference/commands.mdx`: Keeps `.dockerignore` handling for custom `onboard --from <Dockerfile>` contexts in generated skills. - #4870 -> `docs/reference/commands.mdx`, `docs/manage-sandboxes/runtime-controls.mdx`: Documents `NEMOCLAW_MINIMAL_BOOTSTRAP` and generated skill coverage. - #4641 -> `docs/inference/inference-options.mdx`, `docs/reference/troubleshooting.mdx`: Documents local NVIDIA NIM platform-digest pulls and served-model id adoption. - #4810 and #4867 -> `docs/inference/inference-options.mdx`: Documents stable NGC managed-vLLM image lineage and DGX Station DeepSeek V4 Flash coverage. - #4852 -> `docs/inference/use-local-inference.mdx`, `docs/reference/troubleshooting.mdx`: Documents Ollama model fit filtering, 16K context floor, cold-load retry, and failed-model exclusion. - #4847 -> `docs/inference/switch-inference-providers.mdx`: Documents API-family sync, Hermes `api_mode`, and Bedrock Runtime exception. - #4800 -> `docs/inference/tool-calling-reliability.mdx`: Documents Nemotron managed-inference native tool-search fallback. - #4333 -> `docs/inference/switch-inference-providers.mdx`: Documents interactive multimodal input prompting. - #4086 -> `docs/reference/troubleshooting.mdx`: Keeps proxy bypass normalization in generated troubleshooting coverage. - #4811 and #4855 -> `docs/get-started/quickstart-hermes.mdx`: Documents prebuilt Hermes dashboard assets and TUI recovery without runtime rebuilds. - #4854 -> `docs/inference/switch-inference-providers.mdx`, `docs/reference/commands.mdx`: Documents Hermes proxy API-key placeholder preservation during inference switches. - #4248 -> `docs/manage-sandboxes/messaging-channels.mdx`, `.agents/skills/`: Keeps messaging enrollment behavior aligned with manifest-hook implementation. - #4771 -> `docs/security/best-practices.mdx`, `docs/security/credential-storage.mdx`: Documents Hermes placeholder-only secret boundary for sandbox-visible runtime files. - #4787 -> `docs/security/best-practices.mdx`, `docs/about/release-notes.mdx`: Documents expanded memory scanner examples for OpenAI project keys and Slack app-level tokens. - #4848 -> `docs/reference/commands.mdx`: Documents OpenClaw skill install mirroring into the agent home directory. - #4790 -> `docs/about/release-notes.mdx`: Uses the prior release-prep structure and generated `.agents/skills/` refresh as the template for this release. ## Verification - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx` - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ skills/ --prefix nemoclaw-user --doc-platform fern-mdx --dry-run` - `npm run docs` - `git diff --check` - skip-term scan across `docs/`, `.agents/skills/`, and `skills/` - `npm run build:cli` - `npm run typecheck:cli` - Commit and pre-push hook suites, including markdownlint, gitleaks, env-var docs gate, docs-to-skills verification, and skills YAML tests <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes * **New Features** * DeepSeek-V4-Flash now available as default inference model for DGX Station. * Hermes dashboard improved with dedicated port and OAuth-authenticated tool gateway selection. * Added weather and public-reference policy presets for expanded agent capabilities. * Enhanced Ollama model selection with GPU memory filtering and automatic retry for timeouts. * **Bug Fixes** * Improved policy tier validation to prevent invalid configurations. * Better sandbox cleanup scoping by port to prevent conflicts across deployments. * Added GPU patch failure recovery with automatic rollback. * **Documentation** * Expanded troubleshooting guides for inference, security, and sandbox lifecycle. * Added .dockerignore best practices for custom deployments. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Carlos Villela <cvillela@nvidia.com>
…box doctor --fix (NVIDIA#4538) A raw connect-shell `openclaw doctor --fix` enforces OpenClaw's single-user 700/600 layout, tightening /sandbox/.openclaw and openclaw.json back from the NemoClaw mutable contract (2770 setgid + group-writable dir, 660 config). That blocks the gateway UID — a member of the sandbox group — from persisting config. PR NVIDIA#4610 only repaired this via host-side NemoClaw wrapper commands (`nemoclaw doctor --fix`, rebuild, startup); none run after the raw in-sandbox path the QA reporter used, so the gateway stayed broken until the next restart. Fix the missed path at the always-on in-sandbox guard: the emitted openclaw() shell function (sourced by every interactive/login sandbox shell) now re-asserts the contract after every routed openclaw command, regardless of exit code. - add _nemoclaw_restore_mutable_config_perms to the guard env; skips when shields are up (config dir root-owned), no-op via fast path when the contract already holds, and strips group-write from the read-only recovery baseline the recursive chmod could otherwise loosen - drop errexit around the guarded call (mirroring the devices-approve branch) so a nonzero `doctor --fix` exit cannot abort the guard before the restore runs — the nonzero-exit case is the exact reported failure - regression tests plus a gated real-image reporter-workflow E2E that drives raw `openclaw doctor --fix` and asserts 2770/660 is restored Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
…box doctor --fix (#4538) (#5017) ## Summary A raw connect-shell `openclaw doctor --fix` (run directly inside a mutable sandbox, outside any NemoClaw wrapper) enforces OpenClaw's single-user 700/600 layout, tightening `/sandbox/.openclaw` and `openclaw.json` back from the NemoClaw mutable contract (2770 setgid + group-writable dir, 660 config). That blocks the gateway UID — a member of the `sandbox` group — from persisting config. This fixes the missed path at the always-on in-sandbox `openclaw()` guard so the contract is restored after the raw command, even when `doctor` exits nonzero. ## Related Issue Fixes #4538 ## Changes - **Always-on in-sandbox restore.** The emitted `openclaw()` guard function (written into `/tmp/nemoclaw-proxy-env.sh`, sourced by every interactive/login sandbox shell) now calls a new `_nemoclaw_restore_mutable_config_perms` helper after every routed openclaw command, re-asserting `2770` dir / `660` config (+ `.config-hash`). PR #4610 only repaired this via host-side wrapper commands (`nemoclaw doctor --fix`, rebuild, startup), none of which run after the raw in-sandbox path the reporter used. - **Safe by construction.** The helper skips when shields are up (config dir root-owned, so the lock is never weakened), is a no-op fast path when the contract already holds, and strips group-write from the read-only recovery baseline that the recursive `chmod` could otherwise loosen in rootless mode. - **Survives nonzero exit.** Errexit is dropped around the guarded call (mirroring the existing `devices approve` branch) so a nonzero `doctor --fix` exit — the exact reported failure (EACCES on a root-locked shell init file) — cannot abort the guard before the restore runs. - **Tests.** Regression tests for the helper (restore, shields-up no-op, errexit survival, baseline protection) and a real-image reporter-workflow E2E (`test/repro-4538-raw-doctor-perms.test.ts`) that sources the actual emitted guard, runs raw `openclaw doctor --fix` in a NemoClaw sandbox image, and asserts the contract is restored. The E2E is gated behind `NEMOCLAW_RUN_DOCTOR_PERMS_DOCKER_E2E=1` so it runs where docker + the image are available. ## Type of Change - [x] Code change (feature, bug fix, or refactor) ## Verification - [x] `npm test` passes locally (the only failures were 14 pre-existing, load-induced 5s-timeout flakes in unrelated CLI `--help` oclif-dispatch tests, e.g. `test/cli/snapshot-shields.test.ts`; they do not touch this change). All CI checks are green: ShellCheck, static-checks, 5 cli-test-shards, plugin-tests, build-typecheck, CodeQL, macos/wsl-e2e, dco-check, codebase-growth-guardrails. - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed ### Reporter-workflow E2E evidence The exact reported workflow — connect to a mutable sandbox, source the always-on guard (as `/etc/bash.bashrc` does on connect), run the **raw in-sandbox `openclaw doctor --fix`** — was reproduced end-to-end against a real NemoClaw sandbox image (`nemoclaw-production:latest`), as the `sandbox` user. - **Test:** `test/repro-4538-raw-doctor-perms.test.ts` - **Env / command:** `NEMOCLAW_RUN_DOCTOR_PERMS_DOCKER_E2E=1 npx vitest run test/repro-4538-raw-doctor-perms.test.ts` ```text === BEFORE (baseline mutable contract) === /sandbox/.openclaw 2770 sandbox:sandbox /sandbox/.openclaw/openclaw.json 660 sandbox:sandbox === reporter cmd: raw `openclaw doctor --fix` === (without the fix this leaves: dir 700 sandbox:sandbox, openclaw.json 600 sandbox:sandbox) === AFTER (with the always-on openclaw() guard) === /sandbox/.openclaw 2770 sandbox:sandbox /sandbox/.openclaw/openclaw.json 660 sandbox:sandbox RESULT dir=2770 file=660 PASS: mutable contract (2770/660) restored after raw openclaw doctor --fix ``` The before-fix regression (dir→700, file→600) was also reproduced directly in the same image without the guard sourced, confirming the gap. Additional non-docker regression tests in the same file cover the restore helper, shields-up no-op, errexit survival on a nonzero `doctor --fix`, and recovery-baseline protection. --- Signed-off-by: Yimo Jiang <yimoj@nvidia.com> Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
## Summary The "Selective E2E Results" comment posted by `.github/workflows/nightly-e2e.yaml` bucketed job results into `passed` / `failed` / `skipped` but never accounted for `cancelled`. A job ending in `cancelled` (e.g. when `cancel-in-progress` kills a stale run) slipped past all three tallies and fell through to the default `"✅ All requested jobs passed"` status, with the summary line reading `"0 passed, 0 failed, 0 skipped"` — masking that the run produced no signal at all. ## Repro (in the wild) PR #5085 commit `026802ba8`: ```text ### Selective E2E Results — ✅ All requested jobs passed Run: 27305033204 Requested jobs: agent-turn-latency-e2e,cloud-inference-e2e Summary: 0 passed, 0 failed, 0 skipped | Job | Result | |------------------------|---------------| | agent-turn-latency-e2e |⚠️ cancelled | | cloud-inference-e2e |⚠️ cancelled | ``` Both requested jobs were cancelled (cancel-in-progress superseding the older run) yet the headline read green. Same pattern earlier on #4610. ## Root `.github/workflows/nightly-e2e.yaml:2486-2495` (pre-fix): ```js const passed = ran.filter(([, v]) => v.result === 'success'); const failed = ran.filter(([, v]) => v.result === 'failure'); const skipped = reportedEntries.filter(([, v]) => v.result === 'skipped'); const status = failed.length > 0 || missingRequested.length > 0 ? '❌ Some jobs failed' : skipped.length > 0 && passed.length === 0 ? '⚠️ No requested jobs ran' : '✅ All requested jobs passed'; ``` A `cancelled` job is `!== 'success'`, `!== 'failure'`, `!== 'skipped'` — falls through to the default. ## Changes - Add a `cancelled` bucket derived from `ran` the same way the others are. - Insert a status branch between the failure case and the no-ran case: when cancelled jobs are present and nothing passed, surface `⚠️ Run cancelled — no signal` instead of falsely claiming success. - Include cancelled count in the summary line so the bucket is visible even when the other states are zero. Successful, failed, and skipped runs continue to render exactly as before — only the cancelled case changes from "false green" to "honest yellow." cc @jyaunches (flagged this earlier in slack) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Tests** * Improved nightly e2e test reporting in PR comments to better distinguish cancelled jobs from other outcomes. PR comments now display cancelled job counts and provide clearer status messaging when test runs are cancelled. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
Summary
openclaw doctor --fixenforces a single-user700/600state layout, which silently breaks NemoClaw's mutable config contract: the gateway UID shares thesandboxgroup, so/sandbox/.openclawmust stay setgid + group-writable (2770) andopenclaw.jsongroup-writable (660), or gateway-side config writes (control-UI toggles, etc.)EACCES. This preserves that contract after NemoClaw's own rebuild repair step and adds a detect/repair path for sandboxes a user already tightened.Related Issue
Fixes #4538
Changes
shields.inspectMutableConfigPerms/repairMutableConfigPerms— pure contract logic insrc/lib/shields/mutable-config-perms.ts, sandbox-bound wrappers insrc/lib/shields/index.ts. Repair reusesunlockAgentConfigand refuses to weaken an active shields-up lock or a corrupt-state sandbox.rebuildre-normalizes mutable perms right after its post-upgradeopenclaw doctor --fixstep.nemoclaw <name> doctorgains aConfig permissionscheck that detects the700/600drift and warns;nemoclaw <name> doctor --fixrepairs it without a rebuild.--fixis rejected together with--jsonso JSON automation can't trigger a silent permission mutation.setpriv … --init-groups/gosu gateway, notsu).doctorcheck rendering, and a shell-level regression that a tree tightened to700/600is restored to2770/660on startup.Type of Change
Verification
npm testpasses (all failures isolated to the host umask027and the unrelated flaky CLI-dispatch tests; the affected files pass in isolation with plugin deps installed)src/lib/shields/,src/lib/actions/sandbox/doctor-config-perms.test.ts,test/repro-2681-group-writable.test.tsSigned-off-by: Yimo Jiang yimoj@nvidia.com
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests