fix(sandbox): cap sandbox open file descriptors (nofile) at entrypoint#4944
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (5)
💤 Files with no reviewable changes (5)
📝 WalkthroughWalkthroughThis PR centralizes RLIMIT hardening into ChangesSandbox Resource Limit Hardening
sequenceDiagram
participant Nemoclaw as scripts/nemoclaw-start.sh
participant Hermes as agents/hermes/start.sh
participant Init as scripts/lib/sandbox-init.sh
participant Harden as harden_resource_limits
participant Ulimit as ulimit
Nemoclaw->>Init: source sandbox-init.sh
Nemoclaw->>Harden: call harden_resource_limits
Harden->>Ulimit: set nproc soft/hard
Harden->>Ulimit: set nofile soft/hard
Harden-->>Nemoclaw: return (best-effort)
Hermes->>Init: source sandbox-init.sh
Hermes->>Harden: call harden_resource_limits
Harden->>Ulimit: set nproc soft/hard
Harden->>Ulimit: set nofile soft/hard
Harden-->>Hermes: return (best-effort)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (3)
test/sandbox-init.test.ts (1)
451-473: ⚡ Quick winAnchor ordering/source-block assertions to executable command lines, not free-text matches.
Current
indexOf(...)checks can match comments, which can let regressions slip through or make tests fail on harmless comment edits.Suggested tightening
- const hardenIdx = src.indexOf("harden_resource_limits"); - const dropIdx = src.indexOf("drop_capabilities"); + const hardenMatch = src.match(/^\s*harden_resource_limits\s*$/m); + const dropMatch = src.match(/^\s*drop_capabilities\b.*$/m); + const hardenIdx = hardenMatch?.index ?? -1; + const dropIdx = dropMatch?.index ?? -1; - const end = src.indexOf("# Harden RLIMITs", start); + const hardenCallFromStart = src.slice(start).match(/^\s*harden_resource_limits\s*$/m); + const end = hardenCallFromStart ? start + (hardenCallFromStart.index ?? 0) : -1;Also applies to: 642-643
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/sandbox-init.test.ts` around lines 451 - 473, The tests that compute hardenIdx/dropIdx using src.indexOf(...) can match commented text; update the assertions to search for executable command lines instead of free-text matches by locating whole-line, non-comment occurrences of "harden_resource_limits" and "drop_capabilities" (for example by splitting src into lines and finding the first line that matches /^\s*harden_resource_limits\b/ and /^\s*drop_capabilities\b/ or using a multi-line regex that ignores lines starting with #); then assert those line indices are present and that the harden line appears before the drop line. Reference the test variables rel and the symbols "harden_resource_limits" and "drop_capabilities" when making the change.scripts/nemoclaw-start.sh (1)
70-73: Run the entrypoint E2E set for this PID 1 hardening path.As per coding guidelines: for
scripts/nemoclaw-start.sh, runsandbox-survival-e2e,sandbox-operations-e2e,cloud-e2e, andopenclaw-slack-pairing-e2ebecause these changes are startup-path sensitive and unit tests may not surface runtime lifecycle regressions.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/nemoclaw-start.sh` around lines 70 - 73, This change hardens PID 1 startup via the harden_resource_limits call in scripts/nemoclaw-start.sh and needs the entrypoint E2E test set run to catch lifecycle/startup regressions; run sandbox-survival-e2e, sandbox-operations-e2e, cloud-e2e, and openclaw-slack-pairing-e2e against the modified startup image/entrypoint (the path exercising harden_resource_limits) and report any failures so they can be fixed before merge.Source: Coding guidelines
agents/hermes/start.sh (1)
31-33: Run Hermes E2Es for this startup-path hardening change.As per coding guidelines: for
agents/hermes/**, runhermes-e2e,hermes-inference-switch-e2e,hermes-discord-e2e,hermes-slack-e2e,rebuild-hermes-e2e, andrebuild-hermes-stale-base-e2eto validate onboarding, health, and routing behavior after entrypoint changes.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@agents/hermes/start.sh` around lines 31 - 33, Run the hermes E2E suites to validate the startup-path change to harden_resource_limits in agents/hermes/start.sh: execute hermes-e2e, hermes-inference-switch-e2e, hermes-discord-e2e, hermes-slack-e2e, rebuild-hermes-e2e, and rebuild-hermes-stale-base-e2e; confirm onboarding, health checks, and routing behavior pass and inspect logs for any startup/permission failures—if any tests fail, revert or adjust the harden_resource_limits invocation in start.sh and re-run the failing suites.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@agents/hermes/start.sh`:
- Around line 31-33: Run the hermes E2E suites to validate the startup-path
change to harden_resource_limits in agents/hermes/start.sh: execute hermes-e2e,
hermes-inference-switch-e2e, hermes-discord-e2e, hermes-slack-e2e,
rebuild-hermes-e2e, and rebuild-hermes-stale-base-e2e; confirm onboarding,
health checks, and routing behavior pass and inspect logs for any
startup/permission failures—if any tests fail, revert or adjust the
harden_resource_limits invocation in start.sh and re-run the failing suites.
In `@scripts/nemoclaw-start.sh`:
- Around line 70-73: This change hardens PID 1 startup via the
harden_resource_limits call in scripts/nemoclaw-start.sh and needs the
entrypoint E2E test set run to catch lifecycle/startup regressions; run
sandbox-survival-e2e, sandbox-operations-e2e, cloud-e2e, and
openclaw-slack-pairing-e2e against the modified startup image/entrypoint (the
path exercising harden_resource_limits) and report any failures so they can be
fixed before merge.
In `@test/sandbox-init.test.ts`:
- Around line 451-473: The tests that compute hardenIdx/dropIdx using
src.indexOf(...) can match commented text; update the assertions to search for
executable command lines instead of free-text matches by locating whole-line,
non-comment occurrences of "harden_resource_limits" and "drop_capabilities" (for
example by splitting src into lines and finding the first line that matches
/^\s*harden_resource_limits\b/ and /^\s*drop_capabilities\b/ or using a
multi-line regex that ignores lines starting with #); then assert those line
indices are present and that the harden line appears before the drop line.
Reference the test variables rel and the symbols "harden_resource_limits" and
"drop_capabilities" when making the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f6b6f44d-99f0-4553-ba9c-3d079d2f4f99
📒 Files selected for processing (6)
agents/hermes/start.shdocs/deployment/sandbox-hardening.mdxdocs/security/best-practices.mdxscripts/lib/sandbox-init.shscripts/nemoclaw-start.shtest/sandbox-init.test.ts
Sandboxes inherited the Docker daemon default RLIMIT_NOFILE (~1048576), which can exceed the host runtime limit and enable fd-exhaustion DoS, breaking the invariant that the sandbox is no more privileged than the process that launched it. Add a single shared harden_resource_limits() helper in scripts/lib/sandbox-init.sh that caps both nproc (512, NVIDIA#809) and nofile (65536, NVIDIA#4527), setting soft before hard, best-effort (warns but never aborts startup under set -euo pipefail). Both entrypoints (scripts/nemoclaw-start.sh for OpenClaw, agents/hermes/start.sh for Hermes) now call the helper as root PID 1, before the capsh capability drop and setpriv step-down, so the caps are inherited by every descendant and cannot be raised by the unprivileged agent (raising a hard RLIMIT requires CAP_SYS_RESOURCE, which it never holds). Add test coverage (ordered ulimit-shim assertion, best-effort exit-0 + warning assertions, entrypoint grep checks, and a harden-before- drop_capabilities ordering invariant) and parallel nofile documentation in the deployment hardening and security best-practices pages, including the runtime `--ulimit nofile=N:N` override and the 'openshell sandbox connect' residual gap (NVIDIA/OpenShell#1452). Closes NVIDIA#4527 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Tony Luo <xialuo@nvidia.com>
68e7f84 to
04b2c75
Compare
## Summary - Add the v0.0.61 release notes from the GitHub dev announcement. - Document managed vLLM recovery after host reboot and Slack denied-mention feedback. - Refresh generated `nemoclaw-user-*` skills from the source docs. ## Source summary - #4983 -> `docs/about/release-notes.mdx`: Added the v0.0.61 release summary from the dev announcement and linked behavior groups to deeper docs. - #4904 -> `docs/inference/use-local-inference.mdx`: Documented that managed vLLM restarts the `nemoclaw-vllm` container after host reboot during recovery. - #4933 -> `docs/manage-sandboxes/messaging-channels.mdx`: Documented Slack sender feedback for denied channel `@mention` events. - #4879, #4915, #4935, #4759, #4164, #4888, #4897, #4944, #4959 -> `.agents/skills/`: Refreshed generated user skills from the current source docs for release prep. ## Verification - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx` - `npm run docs` (passed outside the tool sandbox after `tsx` IPC pipe creation was blocked in the sandbox) - `npm run build:cli` (refreshed local `dist/` for the pre-push TypeScript hook) - Commit and pre-push hooks passed, including docs-to-skills verification, markdownlint, gitleaks, skills YAML tests, and CLI TypeScript. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Updated sandbox security documentation with file descriptor limits. * Changed default inference model for DGX Station profile. * Enhanced agent policy and backup/restore documentation. * Improved command reference examples with clearer formatting. * Clarified Slack messaging denial notice behavior. * Added automatic vLLM container recovery during host reboot. * Updated release notes for v0.0.61. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
Caps the sandbox open-file-descriptor limit (
nofile) at the entrypoint so sandboxes no longer inherit the Docker daemon defaultRLIMIT_NOFILE(~1048576), which can exceed the host runtime limit and enable fd-exhaustion DoS. Restores the invariant that the sandbox is no more privileged than the launching process, hardeningnofilethe same waynprocis already hardened.Related Issue
Closes #4527
Changes
harden_resource_limits()helper inscripts/lib/sandbox-init.shthat capsnproc(512, [SECURITY] No process limit (ulimit -u unlimited) — fork bomb unprotected #809) andnofile(65536, [All Platforms][Security] Sandbox nofile ulimit not capped — inherits Docker daemon default (1M), exceeds host runtime limit #4527), soft-before-hard, best-effort (warns underset -euo pipefail, never aborts startup).nproc-only blocks in both entrypoints (scripts/nemoclaw-start.shfor OpenClaw,agents/hermes/start.shfor Hermes) with a single call to the helper, kept at root PID 1 before the capsh capability drop and setpriv step-down so the caps are inherited by every descendant and are unraisable by the unprivileged agent (raising a hard RLIMIT needsCAP_SYS_RESOURCE, which it never holds).test/sandbox-init.test.ts): orderedulimit-shim assertion (-Su 512/-Hu 512/-Sn 65536/-Hn 65536), best-effort exit-0 +[SECURITY]warning assertions, entrypoint grep checks, and aharden-before-drop_capabilitiesordering invariant.nofilecoverage indocs/deployment/sandbox-hardening.mdxanddocs/security/best-practices.mdx, including the runtime--ulimit nofile=N:Noverride and the honestopenshell sandbox connectresidual gap (feat(sandbox): support --cap-drop / capability scoping on sandbox create OpenShell#1452).Type of Change
Verification
npx prek run --all-filespassesnpm testpassesmake docsbuilds without warnings (doc changes only)Signed-off-by: Tony Luo xialuo@nvidia.com
🤖 Generated with Claude Code
Summary by CodeRabbit
New Feature
Documentation
Tests