fix(sandbox): cap sandbox open file descriptors (nofile) at entrypoint by TonyLuo-NV · Pull Request #4944 · NVIDIA/NemoClaw

TonyLuo-NV · 2026-06-08T09:28:33Z

Summary

Caps the sandbox open-file-descriptor limit (nofile) at the entrypoint so sandboxes no longer inherit the Docker daemon default RLIMIT_NOFILE (~1048576), which can exceed the host runtime limit and enable fd-exhaustion DoS. Restores the invariant that the sandbox is no more privileged than the launching process, hardening nofile the same way nproc is already hardened.

Related Issue

Closes #4527

Changes

Add a shared harden_resource_limits() helper in scripts/lib/sandbox-init.sh that caps nproc (512, [SECURITY] No process limit (ulimit -u unlimited) — fork bomb unprotected #809) and nofile (65536, [All Platforms][Security] Sandbox nofile ulimit not capped — inherits Docker daemon default (1M), exceeds host runtime limit #4527), soft-before-hard, best-effort (warns under set -euo pipefail, never aborts startup).
Replace the duplicated inline nproc-only blocks in both entrypoints (scripts/nemoclaw-start.sh for OpenClaw, agents/hermes/start.sh for Hermes) with a single call to the helper, kept at root PID 1 before the capsh capability drop and setpriv step-down so the caps are inherited by every descendant and are unraisable by the unprivileged agent (raising a hard RLIMIT needs CAP_SYS_RESOURCE, which it never holds).
Tests (test/sandbox-init.test.ts): ordered ulimit-shim assertion (-Su 512/-Hu 512/-Sn 65536/-Hn 65536), best-effort exit-0 + [SECURITY] warning assertions, entrypoint grep checks, and a harden-before-drop_capabilities ordering invariant.
Docs: parallel nofile coverage in docs/deployment/sandbox-hardening.mdx and docs/security/best-practices.mdx, including the runtime --ulimit nofile=N:N override and the honest openshell sandbox connect residual gap (feat(sandbox): support --cap-drop / capability scoping on sandbox create OpenShell#1452).

Type of Change

Code change (feature, bug fix, or refactor)
Code change with doc updates
Doc only (prose changes, no code sample modifications)
Doc only (includes code sample changes)

Verification

npx prek run --all-files passes
npm test passes
Tests added or updated for new or changed behavior
No secrets, API keys, or credentials committed
Docs updated for user-facing behavior changes
make docs builds without warnings (doc changes only)
Doc pages follow the style guide (doc changes only)
New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tony Luo xialuo@nvidia.com

🤖 Generated with Claude Code

Summary by CodeRabbit

New Feature
- Entrypoints now apply process and open-file descriptor limits earlier during container startup.
Documentation
- Added "Open File Descriptor Limits" and best-practices guidance, plus runtime examples for enforcing/overriding limits.
Tests
- Expanded coverage to validate resource-limit hardening and startup ordering of the hardening step.

coderabbitai · 2026-06-08T09:28:45Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7f52b794-df2b-4dbf-83e7-13b83bf55203

📥 Commits

Reviewing files that changed from the base of the PR and between 68e7f84 and 04b2c75.

📒 Files selected for processing (5)

agents/hermes/start.sh
docs/deployment/sandbox-hardening.mdx
docs/security/best-practices.mdx
scripts/lib/sandbox-init.sh
scripts/nemoclaw-start.sh

💤 Files with no reviewable changes (5)

scripts/nemoclaw-start.sh
agents/hermes/start.sh
docs/deployment/sandbox-hardening.mdx
scripts/lib/sandbox-init.sh
docs/security/best-practices.mdx

📝 Walkthrough

Walkthrough

This PR centralizes RLIMIT hardening into harden_resource_limits() (sets nproc=512, nofile=65536), invokes it from both entrypoints during PID 1 startup before privilege drop, adds Vitest coverage validating behavior and warnings, and documents the nofile cap and runtime override options.

Changes

Sandbox Resource Limit Hardening

Layer / File(s)	Summary
Sandbox resource hardening helper `scripts/lib/sandbox-init.sh`	Introduces `harden_resource_limits()` to set soft/hard `nproc` (512) and `nofile` (65536) via `ulimit` with per-limit `[SECURITY]` warnings; executes best-effort without blocking startup.
Entrypoint integration `scripts/nemoclaw-start.sh`, `agents/hermes/start.sh`	Both entrypoints now source `sandbox-init.sh` and call `harden_resource_limits` as PID 1 before dropping privileges, replacing prior inline `ulimit` nproc snippets.
Test coverage and verification `test/sandbox-init.test.ts`	Adds tests verifying `harden_resource_limits` invokes `ulimit` in soft->hard order for nproc and nofile, emits four security warnings on failure, ensures entrypoints delegate before `drop_capabilities`, and updates the marker used to extract the sandbox-init source block.
Deployment and security documentation `docs/deployment/sandbox-hardening.mdx`, `docs/security/best-practices.mdx`	Document the open file descriptor limit (65536), show `--ulimit nofile=65536:65536` and Compose `ulimits.nofile` examples, explain best-effort behavior and DoS risk if the cap is relaxed, and add a reference to the related issue.

sequenceDiagram
  participant Nemoclaw as scripts/nemoclaw-start.sh
  participant Hermes as agents/hermes/start.sh
  participant Init as scripts/lib/sandbox-init.sh
  participant Harden as harden_resource_limits
  participant Ulimit as ulimit

  Nemoclaw->>Init: source sandbox-init.sh
  Nemoclaw->>Harden: call harden_resource_limits
  Harden->>Ulimit: set nproc soft/hard
  Harden->>Ulimit: set nofile soft/hard
  Harden-->>Nemoclaw: return (best-effort)

  Hermes->>Init: source sandbox-init.sh
  Hermes->>Harden: call harden_resource_limits
  Harden->>Ulimit: set nproc soft/hard
  Harden->>Ulimit: set nofile soft/hard
  Harden-->>Hermes: return (best-effort)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

fix, area: sandbox, area: security, security, v0.0.61

Suggested reviewers

laitingsheng
prekshivyas
cv

Poem

🐰 A helper born in sandbox dreams,
I set the limits, mend the seams,
No more runaway file fright,
Nproc and nofile set just right,
The sandbox naps through starry nights.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly addresses the main change: capping sandbox open file descriptors (nofile) at entrypoint, which is the primary objective of this PR.
Linked Issues check	✅ Passed	The PR implements all key objectives from issue `#4527`: adds nofile hardening to sandbox startup scripts, implements the cap at entrypoint as root PID 1, provides tests validating the behavior, documents the feature with runtime override guidance, and addresses the security invariant that sandbox should not be more privileged than its parent process.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to implementing nofile hardening: the shared harden_resource_limits() helper, its integration into both entrypoints, comprehensive tests, and documentation updates are all necessary and on-target for resolving issue `#4527`.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (3)

test/sandbox-init.test.ts (1)

451-473: ⚡ Quick win

Anchor ordering/source-block assertions to executable command lines, not free-text matches.

Current indexOf(...) checks can match comments, which can let regressions slip through or make tests fail on harmless comment edits.

Suggested tightening

- const hardenIdx = src.indexOf("harden_resource_limits");
- const dropIdx = src.indexOf("drop_capabilities");
+ const hardenMatch = src.match(/^\s*harden_resource_limits\s*$/m);
+ const dropMatch = src.match(/^\s*drop_capabilities\b.*$/m);
+ const hardenIdx = hardenMatch?.index ?? -1;
+ const dropIdx = dropMatch?.index ?? -1;

- const end = src.indexOf("# Harden RLIMITs", start);
+ const hardenCallFromStart = src.slice(start).match(/^\s*harden_resource_limits\s*$/m);
+ const end = hardenCallFromStart ? start + (hardenCallFromStart.index ?? 0) : -1;

Also applies to: 642-643

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/sandbox-init.test.ts` around lines 451 - 473, The tests that compute
hardenIdx/dropIdx using src.indexOf(...) can match commented text; update the
assertions to search for executable command lines instead of free-text matches
by locating whole-line, non-comment occurrences of "harden_resource_limits" and
"drop_capabilities" (for example by splitting src into lines and finding the
first line that matches /^\s*harden_resource_limits\b/ and
/^\s*drop_capabilities\b/ or using a multi-line regex that ignores lines
starting with #); then assert those line indices are present and that the harden
line appears before the drop line. Reference the test variables rel and the
symbols "harden_resource_limits" and "drop_capabilities" when making the change.

scripts/nemoclaw-start.sh (1)

70-73: Run the entrypoint E2E set for this PID 1 hardening path.

As per coding guidelines: for scripts/nemoclaw-start.sh, run sandbox-survival-e2e, sandbox-operations-e2e, cloud-e2e, and openclaw-slack-pairing-e2e because these changes are startup-path sensitive and unit tests may not surface runtime lifecycle regressions.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/nemoclaw-start.sh` around lines 70 - 73, This change hardens PID 1
startup via the harden_resource_limits call in scripts/nemoclaw-start.sh and
needs the entrypoint E2E test set run to catch lifecycle/startup regressions;
run sandbox-survival-e2e, sandbox-operations-e2e, cloud-e2e, and
openclaw-slack-pairing-e2e against the modified startup image/entrypoint (the
path exercising harden_resource_limits) and report any failures so they can be
fixed before merge.
Source: Coding guidelines

agents/hermes/start.sh (1)

31-33: Run Hermes E2Es for this startup-path hardening change.

As per coding guidelines: for agents/hermes/**, run hermes-e2e, hermes-inference-switch-e2e, hermes-discord-e2e, hermes-slack-e2e, rebuild-hermes-e2e, and rebuild-hermes-stale-base-e2e to validate onboarding, health, and routing behavior after entrypoint changes.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@agents/hermes/start.sh` around lines 31 - 33, Run the hermes E2E suites to
validate the startup-path change to harden_resource_limits in
agents/hermes/start.sh: execute hermes-e2e, hermes-inference-switch-e2e,
hermes-discord-e2e, hermes-slack-e2e, rebuild-hermes-e2e, and
rebuild-hermes-stale-base-e2e; confirm onboarding, health checks, and routing
behavior pass and inspect logs for any startup/permission failures—if any tests
fail, revert or adjust the harden_resource_limits invocation in start.sh and
re-run the failing suites.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@agents/hermes/start.sh`:
- Around line 31-33: Run the hermes E2E suites to validate the startup-path
change to harden_resource_limits in agents/hermes/start.sh: execute hermes-e2e,
hermes-inference-switch-e2e, hermes-discord-e2e, hermes-slack-e2e,
rebuild-hermes-e2e, and rebuild-hermes-stale-base-e2e; confirm onboarding,
health checks, and routing behavior pass and inspect logs for any
startup/permission failures—if any tests fail, revert or adjust the
harden_resource_limits invocation in start.sh and re-run the failing suites.

In `@scripts/nemoclaw-start.sh`:
- Around line 70-73: This change hardens PID 1 startup via the
harden_resource_limits call in scripts/nemoclaw-start.sh and needs the
entrypoint E2E test set run to catch lifecycle/startup regressions; run
sandbox-survival-e2e, sandbox-operations-e2e, cloud-e2e, and
openclaw-slack-pairing-e2e against the modified startup image/entrypoint (the
path exercising harden_resource_limits) and report any failures so they can be
fixed before merge.

In `@test/sandbox-init.test.ts`:
- Around line 451-473: The tests that compute hardenIdx/dropIdx using
src.indexOf(...) can match commented text; update the assertions to search for
executable command lines instead of free-text matches by locating whole-line,
non-comment occurrences of "harden_resource_limits" and "drop_capabilities" (for
example by splitting src into lines and finding the first line that matches
/^\s*harden_resource_limits\b/ and /^\s*drop_capabilities\b/ or using a
multi-line regex that ignores lines starting with #); then assert those line
indices are present and that the harden line appears before the drop line.
Reference the test variables rel and the symbols "harden_resource_limits" and
"drop_capabilities" when making the change.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f6b6f44d-99f0-4553-ba9c-3d079d2f4f99

📥 Commits

Reviewing files that changed from the base of the PR and between ec408c8 and 68e7f84.

📒 Files selected for processing (6)

agents/hermes/start.sh
docs/deployment/sandbox-hardening.mdx
docs/security/best-practices.mdx
scripts/lib/sandbox-init.sh
scripts/nemoclaw-start.sh
test/sandbox-init.test.ts

Sandboxes inherited the Docker daemon default RLIMIT_NOFILE (~1048576), which can exceed the host runtime limit and enable fd-exhaustion DoS, breaking the invariant that the sandbox is no more privileged than the process that launched it. Add a single shared harden_resource_limits() helper in scripts/lib/sandbox-init.sh that caps both nproc (512, NVIDIA#809) and nofile (65536, NVIDIA#4527), setting soft before hard, best-effort (warns but never aborts startup under set -euo pipefail). Both entrypoints (scripts/nemoclaw-start.sh for OpenClaw, agents/hermes/start.sh for Hermes) now call the helper as root PID 1, before the capsh capability drop and setpriv step-down, so the caps are inherited by every descendant and cannot be raised by the unprivileged agent (raising a hard RLIMIT requires CAP_SYS_RESOURCE, which it never holds). Add test coverage (ordered ulimit-shim assertion, best-effort exit-0 + warning assertions, entrypoint grep checks, and a harden-before- drop_capabilities ordering invariant) and parallel nofile documentation in the deployment hardening and security best-practices pages, including the runtime `--ulimit nofile=N:N` override and the 'openshell sandbox connect' residual gap (NVIDIA/OpenShell#1452). Closes NVIDIA#4527 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Tony Luo <xialuo@nvidia.com>

wscurran · 2026-06-08T14:27:12Z

✨
Related open issues:

#4527 [All Platforms][Security] Sandbox nofile ulimit not capped — inherits Docker daemon default (1M), exceeds host runtime limit
#809 [SECURITY] No process limit (ulimit -u unlimited) — fork bomb unprotected
#1452 [Bug] Agent fabricates successful execution output after tool failure (exec: python not found)

## Summary - Add the v0.0.61 release notes from the GitHub dev announcement. - Document managed vLLM recovery after host reboot and Slack denied-mention feedback. - Refresh generated `nemoclaw-user-*` skills from the source docs. ## Source summary - #4983 -> `docs/about/release-notes.mdx`: Added the v0.0.61 release summary from the dev announcement and linked behavior groups to deeper docs. - #4904 -> `docs/inference/use-local-inference.mdx`: Documented that managed vLLM restarts the `nemoclaw-vllm` container after host reboot during recovery. - #4933 -> `docs/manage-sandboxes/messaging-channels.mdx`: Documented Slack sender feedback for denied channel `@mention` events. - #4879, #4915, #4935, #4759, #4164, #4888, #4897, #4944, #4959 -> `.agents/skills/`: Refreshed generated user skills from the current source docs for release prep. ## Verification - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx` - `npm run docs` (passed outside the tool sandbox after `tsx` IPC pipe creation was blocked in the sandbox) - `npm run build:cli` (refreshed local `dist/` for the pre-push TypeScript hook) - Commit and pre-push hooks passed, including docs-to-skills verification, markdownlint, gitleaks, skills YAML tests, and CLI TypeScript.  ## Summary by CodeRabbit * **Documentation** * Updated sandbox security documentation with file descriptor limits. * Changed default inference model for DGX Station profile. * Enhanced agent policy and backup/restore documentation. * Improved command reference examples with clearer formatting. * Clarified Slack messaging denial notice behavior. * Added automatic vLLM container recovery during host reboot. * Updated release notes for v0.0.61.

coderabbitai Bot reviewed Jun 8, 2026

View reviewed changes

TonyLuo-NV force-pushed the fix/4527-cap-sandbox-nofile-ulimit branch from 68e7f84 to 04b2c75 Compare June 8, 2026 09:45

TonyLuo-NV added the v0.0.61 Release target label Jun 8, 2026

wscurran added area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery area: security Security controls, permissions, secrets, or hardening bug-fix PR fixes a bug or regression labels Jun 8, 2026

cv approved these changes Jun 8, 2026

View reviewed changes

cv merged commit ce74e7f into NVIDIA:main Jun 8, 2026
28 checks passed

miyoungc mentioned this pull request Jun 8, 2026

docs: refresh v0.0.61 release docs #4992

Merged

TonyLuo-NV deleted the fix/4527-cap-sandbox-nofile-ulimit branch June 9, 2026 02:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(sandbox): cap sandbox open file descriptors (nofile) at entrypoint#4944

fix(sandbox): cap sandbox open file descriptors (nofile) at entrypoint#4944
cv merged 1 commit into
NVIDIA:mainfrom
TonyLuo-NV:fix/4527-cap-sandbox-nofile-ulimit

TonyLuo-NV commented Jun 8, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 8, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

wscurran commented Jun 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

TonyLuo-NV commented Jun 8, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Related Issue

Changes

Type of Change

Verification

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

wscurran commented Jun 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

TonyLuo-NV commented Jun 8, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 8, 2026 •

edited

Loading