Skip to content

fix(sandbox): cap sandbox open file descriptors (nofile) at entrypoint#4944

Merged
cv merged 1 commit into
NVIDIA:mainfrom
TonyLuo-NV:fix/4527-cap-sandbox-nofile-ulimit
Jun 8, 2026
Merged

fix(sandbox): cap sandbox open file descriptors (nofile) at entrypoint#4944
cv merged 1 commit into
NVIDIA:mainfrom
TonyLuo-NV:fix/4527-cap-sandbox-nofile-ulimit

Conversation

@TonyLuo-NV

@TonyLuo-NV TonyLuo-NV commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Summary

Caps the sandbox open-file-descriptor limit (nofile) at the entrypoint so sandboxes no longer inherit the Docker daemon default RLIMIT_NOFILE (~1048576), which can exceed the host runtime limit and enable fd-exhaustion DoS. Restores the invariant that the sandbox is no more privileged than the launching process, hardening nofile the same way nproc is already hardened.

Related Issue

Closes #4527

Changes

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tony Luo xialuo@nvidia.com

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Feature

    • Entrypoints now apply process and open-file descriptor limits earlier during container startup.
  • Documentation

    • Added "Open File Descriptor Limits" and best-practices guidance, plus runtime examples for enforcing/overriding limits.
  • Tests

    • Expanded coverage to validate resource-limit hardening and startup ordering of the hardening step.

@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7f52b794-df2b-4dbf-83e7-13b83bf55203

📥 Commits

Reviewing files that changed from the base of the PR and between 68e7f84 and 04b2c75.

📒 Files selected for processing (5)
  • agents/hermes/start.sh
  • docs/deployment/sandbox-hardening.mdx
  • docs/security/best-practices.mdx
  • scripts/lib/sandbox-init.sh
  • scripts/nemoclaw-start.sh
💤 Files with no reviewable changes (5)
  • scripts/nemoclaw-start.sh
  • agents/hermes/start.sh
  • docs/deployment/sandbox-hardening.mdx
  • scripts/lib/sandbox-init.sh
  • docs/security/best-practices.mdx

📝 Walkthrough

Walkthrough

This PR centralizes RLIMIT hardening into harden_resource_limits() (sets nproc=512, nofile=65536), invokes it from both entrypoints during PID 1 startup before privilege drop, adds Vitest coverage validating behavior and warnings, and documents the nofile cap and runtime override options.

Changes

Sandbox Resource Limit Hardening

Layer / File(s) Summary
Sandbox resource hardening helper
scripts/lib/sandbox-init.sh
Introduces harden_resource_limits() to set soft/hard nproc (512) and nofile (65536) via ulimit with per-limit [SECURITY] warnings; executes best-effort without blocking startup.
Entrypoint integration
scripts/nemoclaw-start.sh, agents/hermes/start.sh
Both entrypoints now source sandbox-init.sh and call harden_resource_limits as PID 1 before dropping privileges, replacing prior inline ulimit nproc snippets.
Test coverage and verification
test/sandbox-init.test.ts
Adds tests verifying harden_resource_limits invokes ulimit in soft->hard order for nproc and nofile, emits four security warnings on failure, ensures entrypoints delegate before drop_capabilities, and updates the marker used to extract the sandbox-init source block.
Deployment and security documentation
docs/deployment/sandbox-hardening.mdx, docs/security/best-practices.mdx
Document the open file descriptor limit (65536), show --ulimit nofile=65536:65536 and Compose ulimits.nofile examples, explain best-effort behavior and DoS risk if the cap is relaxed, and add a reference to the related issue.
sequenceDiagram
  participant Nemoclaw as scripts/nemoclaw-start.sh
  participant Hermes as agents/hermes/start.sh
  participant Init as scripts/lib/sandbox-init.sh
  participant Harden as harden_resource_limits
  participant Ulimit as ulimit

  Nemoclaw->>Init: source sandbox-init.sh
  Nemoclaw->>Harden: call harden_resource_limits
  Harden->>Ulimit: set nproc soft/hard
  Harden->>Ulimit: set nofile soft/hard
  Harden-->>Nemoclaw: return (best-effort)

  Hermes->>Init: source sandbox-init.sh
  Hermes->>Harden: call harden_resource_limits
  Harden->>Ulimit: set nproc soft/hard
  Harden->>Ulimit: set nofile soft/hard
  Harden-->>Hermes: return (best-effort)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

fix, area: sandbox, area: security, security, v0.0.61

Suggested reviewers

  • laitingsheng
  • prekshivyas
  • cv

Poem

🐰 A helper born in sandbox dreams,
I set the limits, mend the seams,
No more runaway file fright,
Nproc and nofile set just right,
The sandbox naps through starry nights.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly addresses the main change: capping sandbox open file descriptors (nofile) at entrypoint, which is the primary objective of this PR.
Linked Issues check ✅ Passed The PR implements all key objectives from issue #4527: adds nofile hardening to sandbox startup scripts, implements the cap at entrypoint as root PID 1, provides tests validating the behavior, documents the feature with runtime override guidance, and addresses the security invariant that sandbox should not be more privileged than its parent process.
Out of Scope Changes check ✅ Passed All changes are directly scoped to implementing nofile hardening: the shared harden_resource_limits() helper, its integration into both entrypoints, comprehensive tests, and documentation updates are all necessary and on-target for resolving issue #4527.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
test/sandbox-init.test.ts (1)

451-473: ⚡ Quick win

Anchor ordering/source-block assertions to executable command lines, not free-text matches.

Current indexOf(...) checks can match comments, which can let regressions slip through or make tests fail on harmless comment edits.

Suggested tightening
- const hardenIdx = src.indexOf("harden_resource_limits");
- const dropIdx = src.indexOf("drop_capabilities");
+ const hardenMatch = src.match(/^\s*harden_resource_limits\s*$/m);
+ const dropMatch = src.match(/^\s*drop_capabilities\b.*$/m);
+ const hardenIdx = hardenMatch?.index ?? -1;
+ const dropIdx = dropMatch?.index ?? -1;

- const end = src.indexOf("# Harden RLIMITs", start);
+ const hardenCallFromStart = src.slice(start).match(/^\s*harden_resource_limits\s*$/m);
+ const end = hardenCallFromStart ? start + (hardenCallFromStart.index ?? 0) : -1;

Also applies to: 642-643

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/sandbox-init.test.ts` around lines 451 - 473, The tests that compute
hardenIdx/dropIdx using src.indexOf(...) can match commented text; update the
assertions to search for executable command lines instead of free-text matches
by locating whole-line, non-comment occurrences of "harden_resource_limits" and
"drop_capabilities" (for example by splitting src into lines and finding the
first line that matches /^\s*harden_resource_limits\b/ and
/^\s*drop_capabilities\b/ or using a multi-line regex that ignores lines
starting with #); then assert those line indices are present and that the harden
line appears before the drop line. Reference the test variables rel and the
symbols "harden_resource_limits" and "drop_capabilities" when making the change.
scripts/nemoclaw-start.sh (1)

70-73: Run the entrypoint E2E set for this PID 1 hardening path.

As per coding guidelines: for scripts/nemoclaw-start.sh, run sandbox-survival-e2e, sandbox-operations-e2e, cloud-e2e, and openclaw-slack-pairing-e2e because these changes are startup-path sensitive and unit tests may not surface runtime lifecycle regressions.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/nemoclaw-start.sh` around lines 70 - 73, This change hardens PID 1
startup via the harden_resource_limits call in scripts/nemoclaw-start.sh and
needs the entrypoint E2E test set run to catch lifecycle/startup regressions;
run sandbox-survival-e2e, sandbox-operations-e2e, cloud-e2e, and
openclaw-slack-pairing-e2e against the modified startup image/entrypoint (the
path exercising harden_resource_limits) and report any failures so they can be
fixed before merge.

Source: Coding guidelines

agents/hermes/start.sh (1)

31-33: Run Hermes E2Es for this startup-path hardening change.

As per coding guidelines: for agents/hermes/**, run hermes-e2e, hermes-inference-switch-e2e, hermes-discord-e2e, hermes-slack-e2e, rebuild-hermes-e2e, and rebuild-hermes-stale-base-e2e to validate onboarding, health, and routing behavior after entrypoint changes.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@agents/hermes/start.sh` around lines 31 - 33, Run the hermes E2E suites to
validate the startup-path change to harden_resource_limits in
agents/hermes/start.sh: execute hermes-e2e, hermes-inference-switch-e2e,
hermes-discord-e2e, hermes-slack-e2e, rebuild-hermes-e2e, and
rebuild-hermes-stale-base-e2e; confirm onboarding, health checks, and routing
behavior pass and inspect logs for any startup/permission failures—if any tests
fail, revert or adjust the harden_resource_limits invocation in start.sh and
re-run the failing suites.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@agents/hermes/start.sh`:
- Around line 31-33: Run the hermes E2E suites to validate the startup-path
change to harden_resource_limits in agents/hermes/start.sh: execute hermes-e2e,
hermes-inference-switch-e2e, hermes-discord-e2e, hermes-slack-e2e,
rebuild-hermes-e2e, and rebuild-hermes-stale-base-e2e; confirm onboarding,
health checks, and routing behavior pass and inspect logs for any
startup/permission failures—if any tests fail, revert or adjust the
harden_resource_limits invocation in start.sh and re-run the failing suites.

In `@scripts/nemoclaw-start.sh`:
- Around line 70-73: This change hardens PID 1 startup via the
harden_resource_limits call in scripts/nemoclaw-start.sh and needs the
entrypoint E2E test set run to catch lifecycle/startup regressions; run
sandbox-survival-e2e, sandbox-operations-e2e, cloud-e2e, and
openclaw-slack-pairing-e2e against the modified startup image/entrypoint (the
path exercising harden_resource_limits) and report any failures so they can be
fixed before merge.

In `@test/sandbox-init.test.ts`:
- Around line 451-473: The tests that compute hardenIdx/dropIdx using
src.indexOf(...) can match commented text; update the assertions to search for
executable command lines instead of free-text matches by locating whole-line,
non-comment occurrences of "harden_resource_limits" and "drop_capabilities" (for
example by splitting src into lines and finding the first line that matches
/^\s*harden_resource_limits\b/ and /^\s*drop_capabilities\b/ or using a
multi-line regex that ignores lines starting with #); then assert those line
indices are present and that the harden line appears before the drop line.
Reference the test variables rel and the symbols "harden_resource_limits" and
"drop_capabilities" when making the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f6b6f44d-99f0-4553-ba9c-3d079d2f4f99

📥 Commits

Reviewing files that changed from the base of the PR and between ec408c8 and 68e7f84.

📒 Files selected for processing (6)
  • agents/hermes/start.sh
  • docs/deployment/sandbox-hardening.mdx
  • docs/security/best-practices.mdx
  • scripts/lib/sandbox-init.sh
  • scripts/nemoclaw-start.sh
  • test/sandbox-init.test.ts

Sandboxes inherited the Docker daemon default RLIMIT_NOFILE (~1048576),
which can exceed the host runtime limit and enable fd-exhaustion DoS,
breaking the invariant that the sandbox is no more privileged than the
process that launched it.

Add a single shared harden_resource_limits() helper in
scripts/lib/sandbox-init.sh that caps both nproc (512, NVIDIA#809) and nofile
(65536, NVIDIA#4527), setting soft before hard, best-effort (warns but never
aborts startup under set -euo pipefail). Both entrypoints
(scripts/nemoclaw-start.sh for OpenClaw, agents/hermes/start.sh for
Hermes) now call the helper as root PID 1, before the capsh capability
drop and setpriv step-down, so the caps are inherited by every
descendant and cannot be raised by the unprivileged agent (raising a
hard RLIMIT requires CAP_SYS_RESOURCE, which it never holds).

Add test coverage (ordered ulimit-shim assertion, best-effort exit-0 +
warning assertions, entrypoint grep checks, and a harden-before-
drop_capabilities ordering invariant) and parallel nofile documentation
in the deployment hardening and security best-practices pages, including
the runtime `--ulimit nofile=N:N` override and the 'openshell sandbox
connect' residual gap (NVIDIA/OpenShell#1452).

Closes NVIDIA#4527

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Tony Luo <xialuo@nvidia.com>
@TonyLuo-NV TonyLuo-NV force-pushed the fix/4527-cap-sandbox-nofile-ulimit branch from 68e7f84 to 04b2c75 Compare June 8, 2026 09:45
@TonyLuo-NV TonyLuo-NV added the v0.0.61 Release target label Jun 8, 2026
@wscurran wscurran added area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery area: security Security controls, permissions, secrets, or hardening bug-fix PR fixes a bug or regression labels Jun 8, 2026
@cv cv merged commit ce74e7f into NVIDIA:main Jun 8, 2026
28 checks passed
miyoungc added a commit that referenced this pull request Jun 8, 2026
## Summary
- Add the v0.0.61 release notes from the GitHub dev announcement.
- Document managed vLLM recovery after host reboot and Slack
denied-mention feedback.
- Refresh generated `nemoclaw-user-*` skills from the source docs.

## Source summary
- #4983 -> `docs/about/release-notes.mdx`: Added the v0.0.61 release
summary from the dev announcement and linked behavior groups to deeper
docs.
- #4904 -> `docs/inference/use-local-inference.mdx`: Documented that
managed vLLM restarts the `nemoclaw-vllm` container after host reboot
during recovery.
- #4933 -> `docs/manage-sandboxes/messaging-channels.mdx`: Documented
Slack sender feedback for denied channel `@mention` events.
- #4879, #4915, #4935, #4759, #4164, #4888, #4897, #4944, #4959 ->
`.agents/skills/`: Refreshed generated user skills from the current
source docs for release prep.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `npm run docs` (passed outside the tool sandbox after `tsx` IPC pipe
creation was blocked in the sandbox)
- `npm run build:cli` (refreshed local `dist/` for the pre-push
TypeScript hook)
- Commit and pre-push hooks passed, including docs-to-skills
verification, markdownlint, gitleaks, skills YAML tests, and CLI
TypeScript.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Documentation**
  * Updated sandbox security documentation with file descriptor limits.
  * Changed default inference model for DGX Station profile.
  * Enhanced agent policy and backup/restore documentation.
  * Improved command reference examples with clearer formatting.
  * Clarified Slack messaging denial notice behavior.
  * Added automatic vLLM container recovery during host reboot.
  * Updated release notes for v0.0.61.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@TonyLuo-NV TonyLuo-NV deleted the fix/4527-cap-sandbox-nofile-ulimit branch June 9, 2026 02:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery area: security Security controls, permissions, secrets, or hardening bug-fix PR fixes a bug or regression v0.0.61 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[All Platforms][Security] Sandbox nofile ulimit not capped — inherits Docker daemon default (1M), exceeds host runtime limit

3 participants