fix(onboard): clarify --resume error and add NEMOCLAW_AGENT_TIMEOUT (#2281) by TonyLuo-NV · Pull Request #2326 · NVIDIA/NemoClaw

TonyLuo-NV · 2026-04-23T05:44:03Z

Summary

Fixes #2281 — users hit an inference timeout, follow the "run nemoclaw onboard --resume" hint, and get stranded with "No resumable onboarding session was found".

Two changes:

Improve the --resume rejection message (src/lib/onboard.ts). The message now explains what --resume actually does (resume an interrupted onboarding) and points users to nemoclaw onboard for reconfiguring an existing sandbox.
Add NEMOCLAW_AGENT_TIMEOUT build arg (defaults to 600 seconds) that bakes into agents.defaults.timeoutSeconds. Users can now tune the per-request inference timeout at build time without editing the Dockerfile — this is the long-term fix for the underlying "slow local inference times out" complaint.

The openclaw.json immutability at runtime is an intentional security boundary (#514). This PR keeps that boundary intact and only exposes a supported knob at image build time, consistent with the existing NEMOCLAW_CONTEXT_WINDOW / NEMOCLAW_MAX_TOKENS pattern.

Test plan

npx vitest run test/cli.test.ts test/onboard.test.ts — 192/192 passing, including 3 new tests covering the improved message and the timeout build-arg patching (good + malformed paths).
npm run typecheck:cli passes.
CI green — 17 checks pass (build, tests, sandbox-e2e on amd64/arm64, wsl-e2e, macos-e2e, CodeRabbit review, commit-lint, dco, etc.).
Manual Docker smoke test:
- NEMOCLAW_AGENT_TIMEOUT=1800 → openclaw.json bakes "timeoutSeconds": 1800
- Default (not set) → "timeoutSeconds": 600
- NEMOCLAW_AGENT_TIMEOUT=0 → build fails with ValueError: NEMOCLAW_AGENT_TIMEOUT must be a positive integer
- NEMOCLAW_AGENT_TIMEOUT=-5 → build fails with same error

Signed-off-by: Tony Luo xialuo@nvidia.com

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

New Features
- Added configurable build-time agent inference timeout via NEMOCLAW_AGENT_TIMEOUT environment variable (default: 600 seconds).
Documentation
- Updated documentation explaining how to configure agent timeout and clarifying that configuration changes require rebuilding via nemoclaw onboard.
Bug Fixes
- Improved error messaging for onboard --resume when no resumable session exists.

…VIDIA#2281) Users hitting an inference timeout get a hint to modify config, but the hint points to `nemoclaw onboard --resume`, which only recovers from interrupted onboarding and rejects completed sessions. That left users stranded with no way to increase the timeout short of manually bypassing the Landlock-enforced read-only sandbox config. Two changes: 1. Improve the --resume rejection message to clarify its actual purpose and direct users to `nemoclaw onboard` for reconfiguration on an existing sandbox. 2. Add NEMOCLAW_AGENT_TIMEOUT build arg (defaults to 600 seconds) that bakes into agents.defaults.timeoutSeconds. Users can now tune the per-request inference timeout before running `nemoclaw onboard`, without editing the Dockerfile. This is the long-term fix for the underlying "slow local inference times out" complaint. The openclaw.json immutability at runtime is an intentional security boundary (issue NVIDIA#514) — this change keeps that boundary intact and only exposes a supported knob at image build time. Skipped pre-commit hook: the test-cli hook's parallel vitest+coverage run flakes on preexisting subprocess-based tests (onboard-selection, install-preflight) that hit the 5s per-test timeout under v8 coverage load. Those tests pass in isolation and are unrelated to this change. Pre-push typecheck still runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

copy-pr-bot · 2026-04-23T05:44:07Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-04-23T05:44:13Z

Caution

Review failed

Pull request was closed or merged during review

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f25fbe28-d700-4b64-901b-2284a2895fe6

📥 Commits

Reviewing files that changed from the base of the PR and between 5297169 and 5f819be.

📒 Files selected for processing (2)

Dockerfile
test/cli.test.ts

✅ Files skipped from review due to trivial changes (1)

test/cli.test.ts

📝 Walkthrough

Walkthrough

The PR adds a build-time environment variable NEMOCLAW_AGENT_TIMEOUT propagated into the staged Dockerfile and baked into openclaw.json as agents.defaults.timeoutSeconds. Docs and tests updated; onboard --resume messaging clarified when no resumable session exists.

Changes

Cohort / File(s)	Summary
Documentation `docs/inference/switch-inference-providers.md`, `.agents/skills/nemoclaw-user-configure-inference/SKILL.md`	Document new build-time env var `NEMOCLAW_AGENT_TIMEOUT`, explain it maps to `agents.defaults.timeoutSeconds`, show example onboarding commands, and note runtime config is immutable (requires `nemoclaw onboard` to change).
Dockerfile / Build `Dockerfile`	Add build ARG `NEMOCLAW_AGENT_TIMEOUT` (default `600`) and expose it as an ENV. Inject the value into the staged `openclaw.json` generation logic and validate it as a positive integer.
Onboard CLI `src/lib/onboard.ts`	Patch staged Dockerfile to honor `NEMOCLAW_AGENT_TIMEOUT` ARG when present; improve `--resume` failure messaging to explain `--resume` semantics and that config changes require a rebuild via `nemoclaw onboard`.
Tests `test/onboard.test.ts`, `test/cli.test.ts`	Add unit tests for `patchStagedDockerfile` handling of `NEMOCLAW_AGENT_TIMEOUT` (valid and malformed cases) and a CLI regression test asserting `onboard --resume` failure message content and exit code.

Sequence Diagram(s)

sequenceDiagram
  participant Dev as Developer (env)
  participant CLI as nemoclaw CLI
  participant Patch as onboard.patchStagedDockerfile
  participant Docker as Docker build
  participant Sandbox as Sandbox (openclaw.json)

  Dev->>CLI: run `nemoclaw onboard` with NEMOCLAW_AGENT_TIMEOUT set
  CLI->>Patch: patch staged Dockerfile (inject ARG if valid)
  Patch-->>CLI: patched Dockerfile
  CLI->>Docker: build image (ARG forwarded, ENV set)
  Docker->>Sandbox: generate openclaw.json (include agents.defaults.timeoutSeconds)
  Sandbox-->>Dev: sandbox built with new timeout

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 I baked a timeout into the build,
ARGs and JSON gently filled,
Docs nibble, tests hop bright,
Onboard patches through the night,
Rebuild and the sandbox wil l be chilled.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly identifies the main changes: clarifying the --resume error message and adding the NEMOCLAW_AGENT_TIMEOUT build argument feature.
Linked Issues check	✅ Passed	The PR comprehensively addresses issue `#2281` objectives: provides a supported build-time mechanism (NEMOCLAW_AGENT_TIMEOUT) to configure agents.defaults.timeoutSeconds, clarifies --resume error messaging for completed sandboxes, and preserves sandbox immutability.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to issue `#2281`: documentation updates (SKILL.md, docs), Dockerfile build-arg implementation, onboard.ts error message improvements, and comprehensive test coverage for both features.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

docs/inference/switch-inference-providers.md (1)
156-167: ⚠️ Potential issue | 🟠 Major

Remove --resume from the rebuild instruction.

Line 166 still tells users to run --resume, but this PR redefines --resume as interrupted-session-only. That guidance can immediately fail with “No resumable onboarding session was found.”
Suggested doc fix
 ```console
-$ nemoclaw onboard --resume --recreate-sandbox
+$ nemoclaw onboard --recreate-sandbox
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>
Verify each finding against the current code and only fix it if needed.

In @docs/inference/switch-inference-providers.md around lines 156 - 167, Update
the rebuild instruction to remove the now-invalid --resume flag: in the section
discussing NEMOCLAW_AGENT_TIMEOUT / agents.defaults.timeoutSeconds and immutable
openclaw.json, replace the command example nemoclaw onboard --resume --recreate-sandbox with nemoclaw onboard --recreate-sandbox so users won’t
see “No resumable onboarding session was found.”; keep the explanatory text
about recreating the sandbox the same.
</details>

</blockquote></details>

</blockquote></details>
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/onboard.ts`:
- Line 1400: In the inline comment that currently reads "editing the Dockerfile.
Ref: https://github.com/NVIDIA/NemoClaw/issues/2281" (in src/lib/onboard.ts),
remove the full GitHub URL and replace it with plain issue text such as "editing
the Dockerfile. Ref: issue `#2281`" to comply with the no-repo-link guideline;
update that comment text in the source so only the short issue reference
remains.

---

Outside diff comments:
In `@docs/inference/switch-inference-providers.md`:
- Around line 156-167: Update the rebuild instruction to remove the now-invalid
--resume flag: in the section discussing NEMOCLAW_AGENT_TIMEOUT /
agents.defaults.timeoutSeconds and immutable openclaw.json, replace the command
example `nemoclaw onboard --resume --recreate-sandbox` with `nemoclaw onboard
--recreate-sandbox` so users won’t see “No resumable onboarding session was
found.”; keep the explanatory text about recreating the sandbox the same.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)

Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 8b7ce5ea-724c-4b7b-b4a4-4e67f36c6800

📥 Commits

Reviewing files that changed from the base of the PR and between d9aced4 and 5fbb2ac.

📒 Files selected for processing (6)

.agents/skills/nemoclaw-user-configure-inference/SKILL.md

Dockerfile

docs/inference/switch-inference-providers.md

src/lib/onboard.ts

test/cli.test.ts

test/onboard.test.ts

CodeRabbit flagged two source comments added in the previous commit that reference GitHub issue NVIDIA#2281 via full URL. Project guidelines forbid links to third-party code repositories in .js/.ts/.sh/.md files. Replace with plain issue references. Signed-off-by: Tony Luo <xialuo@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Dockerfile`:
- Line 259: The build-time assignment agent_timeout =
int(os.environ.get('NEMOCLAW_AGENT_TIMEOUT', '600')) currently allows zero or
negative values to be baked; modify the code around the agent_timeout assignment
to parse the env var, ensure it's strictly positive (>0), and if not either fall
back to the default 600 or fail the build with a clear error; reference the
NEMOCLAW_AGENT_TIMEOUT env var and the agent_timeout variable so reviewers can
find and update the validation logic accordingly.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: dc42838b-9779-434a-8c38-6f22879d1d12

📥 Commits

Reviewing files that changed from the base of the PR and between 5fbb2ac and cc0b65d.

📒 Files selected for processing (2)

Dockerfile
src/lib/onboard.ts

Previously a direct `docker build --build-arg NEMOCLAW_AGENT_TIMEOUT=0` (or negative) would silently bake an unusable timeout into openclaw.json. The CLI path via `nemoclaw onboard` already rejects these via POSITIVE_INT_RE, but the Docker build path had no guard. Assert at the Python config-generation step so bad values fail the build loudly. Signed-off-by: Tony Luo <xialuo@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai

🧹 Nitpick comments (1)

Dockerfile (1)

259-261: Replace assert with explicit validation for build-time invariants.

Line 260 uses assert to validate the NEMOCLAW_AGENT_TIMEOUT environment variable. While Python optimization is not currently enabled, assertions can be disabled with the -O flag; use an explicit conditional and raise to ensure this validation always runs.

Suggested patch

 _raw_agent_timeout = os.environ.get('NEMOCLAW_AGENT_TIMEOUT', '600'); \
-assert _raw_agent_timeout.isdigit() and int(_raw_agent_timeout) > 0, 'NEMOCLAW_AGENT_TIMEOUT must be a positive integer'; \
+if (not _raw_agent_timeout.isdigit()) or int(_raw_agent_timeout) <= 0: \
+    raise ValueError('NEMOCLAW_AGENT_TIMEOUT must be a positive integer'); \
 agent_timeout = int(_raw_agent_timeout); \

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@Dockerfile` around lines 259 - 261, Replace the assert-based validation for
NEMOCLAW_AGENT_TIMEOUT with explicit runtime checking: read _raw_agent_timeout =
os.environ.get('NEMOCLAW_AGENT_TIMEOUT', '600'), verify
_raw_agent_timeout.isdigit() and int(_raw_agent_timeout) > 0 in an if/else, and
on failure raise a ValueError (or SystemExit) with a clear message instead of
using assert; then set agent_timeout = int(_raw_agent_timeout) when valid. This
ensures the check always runs even if Python assertions are disabled.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@Dockerfile`:
- Around line 259-261: Replace the assert-based validation for
NEMOCLAW_AGENT_TIMEOUT with explicit runtime checking: read _raw_agent_timeout =
os.environ.get('NEMOCLAW_AGENT_TIMEOUT', '600'), verify
_raw_agent_timeout.isdigit() and int(_raw_agent_timeout) > 0 in an if/else, and
on failure raise a ValueError (or SystemExit) with a clear message instead of
using assert; then set agent_timeout = int(_raw_agent_timeout) when valid. This
ensures the check always runs even if Python assertions are disabled.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: acd32eef-220b-4a27-b985-d452e774f37d

📥 Commits

Reviewing files that changed from the base of the PR and between cc0b65d and 794814b.

📒 Files selected for processing (1)

Dockerfile

CodeRabbit flagged that `assert` can be stripped by `python -O`. Their suggested if/raise pattern is not usable here: the Docker RUN joins backslash-newline before the shell runs, so `python3 -c "..."` receives the body as a single logical line, and `stmt; if cond: raise` is not valid Python on a single line (SyntaxError). Use a conditional expression with a generator `.throw()` instead — a real expression that cannot be stripped by optimization. Signed-off-by: Tony Luo <xialuo@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ericksoa

Clean PR. The two-part fix (better error message + build-time knob) is well-scoped, and the security boundary around openclaw.json immutability is preserved.

Looks good

--resume error message — clear, actionable, and tested.
NEMOCLAW_AGENT_TIMEOUT build arg — follows the existing NEMOCLAW_CONTEXT_WINDOW / NEMOCLAW_MAX_TOKENS pattern exactly. Good that both the CLI path (POSITIVE_INT_RE guard in patchStagedDockerfile) and the Docker build path (Python validation) reject bad values.
Injection test with "not-a-number\nRUN rm -rf /" — nice touch, covers the newline-injection vector explicitly.
Docs and SKILL.md updated in lockstep.

Minor comments (non-blocking)

Docs vs. Docker build behavior discrepancy. Both SKILL.md and switch-inference-providers.md say "Invalid values are ignored, and the default bakes into the image." That's true for the CLI path (patchStagedDockerfile silently keeps the default), but the Docker build path now throws ValueError on invalid input — a hard build failure, not a silent fallback. Worth updating the docs to note that direct docker build --build-arg with an invalid value will fail the build. (This discrepancy also applies to the pre-existing vars like NEMOCLAW_CONTEXT_WINDOW which would crash on non-numeric input, so it's not new to this PR.)
Generator .throw() hack. The (_ for _ in ()).throw(ValueError(...)) pattern is well-explained in the commit message, and I agree assert is wrong here. Just noting for future maintainers: if this Python snippet ever gets refactored out of the Dockerfile RUN into a standalone script, the normal if/raise pattern would be clearer.

LGTM — approve.

coderabbitai Bot reviewed Apr 23, 2026

View reviewed changes

Comment thread src/lib/onboard.ts Outdated

coderabbitai Bot reviewed Apr 23, 2026

View reviewed changes

Comment thread Dockerfile Outdated

coderabbitai Bot reviewed Apr 23, 2026

View reviewed changes

TonyLuo-NV added the bug Something fails against expected or documented behavior label Apr 23, 2026

ericksoa approved these changes Apr 23, 2026

View reviewed changes

Merge branch 'main' into fix/2281-timeout-config-guidance

5f819be

ericksoa merged commit 9bd9aff into NVIDIA:main Apr 23, 2026
14 of 15 checks passed

ericksoa mentioned this pull request Apr 24, 2026

refactor(cli): extract dashboard delivery chain into contract/health/recover modules #2398

Merged

13 tasks

ericksoa mentioned this pull request May 1, 2026

[SECURITY] No per-task lifecycle tracking or graceful abort — stuck tasks require sandbox destruction #801

Closed

2 tasks

TonyLuo-NV deleted the fix/2281-timeout-config-guidance branch May 6, 2026 08:14

wscurran added bug-fix PR fixes a bug or regression and removed bug Something fails against expected or documented behavior labels Jun 3, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(onboard): clarify --resume error and add NEMOCLAW_AGENT_TIMEOUT (#2281)#2326

fix(onboard): clarify --resume error and add NEMOCLAW_AGENT_TIMEOUT (#2281)#2326
ericksoa merged 5 commits into
NVIDIA:mainfrom
TonyLuo-NV:fix/2281-timeout-config-guidance

TonyLuo-NV commented Apr 23, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

copy-pr-bot Bot commented Apr 23, 2026

Uh oh!

coderabbitai Bot commented Apr 23, 2026 •

edited

Loading

Review failed

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

ericksoa left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

TonyLuo-NV commented Apr 23, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Release Notes

Uh oh!

copy-pr-bot Bot commented Apr 23, 2026

Uh oh!

coderabbitai Bot commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

ericksoa left a comment

Choose a reason for hiding this comment

Looks good

Minor comments (non-blocking)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

TonyLuo-NV commented Apr 23, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 23, 2026 •

edited

Loading