Skip to content

fix(onboard): clarify --resume error and add NEMOCLAW_AGENT_TIMEOUT (#2281)#2326

Merged
ericksoa merged 5 commits into
NVIDIA:mainfrom
TonyLuo-NV:fix/2281-timeout-config-guidance
Apr 23, 2026
Merged

fix(onboard): clarify --resume error and add NEMOCLAW_AGENT_TIMEOUT (#2281)#2326
ericksoa merged 5 commits into
NVIDIA:mainfrom
TonyLuo-NV:fix/2281-timeout-config-guidance

Conversation

@TonyLuo-NV

@TonyLuo-NV TonyLuo-NV commented Apr 23, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes #2281 — users hit an inference timeout, follow the "run nemoclaw onboard --resume" hint, and get stranded with "No resumable onboarding session was found".

Two changes:

  1. Improve the --resume rejection message (src/lib/onboard.ts). The message now explains what --resume actually does (resume an interrupted onboarding) and points users to nemoclaw onboard for reconfiguring an existing sandbox.

  2. Add NEMOCLAW_AGENT_TIMEOUT build arg (defaults to 600 seconds) that bakes into agents.defaults.timeoutSeconds. Users can now tune the per-request inference timeout at build time without editing the Dockerfile — this is the long-term fix for the underlying "slow local inference times out" complaint.

The openclaw.json immutability at runtime is an intentional security boundary (#514). This PR keeps that boundary intact and only exposes a supported knob at image build time, consistent with the existing NEMOCLAW_CONTEXT_WINDOW / NEMOCLAW_MAX_TOKENS pattern.

Test plan

  • npx vitest run test/cli.test.ts test/onboard.test.ts — 192/192 passing, including 3 new tests covering the improved message and the timeout build-arg patching (good + malformed paths).
  • npm run typecheck:cli passes.
  • CI green — 17 checks pass (build, tests, sandbox-e2e on amd64/arm64, wsl-e2e, macos-e2e, CodeRabbit review, commit-lint, dco, etc.).
  • Manual Docker smoke test:
    • NEMOCLAW_AGENT_TIMEOUT=1800openclaw.json bakes "timeoutSeconds": 1800
    • Default (not set) → "timeoutSeconds": 600
    • NEMOCLAW_AGENT_TIMEOUT=0 → build fails with ValueError: NEMOCLAW_AGENT_TIMEOUT must be a positive integer
    • NEMOCLAW_AGENT_TIMEOUT=-5 → build fails with same error

Signed-off-by: Tony Luo xialuo@nvidia.com

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Added configurable build-time agent inference timeout via NEMOCLAW_AGENT_TIMEOUT environment variable (default: 600 seconds).
  • Documentation

    • Updated documentation explaining how to configure agent timeout and clarifying that configuration changes require rebuilding via nemoclaw onboard.
  • Bug Fixes

    • Improved error messaging for onboard --resume when no resumable session exists.

…VIDIA#2281)

Users hitting an inference timeout get a hint to modify config, but the
hint points to `nemoclaw onboard --resume`, which only recovers from
interrupted onboarding and rejects completed sessions. That left users
stranded with no way to increase the timeout short of manually bypassing
the Landlock-enforced read-only sandbox config.

Two changes:

1. Improve the --resume rejection message to clarify its actual purpose
   and direct users to `nemoclaw onboard` for reconfiguration on an
   existing sandbox.

2. Add NEMOCLAW_AGENT_TIMEOUT build arg (defaults to 600 seconds) that
   bakes into agents.defaults.timeoutSeconds. Users can now tune the
   per-request inference timeout before running `nemoclaw onboard`,
   without editing the Dockerfile. This is the long-term fix for the
   underlying "slow local inference times out" complaint.

The openclaw.json immutability at runtime is an intentional security
boundary (issue NVIDIA#514) — this change keeps that boundary intact and only
exposes a supported knob at image build time.

Skipped pre-commit hook: the test-cli hook's parallel vitest+coverage
run flakes on preexisting subprocess-based tests (onboard-selection,
install-preflight) that hit the 5s per-test timeout under v8 coverage
load. Those tests pass in isolation and are unrelated to this change.
Pre-push typecheck still runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@copy-pr-bot

copy-pr-bot Bot commented Apr 23, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Apr 23, 2026

Copy link
Copy Markdown
Contributor

Caution

Review failed

Pull request was closed or merged during review

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f25fbe28-d700-4b64-901b-2284a2895fe6

📥 Commits

Reviewing files that changed from the base of the PR and between 5297169 and 5f819be.

📒 Files selected for processing (2)
  • Dockerfile
  • test/cli.test.ts
✅ Files skipped from review due to trivial changes (1)
  • test/cli.test.ts

📝 Walkthrough

Walkthrough

The PR adds a build-time environment variable NEMOCLAW_AGENT_TIMEOUT propagated into the staged Dockerfile and baked into openclaw.json as agents.defaults.timeoutSeconds. Docs and tests updated; onboard --resume messaging clarified when no resumable session exists.

Changes

Cohort / File(s) Summary
Documentation
docs/inference/switch-inference-providers.md, .agents/skills/nemoclaw-user-configure-inference/SKILL.md
Document new build-time env var NEMOCLAW_AGENT_TIMEOUT, explain it maps to agents.defaults.timeoutSeconds, show example onboarding commands, and note runtime config is immutable (requires nemoclaw onboard to change).
Dockerfile / Build
Dockerfile
Add build ARG NEMOCLAW_AGENT_TIMEOUT (default 600) and expose it as an ENV. Inject the value into the staged openclaw.json generation logic and validate it as a positive integer.
Onboard CLI
src/lib/onboard.ts
Patch staged Dockerfile to honor NEMOCLAW_AGENT_TIMEOUT ARG when present; improve --resume failure messaging to explain --resume semantics and that config changes require a rebuild via nemoclaw onboard.
Tests
test/onboard.test.ts, test/cli.test.ts
Add unit tests for patchStagedDockerfile handling of NEMOCLAW_AGENT_TIMEOUT (valid and malformed cases) and a CLI regression test asserting onboard --resume failure message content and exit code.

Sequence Diagram(s)

sequenceDiagram
  participant Dev as Developer (env)
  participant CLI as nemoclaw CLI
  participant Patch as onboard.patchStagedDockerfile
  participant Docker as Docker build
  participant Sandbox as Sandbox (openclaw.json)

  Dev->>CLI: run `nemoclaw onboard` with NEMOCLAW_AGENT_TIMEOUT set
  CLI->>Patch: patch staged Dockerfile (inject ARG if valid)
  Patch-->>CLI: patched Dockerfile
  CLI->>Docker: build image (ARG forwarded, ENV set)
  Docker->>Sandbox: generate openclaw.json (include agents.defaults.timeoutSeconds)
  Sandbox-->>Dev: sandbox built with new timeout
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 I baked a timeout into the build,
ARGs and JSON gently filled,
Docs nibble, tests hop bright,
Onboard patches through the night,
Rebuild and the sandbox wil l be chilled.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly identifies the main changes: clarifying the --resume error message and adding the NEMOCLAW_AGENT_TIMEOUT build argument feature.
Linked Issues check ✅ Passed The PR comprehensively addresses issue #2281 objectives: provides a supported build-time mechanism (NEMOCLAW_AGENT_TIMEOUT) to configure agents.defaults.timeoutSeconds, clarifies --resume error messaging for completed sandboxes, and preserves sandbox immutability.
Out of Scope Changes check ✅ Passed All changes are directly scoped to issue #2281: documentation updates (SKILL.md, docs), Dockerfile build-arg implementation, onboard.ts error message improvements, and comprehensive test coverage for both features.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/inference/switch-inference-providers.md (1)

156-167: ⚠️ Potential issue | 🟠 Major

Remove --resume from the rebuild instruction.

Line 166 still tells users to run --resume, but this PR redefines --resume as interrupted-session-only. That guidance can immediately fail with “No resumable onboarding session was found.”

Suggested doc fix
 ```console
-$ nemoclaw onboard --resume --recreate-sandbox
+$ nemoclaw onboard --recreate-sandbox
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @docs/inference/switch-inference-providers.md around lines 156 - 167, Update
the rebuild instruction to remove the now-invalid --resume flag: in the section
discussing NEMOCLAW_AGENT_TIMEOUT / agents.defaults.timeoutSeconds and immutable
openclaw.json, replace the command example nemoclaw onboard --resume --recreate-sandbox with nemoclaw onboard --recreate-sandbox so users won’t
see “No resumable onboarding session was found.”; keep the explanatory text
about recreating the sandbox the same.


</details>

</blockquote></details>

</blockquote></details>
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/onboard.ts`:
- Line 1400: In the inline comment that currently reads "editing the Dockerfile.
Ref: https://github.com/NVIDIA/NemoClaw/issues/2281" (in src/lib/onboard.ts),
remove the full GitHub URL and replace it with plain issue text such as "editing
the Dockerfile. Ref: issue `#2281`" to comply with the no-repo-link guideline;
update that comment text in the source so only the short issue reference
remains.

---

Outside diff comments:
In `@docs/inference/switch-inference-providers.md`:
- Around line 156-167: Update the rebuild instruction to remove the now-invalid
--resume flag: in the section discussing NEMOCLAW_AGENT_TIMEOUT /
agents.defaults.timeoutSeconds and immutable openclaw.json, replace the command
example `nemoclaw onboard --resume --recreate-sandbox` with `nemoclaw onboard
--recreate-sandbox` so users won’t see “No resumable onboarding session was
found.”; keep the explanatory text about recreating the sandbox the same.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 8b7ce5ea-724c-4b7b-b4a4-4e67f36c6800

📥 Commits

Reviewing files that changed from the base of the PR and between d9aced4 and 5fbb2ac.

📒 Files selected for processing (6)
  • .agents/skills/nemoclaw-user-configure-inference/SKILL.md
  • Dockerfile
  • docs/inference/switch-inference-providers.md
  • src/lib/onboard.ts
  • test/cli.test.ts
  • test/onboard.test.ts

Comment thread src/lib/onboard.ts Outdated
CodeRabbit flagged two source comments added in the previous commit that
reference GitHub issue NVIDIA#2281 via full URL. Project guidelines forbid
links to third-party code repositories in .js/.ts/.sh/.md files. Replace
with plain issue references.

Signed-off-by: Tony Luo <xialuo@nvidia.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Dockerfile`:
- Line 259: The build-time assignment agent_timeout =
int(os.environ.get('NEMOCLAW_AGENT_TIMEOUT', '600')) currently allows zero or
negative values to be baked; modify the code around the agent_timeout assignment
to parse the env var, ensure it's strictly positive (>0), and if not either fall
back to the default 600 or fail the build with a clear error; reference the
NEMOCLAW_AGENT_TIMEOUT env var and the agent_timeout variable so reviewers can
find and update the validation logic accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: dc42838b-9779-434a-8c38-6f22879d1d12

📥 Commits

Reviewing files that changed from the base of the PR and between 5fbb2ac and cc0b65d.

📒 Files selected for processing (2)
  • Dockerfile
  • src/lib/onboard.ts

Comment thread Dockerfile Outdated
Previously a direct `docker build --build-arg NEMOCLAW_AGENT_TIMEOUT=0`
(or negative) would silently bake an unusable timeout into openclaw.json.
The CLI path via `nemoclaw onboard` already rejects these via
POSITIVE_INT_RE, but the Docker build path had no guard. Assert at the
Python config-generation step so bad values fail the build loudly.

Signed-off-by: Tony Luo <xialuo@nvidia.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
Dockerfile (1)

259-261: Replace assert with explicit validation for build-time invariants.

Line 260 uses assert to validate the NEMOCLAW_AGENT_TIMEOUT environment variable. While Python optimization is not currently enabled, assertions can be disabled with the -O flag; use an explicit conditional and raise to ensure this validation always runs.

Suggested patch
 _raw_agent_timeout = os.environ.get('NEMOCLAW_AGENT_TIMEOUT', '600'); \
-assert _raw_agent_timeout.isdigit() and int(_raw_agent_timeout) > 0, 'NEMOCLAW_AGENT_TIMEOUT must be a positive integer'; \
+if (not _raw_agent_timeout.isdigit()) or int(_raw_agent_timeout) <= 0: \
+    raise ValueError('NEMOCLAW_AGENT_TIMEOUT must be a positive integer'); \
 agent_timeout = int(_raw_agent_timeout); \
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile` around lines 259 - 261, Replace the assert-based validation for
NEMOCLAW_AGENT_TIMEOUT with explicit runtime checking: read _raw_agent_timeout =
os.environ.get('NEMOCLAW_AGENT_TIMEOUT', '600'), verify
_raw_agent_timeout.isdigit() and int(_raw_agent_timeout) > 0 in an if/else, and
on failure raise a ValueError (or SystemExit) with a clear message instead of
using assert; then set agent_timeout = int(_raw_agent_timeout) when valid. This
ensures the check always runs even if Python assertions are disabled.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@Dockerfile`:
- Around line 259-261: Replace the assert-based validation for
NEMOCLAW_AGENT_TIMEOUT with explicit runtime checking: read _raw_agent_timeout =
os.environ.get('NEMOCLAW_AGENT_TIMEOUT', '600'), verify
_raw_agent_timeout.isdigit() and int(_raw_agent_timeout) > 0 in an if/else, and
on failure raise a ValueError (or SystemExit) with a clear message instead of
using assert; then set agent_timeout = int(_raw_agent_timeout) when valid. This
ensures the check always runs even if Python assertions are disabled.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: acd32eef-220b-4a27-b985-d452e774f37d

📥 Commits

Reviewing files that changed from the base of the PR and between cc0b65d and 794814b.

📒 Files selected for processing (1)
  • Dockerfile

CodeRabbit flagged that `assert` can be stripped by `python -O`. Their
suggested if/raise pattern is not usable here: the Docker RUN joins
backslash-newline before the shell runs, so `python3 -c "..."` receives
the body as a single logical line, and `stmt; if cond: raise` is not
valid Python on a single line (SyntaxError).

Use a conditional expression with a generator `.throw()` instead — a
real expression that cannot be stripped by optimization.

Signed-off-by: Tony Luo <xialuo@nvidia.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@TonyLuo-NV TonyLuo-NV added the bug Something fails against expected or documented behavior label Apr 23, 2026

@ericksoa ericksoa left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean PR. The two-part fix (better error message + build-time knob) is well-scoped, and the security boundary around openclaw.json immutability is preserved.

Looks good

  • --resume error message — clear, actionable, and tested.
  • NEMOCLAW_AGENT_TIMEOUT build arg — follows the existing NEMOCLAW_CONTEXT_WINDOW / NEMOCLAW_MAX_TOKENS pattern exactly. Good that both the CLI path (POSITIVE_INT_RE guard in patchStagedDockerfile) and the Docker build path (Python validation) reject bad values.
  • Injection test with "not-a-number\nRUN rm -rf /" — nice touch, covers the newline-injection vector explicitly.
  • Docs and SKILL.md updated in lockstep.

Minor comments (non-blocking)

  1. Docs vs. Docker build behavior discrepancy. Both SKILL.md and switch-inference-providers.md say "Invalid values are ignored, and the default bakes into the image." That's true for the CLI path (patchStagedDockerfile silently keeps the default), but the Docker build path now throws ValueError on invalid input — a hard build failure, not a silent fallback. Worth updating the docs to note that direct docker build --build-arg with an invalid value will fail the build. (This discrepancy also applies to the pre-existing vars like NEMOCLAW_CONTEXT_WINDOW which would crash on non-numeric input, so it's not new to this PR.)

  2. Generator .throw() hack. The (_ for _ in ()).throw(ValueError(...)) pattern is well-explained in the commit message, and I agree assert is wrong here. Just noting for future maintainers: if this Python snippet ever gets refactored out of the Dockerfile RUN into a standalone script, the normal if/raise pattern would be clearer.

LGTM — approve.

@ericksoa ericksoa merged commit 9bd9aff into NVIDIA:main Apr 23, 2026
14 of 15 checks passed
@TonyLuo-NV TonyLuo-NV deleted the fix/2281-timeout-config-guidance branch May 6, 2026 08:14
@wscurran wscurran added bug-fix PR fixes a bug or regression and removed bug Something fails against expected or documented behavior labels Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix PR fixes a bug or regression

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unable to increase agents.defaults.timeoutSeconds

3 participants