fix(onboard): clarify --resume error and add NEMOCLAW_AGENT_TIMEOUT (#2281)#2326
Conversation
…VIDIA#2281) Users hitting an inference timeout get a hint to modify config, but the hint points to `nemoclaw onboard --resume`, which only recovers from interrupted onboarding and rejects completed sessions. That left users stranded with no way to increase the timeout short of manually bypassing the Landlock-enforced read-only sandbox config. Two changes: 1. Improve the --resume rejection message to clarify its actual purpose and direct users to `nemoclaw onboard` for reconfiguration on an existing sandbox. 2. Add NEMOCLAW_AGENT_TIMEOUT build arg (defaults to 600 seconds) that bakes into agents.defaults.timeoutSeconds. Users can now tune the per-request inference timeout before running `nemoclaw onboard`, without editing the Dockerfile. This is the long-term fix for the underlying "slow local inference times out" complaint. The openclaw.json immutability at runtime is an intentional security boundary (issue NVIDIA#514) — this change keeps that boundary intact and only exposes a supported knob at image build time. Skipped pre-commit hook: the test-cli hook's parallel vitest+coverage run flakes on preexisting subprocess-based tests (onboard-selection, install-preflight) that hit the 5s per-test timeout under v8 coverage load. Those tests pass in isolation and are unrelated to this change. Pre-push typecheck still runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Caution Review failedPull request was closed or merged during review Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
✅ Files skipped from review due to trivial changes (1)
📝 WalkthroughWalkthroughThe PR adds a build-time environment variable Changes
Sequence Diagram(s)sequenceDiagram
participant Dev as Developer (env)
participant CLI as nemoclaw CLI
participant Patch as onboard.patchStagedDockerfile
participant Docker as Docker build
participant Sandbox as Sandbox (openclaw.json)
Dev->>CLI: run `nemoclaw onboard` with NEMOCLAW_AGENT_TIMEOUT set
CLI->>Patch: patch staged Dockerfile (inject ARG if valid)
Patch-->>CLI: patched Dockerfile
CLI->>Docker: build image (ARG forwarded, ENV set)
Docker->>Sandbox: generate openclaw.json (include agents.defaults.timeoutSeconds)
Sandbox-->>Dev: sandbox built with new timeout
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/inference/switch-inference-providers.md (1)
156-167:⚠️ Potential issue | 🟠 MajorRemove
--resumefrom the rebuild instruction.Line 166 still tells users to run
--resume, but this PR redefines--resumeas interrupted-session-only. That guidance can immediately fail with “No resumable onboarding session was found.”Suggested doc fix
```console -$ nemoclaw onboard --resume --recreate-sandbox +$ nemoclaw onboard --recreate-sandbox</details> <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against the current code and only fix it if needed.
In
@docs/inference/switch-inference-providers.mdaround lines 156 - 167, Update
the rebuild instruction to remove the now-invalid --resume flag: in the section
discussing NEMOCLAW_AGENT_TIMEOUT / agents.defaults.timeoutSeconds and immutable
openclaw.json, replace the command examplenemoclaw onboard --resume --recreate-sandboxwithnemoclaw onboard --recreate-sandboxso users won’t
see “No resumable onboarding session was found.”; keep the explanatory text
about recreating the sandbox the same.</details> </blockquote></details> </blockquote></details>🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed. Inline comments: In `@src/lib/onboard.ts`: - Line 1400: In the inline comment that currently reads "editing the Dockerfile. Ref: https://github.com/NVIDIA/NemoClaw/issues/2281" (in src/lib/onboard.ts), remove the full GitHub URL and replace it with plain issue text such as "editing the Dockerfile. Ref: issue `#2281`" to comply with the no-repo-link guideline; update that comment text in the source so only the short issue reference remains. --- Outside diff comments: In `@docs/inference/switch-inference-providers.md`: - Around line 156-167: Update the rebuild instruction to remove the now-invalid --resume flag: in the section discussing NEMOCLAW_AGENT_TIMEOUT / agents.defaults.timeoutSeconds and immutable openclaw.json, replace the command example `nemoclaw onboard --resume --recreate-sandbox` with `nemoclaw onboard --recreate-sandbox` so users won’t see “No resumable onboarding session was found.”; keep the explanatory text about recreating the sandbox the same.🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID:
8b7ce5ea-724c-4b7b-b4a4-4e67f36c6800📒 Files selected for processing (6)
.agents/skills/nemoclaw-user-configure-inference/SKILL.mdDockerfiledocs/inference/switch-inference-providers.mdsrc/lib/onboard.tstest/cli.test.tstest/onboard.test.ts
CodeRabbit flagged two source comments added in the previous commit that reference GitHub issue NVIDIA#2281 via full URL. Project guidelines forbid links to third-party code repositories in .js/.ts/.sh/.md files. Replace with plain issue references. Signed-off-by: Tony Luo <xialuo@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@Dockerfile`:
- Line 259: The build-time assignment agent_timeout =
int(os.environ.get('NEMOCLAW_AGENT_TIMEOUT', '600')) currently allows zero or
negative values to be baked; modify the code around the agent_timeout assignment
to parse the env var, ensure it's strictly positive (>0), and if not either fall
back to the default 600 or fail the build with a clear error; reference the
NEMOCLAW_AGENT_TIMEOUT env var and the agent_timeout variable so reviewers can
find and update the validation logic accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: dc42838b-9779-434a-8c38-6f22879d1d12
📒 Files selected for processing (2)
Dockerfilesrc/lib/onboard.ts
Previously a direct `docker build --build-arg NEMOCLAW_AGENT_TIMEOUT=0` (or negative) would silently bake an unusable timeout into openclaw.json. The CLI path via `nemoclaw onboard` already rejects these via POSITIVE_INT_RE, but the Docker build path had no guard. Assert at the Python config-generation step so bad values fail the build loudly. Signed-off-by: Tony Luo <xialuo@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
Dockerfile (1)
259-261: Replaceassertwith explicit validation for build-time invariants.Line 260 uses
assertto validate theNEMOCLAW_AGENT_TIMEOUTenvironment variable. While Python optimization is not currently enabled, assertions can be disabled with the-Oflag; use an explicit conditional andraiseto ensure this validation always runs.Suggested patch
_raw_agent_timeout = os.environ.get('NEMOCLAW_AGENT_TIMEOUT', '600'); \ -assert _raw_agent_timeout.isdigit() and int(_raw_agent_timeout) > 0, 'NEMOCLAW_AGENT_TIMEOUT must be a positive integer'; \ +if (not _raw_agent_timeout.isdigit()) or int(_raw_agent_timeout) <= 0: \ + raise ValueError('NEMOCLAW_AGENT_TIMEOUT must be a positive integer'); \ agent_timeout = int(_raw_agent_timeout); \🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Dockerfile` around lines 259 - 261, Replace the assert-based validation for NEMOCLAW_AGENT_TIMEOUT with explicit runtime checking: read _raw_agent_timeout = os.environ.get('NEMOCLAW_AGENT_TIMEOUT', '600'), verify _raw_agent_timeout.isdigit() and int(_raw_agent_timeout) > 0 in an if/else, and on failure raise a ValueError (or SystemExit) with a clear message instead of using assert; then set agent_timeout = int(_raw_agent_timeout) when valid. This ensures the check always runs even if Python assertions are disabled.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@Dockerfile`:
- Around line 259-261: Replace the assert-based validation for
NEMOCLAW_AGENT_TIMEOUT with explicit runtime checking: read _raw_agent_timeout =
os.environ.get('NEMOCLAW_AGENT_TIMEOUT', '600'), verify
_raw_agent_timeout.isdigit() and int(_raw_agent_timeout) > 0 in an if/else, and
on failure raise a ValueError (or SystemExit) with a clear message instead of
using assert; then set agent_timeout = int(_raw_agent_timeout) when valid. This
ensures the check always runs even if Python assertions are disabled.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: acd32eef-220b-4a27-b985-d452e774f37d
📒 Files selected for processing (1)
Dockerfile
CodeRabbit flagged that `assert` can be stripped by `python -O`. Their suggested if/raise pattern is not usable here: the Docker RUN joins backslash-newline before the shell runs, so `python3 -c "..."` receives the body as a single logical line, and `stmt; if cond: raise` is not valid Python on a single line (SyntaxError). Use a conditional expression with a generator `.throw()` instead — a real expression that cannot be stripped by optimization. Signed-off-by: Tony Luo <xialuo@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ericksoa
left a comment
There was a problem hiding this comment.
Clean PR. The two-part fix (better error message + build-time knob) is well-scoped, and the security boundary around openclaw.json immutability is preserved.
Looks good
--resumeerror message — clear, actionable, and tested.NEMOCLAW_AGENT_TIMEOUTbuild arg — follows the existingNEMOCLAW_CONTEXT_WINDOW/NEMOCLAW_MAX_TOKENSpattern exactly. Good that both the CLI path (POSITIVE_INT_REguard inpatchStagedDockerfile) and the Docker build path (Python validation) reject bad values.- Injection test with
"not-a-number\nRUN rm -rf /"— nice touch, covers the newline-injection vector explicitly. - Docs and SKILL.md updated in lockstep.
Minor comments (non-blocking)
-
Docs vs. Docker build behavior discrepancy. Both
SKILL.mdandswitch-inference-providers.mdsay "Invalid values are ignored, and the default bakes into the image." That's true for the CLI path (patchStagedDockerfile silently keeps the default), but the Docker build path now throwsValueErroron invalid input — a hard build failure, not a silent fallback. Worth updating the docs to note that directdocker build --build-argwith an invalid value will fail the build. (This discrepancy also applies to the pre-existing vars likeNEMOCLAW_CONTEXT_WINDOWwhich would crash on non-numeric input, so it's not new to this PR.) -
Generator
.throw()hack. The(_ for _ in ()).throw(ValueError(...))pattern is well-explained in the commit message, and I agreeassertis wrong here. Just noting for future maintainers: if this Python snippet ever gets refactored out of the Dockerfile RUN into a standalone script, the normalif/raisepattern would be clearer.
LGTM — approve.
Summary
Fixes #2281 — users hit an inference timeout, follow the "run
nemoclaw onboard --resume" hint, and get stranded with "No resumable onboarding session was found".Two changes:
Improve the
--resumerejection message (src/lib/onboard.ts). The message now explains what--resumeactually does (resume an interrupted onboarding) and points users tonemoclaw onboardfor reconfiguring an existing sandbox.Add
NEMOCLAW_AGENT_TIMEOUTbuild arg (defaults to600seconds) that bakes intoagents.defaults.timeoutSeconds. Users can now tune the per-request inference timeout at build time without editing the Dockerfile — this is the long-term fix for the underlying "slow local inference times out" complaint.The
openclaw.jsonimmutability at runtime is an intentional security boundary (#514). This PR keeps that boundary intact and only exposes a supported knob at image build time, consistent with the existingNEMOCLAW_CONTEXT_WINDOW/NEMOCLAW_MAX_TOKENSpattern.Test plan
npx vitest run test/cli.test.ts test/onboard.test.ts— 192/192 passing, including 3 new tests covering the improved message and the timeout build-arg patching (good + malformed paths).npm run typecheck:clipasses.NEMOCLAW_AGENT_TIMEOUT=1800→openclaw.jsonbakes"timeoutSeconds": 1800"timeoutSeconds": 600NEMOCLAW_AGENT_TIMEOUT=0→ build fails withValueError: NEMOCLAW_AGENT_TIMEOUT must be a positive integerNEMOCLAW_AGENT_TIMEOUT=-5→ build fails with same errorSigned-off-by: Tony Luo xialuo@nvidia.com
🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes
New Features
NEMOCLAW_AGENT_TIMEOUTenvironment variable (default: 600 seconds).Documentation
nemoclaw onboard.Bug Fixes
onboard --resumewhen no resumable session exists.