fix(inference): extend NIM first-start health wait (Fixes #3886)#4332
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
✅ Files skipped from review due to trivial changes (2)
📝 WalkthroughWalkthroughExport a 900s NIM health timeout constant; use it as waitForNimHealth's default; add a test validating the exported default, return value, and logs; update an inline regression comment; and document the 900s managed NIM startup health-wait in two docs. ChangesNIM Health Probe Timeout
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/inference/nim.ts (1)
698-701:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUpdate stale timeout comment to match the new default.
Line [700] still mentions
default 300s, but the default is now 900s. This is now misleading for maintenance/debugging.Suggested patch
- // full timeout (default 300s) against a dead container. See `#3333`. + // full timeout (default 900s) against a dead container. See `#3333`.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/inference/nim.ts` around lines 698 - 701, Update the stale inline comment in src/lib/inference/nim.ts that mentions "default 300s" to the new default "900s" in the short-circuit container check (the block using the variable `container` that early-exits when the container has already exited); keep the rest of the comment intact but replace the numeric timeout to 900s so documentation matches the current default timeout.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/lib/inference/nim.ts`:
- Around line 698-701: Update the stale inline comment in
src/lib/inference/nim.ts that mentions "default 300s" to the new default "900s"
in the short-circuit container check (the block using the variable `container`
that early-exits when the container has already exited); keep the rest of the
comment intact but replace the numeric timeout to 900s so documentation matches
the current default timeout.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 6a018bc6-9647-48d7-876b-d3761d585dd4
📒 Files selected for processing (2)
src/lib/inference/nim.test.tssrc/lib/inference/nim.ts
|
✨ Thanks for submitting this detailed PR that extends the NIM first-start health wait. This proposes a fix for the issue where local NIM can take longer than five minutes on a first DGX Spark run while vLLM loads checkpoint shards, and updates the default NIM health timeout to 900 seconds. Related open issues: |
Fixes NVIDIA#3886 Signed-off-by: Deepak Jain <deepujain@gmail.com>
8c75366 to
8dd85b6
Compare
|
Rebased on current main and fixed the stale NIM timeout comment. build:cli and the NIM tests pass. |
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/inference/use-local-inference.mdx (1)
1-14:⚠️ Potential issue | 🟠 Major | ⚡ Quick winMove SPDX header out of frontmatter and use Markdown HTML comments.
The SPDX text is currently inside YAML frontmatter as
#comments. For.mdx, this must be an HTML-comment SPDX header placed after frontmatter.Proposed fix
--- -# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 title: "Use a Local Inference Server" sidebar-title: "Use Local Inference" description: "Connect NemoClaw to a local model server such as Ollama, vLLM, TensorRT-LLM, or any OpenAI-compatible endpoint." @@ skill: priority: 10 --- +<!-- SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. --> +<!-- SPDX-License-Identifier: Apache-2.0 --> import { AgentOnly } from "../_components/AgentGuide";As per coding guidelines, "
**/*.{...md,mdx...}must include SPDX headers, and use HTML comments for Markdown" and fordocs/**, "SPDX license header is present after frontmatter."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/inference/use-local-inference.mdx` around lines 1 - 14, The SPDX header lines currently prefixed with '#' are inside the YAML frontmatter; move them out of the frontmatter and place a single SPDX header using HTML comments immediately after the frontmatter block. Specifically, remove the two lines starting with "SPDX-FileCopyrightText" and "SPDX-License-Identifier" from the YAML and insert them as HTML comments (e.g. <!-- SPDX-FileCopyrightText: ... --> and <!-- SPDX-License-Identifier: ... -->) directly after the closing --- so the frontmatter (title/description/etc.) remains intact and the SPDX header sits in the MDX body as required for docs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@docs/inference/use-local-inference.mdx`:
- Around line 1-14: The SPDX header lines currently prefixed with '#' are inside
the YAML frontmatter; move them out of the frontmatter and place a single SPDX
header using HTML comments immediately after the frontmatter block.
Specifically, remove the two lines starting with "SPDX-FileCopyrightText" and
"SPDX-License-Identifier" from the YAML and insert them as HTML comments (e.g.
<!-- SPDX-FileCopyrightText: ... --> and <!-- SPDX-License-Identifier: ... -->)
directly after the closing --- so the frontmatter (title/description/etc.)
remains intact and the SPDX header sits in the MDX body as required for docs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: bf91bbdc-7a6c-4eae-8378-1baad799877d
📒 Files selected for processing (4)
docs/inference/use-local-inference.mdxdocs/reference/troubleshooting.mdxsrc/lib/inference/nim.test.tssrc/lib/inference/nim.ts
💤 Files with no reviewable changes (3)
- docs/reference/troubleshooting.mdx
- src/lib/inference/nim.ts
- src/lib/inference/nim.test.ts
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
cv
left a comment
There was a problem hiding this comment.
Salvage pass complete. I pushed docs coverage for the user-facing 15-minute managed NIM startup wait, added the PR-body DCO sign-off for my maintainer commits, approved the fork workflows, and verified the gate status. CI is green, CodeRabbit has no unresolved major/critical findings, risky inference code has regression coverage, and GitHub reports the branch as mergeable.
## Summary
- Add the v0.0.59 release notes from the GitHub announcement discussion.
- Refresh local inference and credential-storage guidance for the
current release behavior.
- Regenerate the user skills from the updated Fern docs.
- Tighten release-prep and docs review guidance for generated skills, PR
labels, and shared `$$nemoclaw` command placeholders.
## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `rg "permissive mode|shields down|shields up|shields status|config
rotate-token|rotate-token" --glob '*.{md,mdx}'`
- `git diff --check`
- `npm run docs` (rerun outside sandbox after sandbox-only `tsx` IPC
permission failure)
- `npm run typecheck:cli`
- Pre-commit hooks during commit passed, including markdownlint,
docs-to-skills verification, gitleaks, commitlint, and skills YAML
tests.
## Source Summary
- #3679, #4437, #4681, #4766, #4772, #4775, #4786 ->
`docs/about/release-notes.mdx`, `docs/reference/commands.mdx`,
`docs/reference/troubleshooting.mdx`: Summarize OpenClaw 2026.5.27
compatibility, runtime path pinning, plugin registry recovery, live
gateway reconciliation, and clearer host-alias/startup diagnostics.
- #4332, #4402, #4769, #4776, #4779 -> `docs/about/release-notes.mdx`,
`docs/inference/inference-options.mdx`,
`docs/inference/use-local-inference.mdx`,
`docs/inference/switch-inference-providers.mdx`: Document the release
inference changes covering Local NIM waits, Hermes Anthropic routing,
Nemotron 3 Ultra, the current Ollama starter fallback, and Spark
managed-vLLM context length.
- #4628, #4652, #4733, #4745 -> `docs/about/release-notes.mdx`,
`docs/security/credential-storage.mdx`,
`docs/manage-sandboxes/messaging-channels.mdx`,
`docs/reference/troubleshooting.mdx`: Capture permission healing,
gateway-stored credential reuse, cross-sandbox messaging credential
conflict checks, and CDI preflight diagnostics.
- #4728, #4737, #4743, #4744, #4782 -> `.agents/skills/nemoclaw-user-*`:
Regenerate the user skill references from the updated source docs.
- Follow-up maintenance ->
`.agents/skills/nemoclaw-contributor-update-docs/SKILL.md`,
`.coderabbit.yaml`: Add release-prep area labels for docs and skills
PRs, and teach docs review guidance that `$$nemoclaw` is the correct
shared command placeholder for examples that work across agent aliases.
Note: the `documentation` label was not present in the repository, so
this PR is labeled with `v0.0.59` only.
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Documentation**
* Updated default model for local Ollama inference setup to qwen3.5:9b
* Added Nemotron 3 Ultra 550B as an NVIDIA Endpoints model option
* Clarified credential storage and reuse behavior for post-deployment
(day-two) operations
* Added v0.0.59 release notes covering OpenClaw compatibility, inference
options, Hermes messaging sync, and troubleshooting
* Clarified CLI selection guidance and updated OpenClaw version example
in status output
* Revised release-prep instructions and docs review guidance for CLI
alias usage
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
Fixes #3886.
Local NIM can take longer than five minutes on a first DGX Spark run while vLLM loads checkpoint shards. The container is still doing useful work, but NemoClaw gives up at 300 seconds and falls back away from the working local NIM setup.
This raises the default NIM health wait to 900 seconds while keeping the existing fast-fail behavior when the container exits early.
Changes
src/lib/inference/nim.ts: names the default NIM health timeout and increases it from 300s to 900s.src/lib/inference/nim.test.ts: adds a regression test for the longer default timeout and keeps the exited-container short-circuit coverage intact.Testing
npm run build:clinpm run typecheck:clinpx vitest run src/lib/inference/nim.test.ts -t 'waitForNimHealth'npx vitest run src/lib/inference/nim.test.tsEvidence
The focused wait-path test now verifies the default health wait logs and uses
timeout: 900s. The full NIM helper test file passes with 62 tests.Summary by CodeRabbit
Bug Fixes
Tests
Documentation
Signed-off-by: Carlos Villela cvillela@nvidia.com