fix(inference): extend NIM first-start health wait (Fixes #3886) by deepujain · Pull Request #4332 · NVIDIA/NemoClaw

deepujain · 2026-05-27T16:24:34Z

Summary

Local NIM can take longer than five minutes on a first DGX Spark run while vLLM loads checkpoint shards. The container is still doing useful work, but NemoClaw gives up at 300 seconds and falls back away from the working local NIM setup.

This raises the default NIM health wait to 900 seconds while keeping the existing fast-fail behavior when the container exits early.

Changes

src/lib/inference/nim.ts: names the default NIM health timeout and increases it from 300s to 900s.
src/lib/inference/nim.test.ts: adds a regression test for the longer default timeout and keeps the exited-container short-circuit coverage intact.

Testing

npm run build:cli
npm run typecheck:cli
npx vitest run src/lib/inference/nim.test.ts -t 'waitForNimHealth'
npx vitest run src/lib/inference/nim.test.ts

Evidence

The focused wait-path test now verifies the default health wait logs and uses timeout: 900s. The full NIM helper test file passes with 62 tests.

Summary by CodeRabbit

Bug Fixes
- Increased the default managed NIM health-check timeout from 5 minutes to 15 minutes to allow longer startup waits.
Tests
- Added a test validating the extended health-check timeout and verifying startup log output.
Documentation
- Clarified docs: NIM’s managed startup health wait defaults to 900s (separate from other timeouts); other validators use a 180s default and onboarding still exits if the container stops before healthy.

Signed-off-by: Carlos Villela cvillela@nvidia.com

copy-pr-bot · 2026-05-27T16:24:38Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-05-27T16:26:58Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 81cfebbf-bfa7-4598-b15d-226957f5cffe

📥 Commits

Reviewing files that changed from the base of the PR and between 4fdf203 and 01ee939.

📒 Files selected for processing (2)

docs/inference/use-local-inference.mdx
docs/reference/troubleshooting.mdx

✅ Files skipped from review due to trivial changes (2)

docs/reference/troubleshooting.mdx
docs/inference/use-local-inference.mdx

📝 Walkthrough

Walkthrough

Export a 900s NIM health timeout constant; use it as waitForNimHealth's default; add a test validating the exported default, return value, and logs; update an inline regression comment; and document the 900s managed NIM startup health-wait in two docs.

Changes

NIM Health Probe Timeout

Layer / File(s)	Summary
Health timeout constant and function default `src/lib/inference/nim.ts`	Adds `DEFAULT_NIM_HEALTH_TIMEOUT_SECONDS = 900`, replaces hardcoded `300` default in `waitForNimHealth`, and updates an inline comment to reference `900s`.
Test coverage and regression comment `src/lib/inference/nim.test.ts`	Adds a `waitForNimHealth` test that stubs the health endpoint, captures console output, asserts the constant is `900`, verifies `waitForNimHealth(9000)` returns `true`, and updates the inline regression comment to "generic startup timeout".
Docs: NIM timeout clarifications `docs/inference/use-local-inference.mdx`, `docs/reference/troubleshooting.mdx`	Adds notes that NIM's managed startup health wait defaults to `900` seconds, distinguishes it from other timeouts, and notes onboarding can still exit early if the container stops before healthy.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

Platform: DGX Spark

"A rabbit watches the logs unwind,
Nine hundred seconds, patient mind.
Shards align, the startup hums,
No hurried fallback, steady drums.
Hooray — the model wakes in time!"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'fix(inference): extend NIM first-start health wait (Fixes `#3886`)' directly and accurately summarizes the primary change: extending the NIM health probe timeout from 300s to 900s to resolve issue `#3886`.
Linked Issues check	✅ Passed	All code changes directly address issue `#3886` requirements: the default timeout is extended from 300s to 900s, fast-fail behavior is preserved when the container exits early, and the regression test validates the new default.
Out of Scope Changes check	✅ Passed	All changes are in scope: code changes implement the timeout extension with a named constant and test coverage, documentation updates clarify the new behavior, and GPU naming (secondary issue) is not addressed.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/lib/inference/nim.ts (1)

698-701: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update stale timeout comment to match the new default.

Line [700] still mentions default 300s, but the default is now 900s. This is now misleading for maintenance/debugging.

Suggested patch

-    // full timeout (default 300s) against a dead container. See `#3333`.
+    // full timeout (default 900s) against a dead container. See `#3333`.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/inference/nim.ts` around lines 698 - 701, Update the stale inline
comment in src/lib/inference/nim.ts that mentions "default 300s" to the new
default "900s" in the short-circuit container check (the block using the
variable `container` that early-exits when the container has already exited);
keep the rest of the comment intact but replace the numeric timeout to 900s so
documentation matches the current default timeout.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/lib/inference/nim.ts`:
- Around line 698-701: Update the stale inline comment in
src/lib/inference/nim.ts that mentions "default 300s" to the new default "900s"
in the short-circuit container check (the block using the variable `container`
that early-exits when the container has already exited); keep the rest of the
comment intact but replace the numeric timeout to 900s so documentation matches
the current default timeout.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6a018bc6-9647-48d7-876b-d3761d585dd4

📥 Commits

Reviewing files that changed from the base of the PR and between e139dbc and 8c75366.

📒 Files selected for processing (2)

src/lib/inference/nim.test.ts
src/lib/inference/nim.ts

wscurran · 2026-05-27T22:47:15Z

✨ Thanks for submitting this detailed PR that extends the NIM first-start health wait. This proposes a fix for the issue where local NIM can take longer than five minutes on a first DGX Spark run while vLLM loads checkpoint shards, and updates the default NIM health timeout to 900 seconds.

Related open issues:

#3886 [DGX Spark][Inference] NIM container 300s health probe timeout too short for first-time model checkpoint load; onboard falls back to cloud API instead of waiting

Fixes NVIDIA#3886 Signed-off-by: Deepak Jain <deepujain@gmail.com>

deepujain · 2026-05-30T02:02:52Z

Rebased on current main and fixed the stale NIM timeout comment. build:cli and the NIM tests pass.

Signed-off-by: Carlos Villela <cvillela@nvidia.com>

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

docs/inference/use-local-inference.mdx (1)

1-14: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Move SPDX header out of frontmatter and use Markdown HTML comments.

The SPDX text is currently inside YAML frontmatter as # comments. For .mdx, this must be an HTML-comment SPDX header placed after frontmatter.

Proposed fix

 ---
-# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
 title: "Use a Local Inference Server"
 sidebar-title: "Use Local Inference"
 description: "Connect NemoClaw to a local model server such as Ollama, vLLM, TensorRT-LLM, or any OpenAI-compatible endpoint."
@@
 skill:
   priority: 10
 ---
+<!-- SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -->
+<!-- SPDX-License-Identifier: Apache-2.0 -->
 import { AgentOnly } from "../_components/AgentGuide";

As per coding guidelines, "**/*.{...md,mdx...} must include SPDX headers, and use HTML comments for Markdown" and for docs/**, "SPDX license header is present after frontmatter."

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/inference/use-local-inference.mdx` around lines 1 - 14, The SPDX header
lines currently prefixed with '#' are inside the YAML frontmatter; move them out
of the frontmatter and place a single SPDX header using HTML comments
immediately after the frontmatter block. Specifically, remove the two lines
starting with "SPDX-FileCopyrightText" and "SPDX-License-Identifier" from the
YAML and insert them as HTML comments (e.g. <!-- SPDX-FileCopyrightText: ... -->
and <!-- SPDX-License-Identifier: ... -->) directly after the closing --- so the
frontmatter (title/description/etc.) remains intact and the SPDX header sits in
the MDX body as required for docs.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@docs/inference/use-local-inference.mdx`:
- Around line 1-14: The SPDX header lines currently prefixed with '#' are inside
the YAML frontmatter; move them out of the frontmatter and place a single SPDX
header using HTML comments immediately after the frontmatter block.
Specifically, remove the two lines starting with "SPDX-FileCopyrightText" and
"SPDX-License-Identifier" from the YAML and insert them as HTML comments (e.g.
<!-- SPDX-FileCopyrightText: ... --> and <!-- SPDX-License-Identifier: ... -->)
directly after the closing --- so the frontmatter (title/description/etc.)
remains intact and the SPDX header sits in the MDX body as required for docs.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: bf91bbdc-7a6c-4eae-8378-1baad799877d

📥 Commits

Reviewing files that changed from the base of the PR and between 8dd85b6 and 4fdf203.

📒 Files selected for processing (4)

docs/inference/use-local-inference.mdx
docs/reference/troubleshooting.mdx
src/lib/inference/nim.test.ts
src/lib/inference/nim.ts

💤 Files with no reviewable changes (3)

docs/reference/troubleshooting.mdx
src/lib/inference/nim.ts
src/lib/inference/nim.test.ts

Signed-off-by: Carlos Villela <cvillela@nvidia.com>

cv

Salvage pass complete. I pushed docs coverage for the user-facing 15-minute managed NIM startup wait, added the PR-body DCO sign-off for my maintainer commits, approved the fork workflows, and verified the gate status. CI is green, CodeRabbit has no unresolved major/critical findings, risky inference code has regression coverage, and GitHub reports the branch as mergeable.

## Summary - Add the v0.0.59 release notes from the GitHub announcement discussion. - Refresh local inference and credential-storage guidance for the current release behavior. - Regenerate the user skills from the updated Fern docs. - Tighten release-prep and docs review guidance for generated skills, PR labels, and shared `$$nemoclaw` command placeholders. ## Verification - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx` - `rg "permissive mode|shields down|shields up|shields status|config rotate-token|rotate-token" --glob '*.{md,mdx}'` - `git diff --check` - `npm run docs` (rerun outside sandbox after sandbox-only `tsx` IPC permission failure) - `npm run typecheck:cli` - Pre-commit hooks during commit passed, including markdownlint, docs-to-skills verification, gitleaks, commitlint, and skills YAML tests. ## Source Summary - #3679, #4437, #4681, #4766, #4772, #4775, #4786 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`, `docs/reference/troubleshooting.mdx`: Summarize OpenClaw 2026.5.27 compatibility, runtime path pinning, plugin registry recovery, live gateway reconciliation, and clearer host-alias/startup diagnostics. - #4332, #4402, #4769, #4776, #4779 -> `docs/about/release-notes.mdx`, `docs/inference/inference-options.mdx`, `docs/inference/use-local-inference.mdx`, `docs/inference/switch-inference-providers.mdx`: Document the release inference changes covering Local NIM waits, Hermes Anthropic routing, Nemotron 3 Ultra, the current Ollama starter fallback, and Spark managed-vLLM context length. - #4628, #4652, #4733, #4745 -> `docs/about/release-notes.mdx`, `docs/security/credential-storage.mdx`, `docs/manage-sandboxes/messaging-channels.mdx`, `docs/reference/troubleshooting.mdx`: Capture permission healing, gateway-stored credential reuse, cross-sandbox messaging credential conflict checks, and CDI preflight diagnostics. - #4728, #4737, #4743, #4744, #4782 -> `.agents/skills/nemoclaw-user-*`: Regenerate the user skill references from the updated source docs. - Follow-up maintenance -> `.agents/skills/nemoclaw-contributor-update-docs/SKILL.md`, `.coderabbit.yaml`: Add release-prep area labels for docs and skills PRs, and teach docs review guidance that `$$nemoclaw` is the correct shared command placeholder for examples that work across agent aliases. Note: the `documentation` label was not present in the repository, so this PR is labeled with `v0.0.59` only.  ## Summary by CodeRabbit * **Documentation** * Updated default model for local Ollama inference setup to qwen3.5:9b * Added Nemotron 3 Ultra 550B as an NVIDIA Endpoints model option * Clarified credential storage and reuse behavior for post-deployment (day-two) operations * Added v0.0.59 release notes covering OpenClaw compatibility, inference options, Hermes messaging sync, and troubleshooting * Clarified CLI selection guidance and updated OpenClaw version example in status output * Revised release-prep instructions and docs review guidance for CLI alias usage

coderabbitai Bot reviewed May 27, 2026

View reviewed changes

wscurran added enhancement: inference labels May 27, 2026

fix(inference): extend NIM first-start health wait

8dd85b6

Fixes NVIDIA#3886 Signed-off-by: Deepak Jain <deepujain@gmail.com>

deepujain force-pushed the fix/3886-nim-startup-timeout branch from 8c75366 to 8dd85b6 Compare May 30, 2026 02:01

wscurran added the v0.0.58 Release target label Jun 2, 2026

cv self-assigned this Jun 3, 2026

wscurran added area: inference Inference routing, serving, model selection, or outputs bug-fix PR fixes a bug or regression feature PR adds or expands user-visible functionality platform: dgx-spark Affects DGX Spark hardware or workflows and removed Platform: DGX Spark labels Jun 3, 2026

cv added v0.0.60 Release target and removed v0.0.58 Release target labels Jun 3, 2026

wscurran added v0.0.59 Release target and removed v0.0.59 Release target labels Jun 4, 2026

cv added 2 commits June 4, 2026 10:58

Merge branch 'main' into fix/3886-nim-startup-timeout

9e78668

docs(inference): document NIM startup health wait

4fdf203

Signed-off-by: Carlos Villela <cvillela@nvidia.com>

coderabbitai Bot reviewed Jun 4, 2026

View reviewed changes

docs(inference): clarify NIM startup wait duration

01ee939

Signed-off-by: Carlos Villela <cvillela@nvidia.com>

cv approved these changes Jun 4, 2026

View reviewed changes

cv merged commit af7105c into NVIDIA:main Jun 4, 2026
21 checks passed

miyoungc mentioned this pull request Jun 5, 2026

docs: refresh 0.0.59 release notes #4790

Merged

wscurran removed the feature PR adds or expands user-visible functionality label Jun 9, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(inference): extend NIM first-start health wait (Fixes #3886)#4332

fix(inference): extend NIM first-start health wait (Fixes #3886)#4332
cv merged 4 commits into
NVIDIA:mainfrom
deepujain:fix/3886-nim-startup-timeout

deepujain commented May 27, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

copy-pr-bot Bot commented May 27, 2026

Uh oh!

coderabbitai Bot commented May 27, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Suggested labels

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

wscurran commented May 27, 2026

Uh oh!

deepujain commented May 30, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

cv left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

deepujain commented May 27, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

copy-pr-bot Bot commented May 27, 2026

Uh oh!

coderabbitai Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested labels

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

wscurran commented May 27, 2026

Uh oh!

deepujain commented May 30, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cv left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

deepujain commented May 27, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 27, 2026 •

edited

Loading