Skip to content

fix(inference): extend NIM first-start health wait (Fixes #3886)#4332

Merged
cv merged 4 commits into
NVIDIA:mainfrom
deepujain:fix/3886-nim-startup-timeout
Jun 4, 2026
Merged

fix(inference): extend NIM first-start health wait (Fixes #3886)#4332
cv merged 4 commits into
NVIDIA:mainfrom
deepujain:fix/3886-nim-startup-timeout

Conversation

@deepujain

@deepujain deepujain commented May 27, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes #3886.

Local NIM can take longer than five minutes on a first DGX Spark run while vLLM loads checkpoint shards. The container is still doing useful work, but NemoClaw gives up at 300 seconds and falls back away from the working local NIM setup.

This raises the default NIM health wait to 900 seconds while keeping the existing fast-fail behavior when the container exits early.

Changes

  • src/lib/inference/nim.ts: names the default NIM health timeout and increases it from 300s to 900s.
  • src/lib/inference/nim.test.ts: adds a regression test for the longer default timeout and keeps the exited-container short-circuit coverage intact.

Testing

  • npm run build:cli
  • npm run typecheck:cli
  • npx vitest run src/lib/inference/nim.test.ts -t 'waitForNimHealth'
  • npx vitest run src/lib/inference/nim.test.ts

Evidence

The focused wait-path test now verifies the default health wait logs and uses timeout: 900s. The full NIM helper test file passes with 62 tests.

Summary by CodeRabbit

  • Bug Fixes

    • Increased the default managed NIM health-check timeout from 5 minutes to 15 minutes to allow longer startup waits.
  • Tests

    • Added a test validating the extended health-check timeout and verifying startup log output.
  • Documentation

    • Clarified docs: NIM’s managed startup health wait defaults to 900s (separate from other timeouts); other validators use a 180s default and onboarding still exits if the container stops before healthy.

Signed-off-by: Carlos Villela cvillela@nvidia.com

@copy-pr-bot

copy-pr-bot Bot commented May 27, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 81cfebbf-bfa7-4598-b15d-226957f5cffe

📥 Commits

Reviewing files that changed from the base of the PR and between 4fdf203 and 01ee939.

📒 Files selected for processing (2)
  • docs/inference/use-local-inference.mdx
  • docs/reference/troubleshooting.mdx
✅ Files skipped from review due to trivial changes (2)
  • docs/reference/troubleshooting.mdx
  • docs/inference/use-local-inference.mdx

📝 Walkthrough

Walkthrough

Export a 900s NIM health timeout constant; use it as waitForNimHealth's default; add a test validating the exported default, return value, and logs; update an inline regression comment; and document the 900s managed NIM startup health-wait in two docs.

Changes

NIM Health Probe Timeout

Layer / File(s) Summary
Health timeout constant and function default
src/lib/inference/nim.ts
Adds DEFAULT_NIM_HEALTH_TIMEOUT_SECONDS = 900, replaces hardcoded 300 default in waitForNimHealth, and updates an inline comment to reference 900s.
Test coverage and regression comment
src/lib/inference/nim.test.ts
Adds a waitForNimHealth test that stubs the health endpoint, captures console output, asserts the constant is 900, verifies waitForNimHealth(9000) returns true, and updates the inline regression comment to "generic startup timeout".
Docs: NIM timeout clarifications
docs/inference/use-local-inference.mdx, docs/reference/troubleshooting.mdx
Adds notes that NIM's managed startup health wait defaults to 900 seconds, distinguishes it from other timeouts, and notes onboarding can still exit early if the container stops before healthy.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

Platform: DGX Spark

"A rabbit watches the logs unwind,
Nine hundred seconds, patient mind.
Shards align, the startup hums,
No hurried fallback, steady drums.
Hooray — the model wakes in time!"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix(inference): extend NIM first-start health wait (Fixes #3886)' directly and accurately summarizes the primary change: extending the NIM health probe timeout from 300s to 900s to resolve issue #3886.
Linked Issues check ✅ Passed All code changes directly address issue #3886 requirements: the default timeout is extended from 300s to 900s, fast-fail behavior is preserved when the container exits early, and the regression test validates the new default.
Out of Scope Changes check ✅ Passed All changes are in scope: code changes implement the timeout extension with a named constant and test coverage, documentation updates clarify the new behavior, and GPU naming (secondary issue) is not addressed.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lib/inference/nim.ts (1)

698-701: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update stale timeout comment to match the new default.

Line [700] still mentions default 300s, but the default is now 900s. This is now misleading for maintenance/debugging.

Suggested patch
-    // full timeout (default 300s) against a dead container. See `#3333`.
+    // full timeout (default 900s) against a dead container. See `#3333`.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/inference/nim.ts` around lines 698 - 701, Update the stale inline
comment in src/lib/inference/nim.ts that mentions "default 300s" to the new
default "900s" in the short-circuit container check (the block using the
variable `container` that early-exits when the container has already exited);
keep the rest of the comment intact but replace the numeric timeout to 900s so
documentation matches the current default timeout.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/lib/inference/nim.ts`:
- Around line 698-701: Update the stale inline comment in
src/lib/inference/nim.ts that mentions "default 300s" to the new default "900s"
in the short-circuit container check (the block using the variable `container`
that early-exits when the container has already exited); keep the rest of the
comment intact but replace the numeric timeout to 900s so documentation matches
the current default timeout.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6a018bc6-9647-48d7-876b-d3761d585dd4

📥 Commits

Reviewing files that changed from the base of the PR and between e139dbc and 8c75366.

📒 Files selected for processing (2)
  • src/lib/inference/nim.test.ts
  • src/lib/inference/nim.ts

@wscurran

Copy link
Copy Markdown
Contributor

✨ Thanks for submitting this detailed PR that extends the NIM first-start health wait. This proposes a fix for the issue where local NIM can take longer than five minutes on a first DGX Spark run while vLLM loads checkpoint shards, and updates the default NIM health timeout to 900 seconds.


Related open issues:

Fixes NVIDIA#3886

Signed-off-by: Deepak Jain <deepujain@gmail.com>
@deepujain deepujain force-pushed the fix/3886-nim-startup-timeout branch from 8c75366 to 8dd85b6 Compare May 30, 2026 02:01
@deepujain

Copy link
Copy Markdown
Contributor Author

Rebased on current main and fixed the stale NIM timeout comment. build:cli and the NIM tests pass.

@wscurran wscurran added the v0.0.58 Release target label Jun 2, 2026
@cv cv self-assigned this Jun 3, 2026
@wscurran wscurran added area: inference Inference routing, serving, model selection, or outputs bug-fix PR fixes a bug or regression feature PR adds or expands user-visible functionality platform: dgx-spark Affects DGX Spark hardware or workflows and removed Platform: DGX Spark labels Jun 3, 2026
@cv cv added v0.0.60 Release target and removed v0.0.58 Release target labels Jun 3, 2026
@wscurran wscurran added v0.0.59 Release target and removed v0.0.59 Release target labels Jun 4, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/inference/use-local-inference.mdx (1)

1-14: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Move SPDX header out of frontmatter and use Markdown HTML comments.

The SPDX text is currently inside YAML frontmatter as # comments. For .mdx, this must be an HTML-comment SPDX header placed after frontmatter.

Proposed fix
 ---
-# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
 title: "Use a Local Inference Server"
 sidebar-title: "Use Local Inference"
 description: "Connect NemoClaw to a local model server such as Ollama, vLLM, TensorRT-LLM, or any OpenAI-compatible endpoint."
@@
 skill:
   priority: 10
 ---
+<!-- SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -->
+<!-- SPDX-License-Identifier: Apache-2.0 -->
 import { AgentOnly } from "../_components/AgentGuide";

As per coding guidelines, "**/*.{...md,mdx...} must include SPDX headers, and use HTML comments for Markdown" and for docs/**, "SPDX license header is present after frontmatter."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/inference/use-local-inference.mdx` around lines 1 - 14, The SPDX header
lines currently prefixed with '#' are inside the YAML frontmatter; move them out
of the frontmatter and place a single SPDX header using HTML comments
immediately after the frontmatter block. Specifically, remove the two lines
starting with "SPDX-FileCopyrightText" and "SPDX-License-Identifier" from the
YAML and insert them as HTML comments (e.g. <!-- SPDX-FileCopyrightText: ... -->
and <!-- SPDX-License-Identifier: ... -->) directly after the closing --- so the
frontmatter (title/description/etc.) remains intact and the SPDX header sits in
the MDX body as required for docs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@docs/inference/use-local-inference.mdx`:
- Around line 1-14: The SPDX header lines currently prefixed with '#' are inside
the YAML frontmatter; move them out of the frontmatter and place a single SPDX
header using HTML comments immediately after the frontmatter block.
Specifically, remove the two lines starting with "SPDX-FileCopyrightText" and
"SPDX-License-Identifier" from the YAML and insert them as HTML comments (e.g.
<!-- SPDX-FileCopyrightText: ... --> and <!-- SPDX-License-Identifier: ... -->)
directly after the closing --- so the frontmatter (title/description/etc.)
remains intact and the SPDX header sits in the MDX body as required for docs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: bf91bbdc-7a6c-4eae-8378-1baad799877d

📥 Commits

Reviewing files that changed from the base of the PR and between 8dd85b6 and 4fdf203.

📒 Files selected for processing (4)
  • docs/inference/use-local-inference.mdx
  • docs/reference/troubleshooting.mdx
  • src/lib/inference/nim.test.ts
  • src/lib/inference/nim.ts
💤 Files with no reviewable changes (3)
  • docs/reference/troubleshooting.mdx
  • src/lib/inference/nim.ts
  • src/lib/inference/nim.test.ts

Signed-off-by: Carlos Villela <cvillela@nvidia.com>

@cv cv left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Salvage pass complete. I pushed docs coverage for the user-facing 15-minute managed NIM startup wait, added the PR-body DCO sign-off for my maintainer commits, approved the fork workflows, and verified the gate status. CI is green, CodeRabbit has no unresolved major/critical findings, risky inference code has regression coverage, and GitHub reports the branch as mergeable.

@cv cv merged commit af7105c into NVIDIA:main Jun 4, 2026
21 checks passed
cv pushed a commit that referenced this pull request Jun 5, 2026
## Summary
- Add the v0.0.59 release notes from the GitHub announcement discussion.
- Refresh local inference and credential-storage guidance for the
current release behavior.
- Regenerate the user skills from the updated Fern docs.
- Tighten release-prep and docs review guidance for generated skills, PR
labels, and shared `$$nemoclaw` command placeholders.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `rg "permissive mode|shields down|shields up|shields status|config
rotate-token|rotate-token" --glob '*.{md,mdx}'`
- `git diff --check`
- `npm run docs` (rerun outside sandbox after sandbox-only `tsx` IPC
permission failure)
- `npm run typecheck:cli`
- Pre-commit hooks during commit passed, including markdownlint,
docs-to-skills verification, gitleaks, commitlint, and skills YAML
tests.

## Source Summary
- #3679, #4437, #4681, #4766, #4772, #4775, #4786 ->
`docs/about/release-notes.mdx`, `docs/reference/commands.mdx`,
`docs/reference/troubleshooting.mdx`: Summarize OpenClaw 2026.5.27
compatibility, runtime path pinning, plugin registry recovery, live
gateway reconciliation, and clearer host-alias/startup diagnostics.
- #4332, #4402, #4769, #4776, #4779 -> `docs/about/release-notes.mdx`,
`docs/inference/inference-options.mdx`,
`docs/inference/use-local-inference.mdx`,
`docs/inference/switch-inference-providers.mdx`: Document the release
inference changes covering Local NIM waits, Hermes Anthropic routing,
Nemotron 3 Ultra, the current Ollama starter fallback, and Spark
managed-vLLM context length.
- #4628, #4652, #4733, #4745 -> `docs/about/release-notes.mdx`,
`docs/security/credential-storage.mdx`,
`docs/manage-sandboxes/messaging-channels.mdx`,
`docs/reference/troubleshooting.mdx`: Capture permission healing,
gateway-stored credential reuse, cross-sandbox messaging credential
conflict checks, and CDI preflight diagnostics.
- #4728, #4737, #4743, #4744, #4782 -> `.agents/skills/nemoclaw-user-*`:
Regenerate the user skill references from the updated source docs.
- Follow-up maintenance ->
`.agents/skills/nemoclaw-contributor-update-docs/SKILL.md`,
`.coderabbit.yaml`: Add release-prep area labels for docs and skills
PRs, and teach docs review guidance that `$$nemoclaw` is the correct
shared command placeholder for examples that work across agent aliases.

Note: the `documentation` label was not present in the repository, so
this PR is labeled with `v0.0.59` only.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Documentation**
  * Updated default model for local Ollama inference setup to qwen3.5:9b
  * Added Nemotron 3 Ultra 550B as an NVIDIA Endpoints model option
* Clarified credential storage and reuse behavior for post-deployment
(day-two) operations
* Added v0.0.59 release notes covering OpenClaw compatibility, inference
options, Hermes messaging sync, and troubleshooting
* Clarified CLI selection guidance and updated OpenClaw version example
in status output
* Revised release-prep instructions and docs review guidance for CLI
alias usage
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@wscurran wscurran removed the feature PR adds or expands user-visible functionality label Jun 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: inference Inference routing, serving, model selection, or outputs bug-fix PR fixes a bug or regression platform: dgx-spark Affects DGX Spark hardware or workflows v0.0.60 Release target

Projects

None yet

3 participants