feat(vllm): add NEMOCLAW_VLLM_MODEL override + gated-model check by laitingsheng · Pull Request #3642 · NVIDIA/NemoClaw

laitingsheng · 2026-05-16T05:15:59Z

Summary

Adds a scoped escape hatch on the express vLLM install path: setting NEMOCLAW_VLLM_MODEL=<slug> picks a different entry from a small new registry, swaps in the right vllm serve flags, and fails fast when a gated model is requested without a Hugging Face token. The token is now forwarded into the managed-vLLM container so hf download and vllm serve can authenticate. An interactive model picker on the express path is not part of this PR and remains a separate design item; the env-var path covers the gated-model scenario QA was trying to exercise.

Related Issue

Fixes #3566
Fixes #3572

Changes

Introduce src/lib/inference/vllm-models.ts with a typed registry, env-var resolver, gated-access check, and a shared buildVllmServeCommand that merges common flags with the model-specific args.
Seed the registry with Qwen3.6 27B FP8 (Spark/Station default), Nemotron-3 Nano 4B FP8 (generic-Linux default), and DeepSeek-R1 Distill Llama 70B (gated).
Resolver returns null when NEMOCLAW_VLLM_MODEL is unset so the caller can keep the per-platform profile default (the generic-Linux profile must stay on Nemotron-Nano-4B for VRAM headroom).
Refactor VllmProfile to carry a defaultModel: VllmModelDef instead of static model/command strings. installVllm now resolves the model per call, validates gated access, and builds the serve command on the fly. It forwards HF_TOKEN / HUGGING_FACE_HUB_TOKEN into both the hf download one-shot and the long-lived vllm serve container using the key-only -e KEY form, so the secret stays out of the docker argv. The value is passed through the runner's env option so docker can inherit it without breaking the subprocess-env allowlist.
Document the env var, the recognised slugs, and the gated-model HF token requirement under the Managed vLLM section of use-local-inference.md; add it to the env-var reference table in commands.md to satisfy the env-var doc gate.
Update detect-vllm-profile.test.ts to assert against profile.defaultModel.id and add coverage for buildHfTokenDockerArgs + buildHfTokenForwardEnv (HF_TOKEN preferred, HUGGING_FACE_HUB_TOKEN fallback, empty/whitespace handled).
Add vllm-models.test.ts covering the env-var resolver (slug + HF id forms, unset-returns-null), unknown-value error, gated-access check, and command-builder output.
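
The resolver and gated-access check described in the list above could look roughly like the following. This is a minimal sketch, not the code in src/lib/inference/vllm-models.ts: the field names, the function names, the placeholder registry entries, and the choice to treat an empty or whitespace-only value as unset are all assumptions made for illustration.

```typescript
// Hypothetical model shape; the real VllmModelDef carries more fields.
interface SketchModelDef {
  id: string; // Hugging Face model id
  envSlug: string; // value accepted by NEMOCLAW_VLLM_MODEL
  gated: boolean; // true when the model needs a Hugging Face token
}

type SketchEnv = Record<string, string | undefined>;

// Placeholder entries, NOT the models pinned by the PR.
const SKETCH_MODELS: readonly SketchModelDef[] = [
  { id: "example-org/open-model", envSlug: "open-model", gated: false },
  { id: "example-org/gated-model", envSlug: "gated-model", gated: true },
];

// Returns null when the variable is unset so the caller can keep the
// per-platform profile default. Treating "" as unset is an assumption.
function selectModelFromEnv(env: SketchEnv): SketchModelDef | null {
  const raw = (env.NEMOCLAW_VLLM_MODEL ?? "").trim();
  if (raw === "") return null;
  const wanted = raw.toLowerCase();
  for (const model of SKETCH_MODELS) {
    // Accept either the short slug or the full Hugging Face id.
    if (model.envSlug.toLowerCase() === wanted || model.id.toLowerCase() === wanted) {
      return model;
    }
  }
  const known = SKETCH_MODELS.map((m) => m.envSlug).join(", ");
  throw new Error(`Unknown NEMOCLAW_VLLM_MODEL "${raw}". Known slugs: ${known}`);
}

// Fails fast on the host when a gated model has no usable token.
function assertGatedAccess(model: SketchModelDef, env: SketchEnv): void {
  if (!model.gated) return;
  const token = (env.HF_TOKEN ?? "").trim() || (env.HUGGING_FACE_HUB_TOKEN ?? "").trim();
  if (token === "") {
    throw new Error(`${model.id} is gated; set HF_TOKEN or HUGGING_FACE_HUB_TOKEN`);
  }
}
```

A caller would fall back to the profile default when the resolver returns null, then run the gated check before any docker work starts.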

Type of Change

Code change (feature, bug fix, or refactor)
Code change with doc updates
Doc only (prose changes, no code sample modifications)
Doc only (includes code sample changes)

Verification

`npx prek run --all-files` passes
`npm test` passes
Tests added or updated for new or changed behavior
No secrets, API keys, or credentials committed
Docs updated for user-facing behavior changes
`make docs` builds without warnings (doc changes only)
Doc pages follow the style guide (doc changes only)
New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

Summary by CodeRabbit

Release Notes

New Features
- Added ability to select custom vLLM models via environment configuration, with support for Hugging Face authentication for gated models.
Documentation
- Added guides for configuring vLLM models, managing authentication credentials, and available model options.

copy-pr-bot · 2026-05-16T05:16:02Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai · 2026-05-16T05:16:04Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 66028b00-0939-4fea-9912-7ff6fc976295

📥 Commits

Reviewing files that changed from the base of the PR and between 8f1d51d and d13db34.

📒 Files selected for processing (3)

src/lib/inference/vllm-models.ts
src/lib/inference/vllm.ts
test/detect-vllm-profile.test.ts

🚧 Files skipped from review as they are similar to previous changes (2)

test/detect-vllm-profile.test.ts
src/lib/inference/vllm-models.ts

📝 Walkthrough

This PR implements the NEMOCLAW_VLLM_MODEL environment variable to allow users to override the default vLLM model during express installation. It introduces a model registry with environment-driven selection, gated-model access validation, and refactors the vLLM profile system and container lifecycle to accept model objects and dynamically construct serving commands.

Changes

vLLM Model Registry and Selection

Layer / File(s)	Summary
Model registry contracts and data definitions `src/lib/inference/vllm-models.ts`	`VllmModelDef` interface captures model metadata including HuggingFace id, environment slug, max context length, model-specific vLLM arguments, and gated flag. `VLLM_MODELS` registry contains three pinned models; `DEFAULT_VLLM_MODEL` exports the first entry. `SHARED_VLLM_ARGS` defines immutable vLLM serve flags.
Environment variable selection and validation `src/lib/inference/vllm-models.ts`	`selectVllmModelFromEnv()` reads `NEMOCLAW_VLLM_MODEL`, resolves it case-insensitively against registry env slugs or full HuggingFace ids, and returns `null` when unset or throws a helpful error for unknown values. `assertGatedModelAccess()` validates that gated models have `HF_TOKEN` or `HUGGING_FACE_HUB_TOKEN` available.
Dynamic vLLM serve command construction `src/lib/inference/vllm-models.ts`	`buildVllmServeCommand()` generates a shell command installing `vllm[fastsafetensors]` and running `vllm serve` with shared args, model-specific vLLM arguments, and per-model `--max-model-len`.
Model registry test coverage `src/lib/inference/vllm-models.test.ts`	Vitest suite validates `selectVllmModelFromEnv` null/case-insensitive/error behavior, `assertGatedModelAccess` gated and non-gated flows with token presence/absence, and `buildVllmServeCommand` shared flags and model-specific argument variations including max-model-len and reasoning/tool parsers.
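
The command builder summarized in the table above merges shared flags, model-specific flags, and a per-model `--max-model-len`. The sketch below shows one way to do that merge. The function and type names, the shared flag values, the `pip install` form, and the quoting helper are assumptions for illustration and are not taken from the PR.

```typescript
// Hypothetical subset of the model definition used by the builder.
interface ServeModel {
  id: string;
  maxModelLen: number;
  extraArgs: readonly string[]; // model-specific vllm serve flags
}

// Illustrative shared flags only; the PR's SHARED_VLLM_ARGS may differ.
const SKETCH_SHARED_ARGS: readonly string[] = ["--host", "0.0.0.0", "--port", "8000"];

// Minimal POSIX-shell quoting: safe tokens pass through, the rest are
// wrapped in single quotes with embedded quotes escaped.
function shellQuote(arg: string): string {
  if (/^[A-Za-z0-9_.\/:=@-]+$/.test(arg)) return arg;
  return "'" + arg.replace(/'/g, "'\\''") + "'";
}

// Order: shared flags, then model-specific flags, then --max-model-len.
function buildServeCommand(model: ServeModel): string {
  const serve = ["vllm", "serve", model.id]
    .concat(SKETCH_SHARED_ARGS, model.extraArgs, ["--max-model-len", String(model.maxModelLen)])
    .map(shellQuote)
    .join(" ");
  // The install step is an assumed form of "installing vllm[fastsafetensors]".
  return "pip install 'vllm[fastsafetensors]' && " + serve;
}
```

Building the command from the model object at container-start time is what lets an override swap in its own parser and context-length flags without a static command string per profile.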

vLLM Container and Orchestration Integration

Layer / File(s)	Summary
HuggingFace token Docker argument builder `src/lib/inference/vllm.ts`	`buildHfTokenDockerArgs()` and `buildHfTokenForwardEnv()` generate `docker -e` args and a forwarded env map from host environment with fallback/precedence and whitespace handling.
Profile refactoring: hardcoded model string to object `src/lib/inference/vllm.ts`	`VllmProfile` interface transitions from `model: string` to `defaultModel: VllmModelDef`. Spark profile sets its default explicitly; Station inherits it; generic Linux profile uses `nemotronNanoModel()`. Static `command` field removed; command generation is deferred to container-start time.
Container download and start refactoring `src/lib/inference/vllm.ts`	`downloadModel(profile, model)` pre-downloads the resolved model with `hf download` and injects HF credentials. `startContainer(profile, model)` forwards HF credentials into the container and dynamically constructs the `vllm serve` command via `buildVllmServeCommand(model)`.
Model selection and install orchestration `src/lib/inference/vllm.ts`	`installVllm()` resolves the target model via `selectVllmModelFromEnv()` with fallback to `profile.defaultModel`, validates gated-model access before docker operations, logs the resolved model and override status, and passes the model to download and start functions.
Profile and integration test updates `test/detect-vllm-profile.test.ts`	Profile detection tests updated to assert `defaultModel.id` instead of `model`. New `buildHfTokenDockerArgs` and `buildHfTokenForwardEnv` test suites cover no-token, fallback, precedence, and whitespace behavior.
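
The token helpers in the table above pair key-only `docker -e KEY` arguments with an env map handed to the subprocess runner, so the secret value never appears in argv. A rough sketch follows, assuming that both variable names are forwarded with the same resolved value. That detail, and all names below, are guesses rather than the PR's implementation.

```typescript
type HostEnv = Record<string, string | undefined>;

// Precedence order: HF_TOKEN first, HUGGING_FACE_HUB_TOKEN as fallback.
const HF_TOKEN_KEYS: readonly string[] = ["HF_TOKEN", "HUGGING_FACE_HUB_TOKEN"];

// Returns the first non-empty, trimmed token, or null when none is set.
function resolveHfToken(env: HostEnv): string | null {
  for (const key of HF_TOKEN_KEYS) {
    const value = (env[key] ?? "").trim();
    if (value !== "") return value;
  }
  return null;
}

// Key-only form: docker reads the value from its own process environment.
function hfTokenDockerArgs(env: HostEnv): string[] {
  const args: string[] = [];
  if (resolveHfToken(env) === null) return args;
  for (const key of HF_TOKEN_KEYS) args.push("-e", key);
  return args;
}

// Env entries to pass through the runner's env option (assumed API).
function hfTokenForwardEnv(env: HostEnv): Record<string, string> {
  const out: Record<string, string> = {};
  const token = resolveHfToken(env);
  if (token === null) return out;
  for (const key of HF_TOKEN_KEYS) out[key] = token;
  return out;
}
```

With `{ HUGGING_FACE_HUB_TOKEN: " tok " }` as input, the argument list contains only variable names and the forwarded map carries the trimmed value.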

User-Facing Documentation

Layer / File(s)	Summary
Environment variable documentation `docs/inference/use-local-inference.md`, `docs/reference/commands.md`	User guide documents the `NEMOCLAW_VLLM_MODEL` override subsection with model slug list, case-insensitivity, gated-model requirements, and HuggingFace token provisioning. Reference guide adds the environment variable entry with model identifiers and token requirement.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

Local Models, documentation

Suggested reviewers

ericksoa
cv
jyaunches

Poem

A rabbit hops through model rows,
Picks Qwen, DeepSeek, or Nemotron flows,
With tokens tucked and gates unlocked,
vLLM serves what the user picked—
Express installer, choices restored! 🐰

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 63.64% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately and specifically describes the main changes: adding NEMOCLAW_VLLM_MODEL environment variable support for overriding vLLM model selection and implementing gated-model access validation.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/vllm-express-model-registry

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.

Comment @coderabbitai help to get the list of available commands and usage tips.

github-actions · 2026-05-16T05:18:00Z

E2E Advisor Recommendation

Required E2E: gpu-e2e
Optional E2E: inference-routing-e2e, docs-validation-e2e

Dispatch hint: gpu-e2e

Auto-dispatched E2E: gpu-e2e via nightly-e2e.yaml at 11fef1992daac79b91eb331937a2dceffa61372e — nightly run

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: medium

Required E2E

gpu-e2e (high (~30 minutes, GPU runner)): Closest existing E2E for local GPU-backed inference onboarding and sandbox inference flow. It does not exercise managed vLLM directly, but it is the only existing dispatchable GPU local-inference E2E and can catch broader local inference/onboard regressions caused by the changed inference module.

Optional E2E

inference-routing-e2e (medium (~30 minutes)): Useful adjacent coverage for inference.local routing, credential isolation, and classified provider failures. It does not cover managed vLLM container startup, so it should be optional.
docs-validation-e2e (low): Validates the edited inference/command documentation and command/env-var docs consistency after adding NEMOCLAW_VLLM_MODEL.

New E2E recommendations

managed-vLLM install model override (high): No existing E2E appears to run NEMOCLAW_EXPERIMENTAL=1 NEMOCLAW_PROVIDER=install-vllm or validate NEMOCLAW_VLLM_MODEL with a real managed vLLM container. The current GPU E2E is Ollama-only.
- Suggested test: Add a managed-vllm-install-e2e on a GPU runner that selects NEMOCLAW_VLLM_MODEL=nemotron-3-nano-4b, runs non-interactive onboarding, waits for the nemoclaw-vllm container, and verifies sandbox inference.local chat/completions works.
managed-vLLM gated model credential handling (high): The PR adds fail-fast token checks and Docker env forwarding for gated Hugging Face models, but there is no end-to-end coverage ensuring secrets are forwarded by key-only -e KEY and not exposed in process argv/logs.
- Suggested test: Add a managed-vllm-gated-model-credential-e2e or hermetic Docker-spy E2E that requests deepseek-r1-distill-70b, verifies missing-token failure happens before docker pull, and verifies present-token forwarding without argv/log leakage.
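
The argv-leak check suggested above could be built around a small helper like the one below. This is only a sketch of the assertion such a Docker-spy test might use. The helper name and the sample token value are hypothetical, and it does not exist in the repository.

```typescript
// Returns the indexes of argv entries that contain the secret value.
// An empty result means the secret did not leak into the command line.
function findSecretLeaks(argv: readonly string[], secret: string): number[] {
  const leaks: number[] = [];
  if (secret === "") return leaks; // nothing to search for
  argv.forEach((arg, index) => {
    if (arg.indexOf(secret) !== -1) leaks.push(index);
  });
  return leaks;
}

// Key-only forwarding: the variable name appears, the value does not.
const keyOnlyArgv = ["run", "-e", "HF_TOKEN", "example-image"];
// Value-carrying form that the PR avoids.
const inlineArgv = ["run", "-e", "HF_TOKEN=hf_example_value", "example-image"];

const keyOnlyLeaks = findSecretLeaks(keyOnlyArgv, "hf_example_value");
const inlineLeaks = findSecretLeaks(inlineArgv, "hf_example_value");
```

A spy-based test would capture the argv passed to the docker runner and require the leak list to be empty for every captured call.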

Dispatch hint

Workflow: nightly-e2e.yaml
jobs input: gpu-e2e

github-actions · 2026-05-16T05:57:48Z

Selective E2E Results — ⚠️ No requested jobs ran

Run: 25954415141
Target ref: 8f1d51d7b319fc7d0ac45804a8dc9c6ff77570f7
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job	Result
gpu-e2e	⏭️ skipped

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (6)

docs/reference/commands.md (1)

1156-1156: ⚡ Quick win

Use US English spelling.

"Recognised" uses British spelling.
Prefer "Recognized" for consistency with US English conventions.

Suggested revision

-| `NEMOCLAW_VLLM_MODEL` | registry slug or Hugging Face model id | Selects the model the managed-vLLM install path serves. Recognised slugs: `qwen3.6-27b`, `nemotron-3-nano-4b`, `deepseek-r1-distill-70b`. Unset uses the per-platform profile default. Gated models (e.g. `deepseek-r1-distill-70b`) require `HF_TOKEN` or `HUGGING_FACE_HUB_TOKEN`. |
+| `NEMOCLAW_VLLM_MODEL` | registry slug or Hugging Face model id | Selects the model the managed-vLLM install path serves. Recognized slugs: `qwen3.6-27b`, `nemotron-3-nano-4b`, `deepseek-r1-distill-70b`. Unset uses the per-platform profile default. Gated models (e.g. `deepseek-r1-distill-70b`) require `HF_TOKEN` or `HUGGING_FACE_HUB_TOKEN`. |

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/reference/commands.md` at line 1156, Update the US English spelling in
the table cell describing the NEMOCLAW_VLLM_MODEL environment variable: change
the word "Recognised" to "Recognized" in the sentence that lists recognised
slugs (e.g., `qwen3.6-27b`, `nemotron-3-nano-4b`, `deepseek-r1-distill-70b`) so
the description reads "Recognized slugs: ...". Keep the rest of the text (gating
note about `HF_TOKEN`/`HUGGING_FACE_HUB_TOKEN` and default behavior) unchanged.

docs/inference/use-local-inference.md (5)

312-312: ⚡ Quick win

Use US English spelling.

"Recognised" uses British spelling.
Prefer "Recognized" for consistency with US English conventions.

Suggested revision

-Recognised slugs:
+Recognized slugs:

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/inference/use-local-inference.md` at line 312, Change the British
spelling "Recognised slugs:" to US English "Recognized slugs:" in the docs
string (look for the literal "Recognised slugs:" in use-local-inference.md) so
the documentation uses consistent US English spelling.

310-310: ⚡ Quick win

Remove informal conversational tone.

The phrase "by, well, default" includes an unnecessary conversational interjection.
Use direct technical language instead.

Suggested revision

-Managed vLLM serves the profile default by, well, default.
+Managed vLLM serves the profile default model.

As per coding guidelines: documentation voice should be direct and professional, avoiding conversational asides.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/inference/use-local-inference.md` at line 310, Replace the informal
sentence "Managed vLLM serves the profile default by, well, default." with a
direct, professional version such as "Managed vLLM serves the profile default."
— edit the sentence text in the docs to remove the conversational interjection
("by, well,") and ensure the tone is direct and technical in the surrounding
paragraph.

320-320: ⚡ Quick win

Split into one sentence per line.

Multiple sentences appear on the same line.

Suggested revision

-The slug is case-insensitive; the full Hugging Face id is also accepted.
+The slug is case-insensitive.
+The full Hugging Face id is also accepted.

As per coding guidelines: use one sentence per line in source to make diffs readable.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/inference/use-local-inference.md` at line 320, The line "The slug is
case-insensitive; the full Hugging Face id is also accepted." contains two
sentences on one line; break them into separate sentences each on its own line
(e.g., "The slug is case-insensitive." on one line and "The full Hugging Face id
is also accepted." on the next) and replace the semicolon with a period so the
source follows the one-sentence-per-line guideline.

311-311: ⚡ Quick win

Split into one sentence per line.

Multiple sentences appear on the same line.
This makes diffs harder to read.

Suggested revision

-Export `NEMOCLAW_VLLM_MODEL=<slug>` before invoking the installer to swap in a different model from the registry; NemoClaw uses the matching `vllm serve` flags (reasoning parser, tool-call parser, `--max-model-len`).
+Export `NEMOCLAW_VLLM_MODEL=<slug>` before invoking the installer to swap in a different model from the registry.
+NemoClaw uses the matching `vllm serve` flags (reasoning parser, tool-call parser, `--max-model-len`).

As per coding guidelines: use one sentence per line in source to make diffs readable.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/inference/use-local-inference.md` at line 311, The line containing
"Export `NEMOCLAW_VLLM_MODEL=<slug>` before invoking the installer to swap in a
different model from the registry; NemoClaw uses the matching `vllm serve` flags
(reasoning parser, tool-call parser, `--max-model-len`)." has multiple sentences
on one line—split it so each sentence is its own line, preserving the inline
code spans (`NEMOCLAW_VLLM_MODEL=<slug>`, `vllm serve`, and `--max-model-len`)
and punctuation, e.g., one line for the export instruction, one line explaining
NemoClaw uses matching vllm serve flags, and one line listing the specific
flags.

334-334: ⚡ Quick win

Split into one sentence per line.

Multiple sentences appear on the same line.

Suggested revision

-The token check runs on the host before any docker pull, so a missing or empty token aborts onboarding before bandwidth is spent on a 401.
+The token check runs on the host before any docker pull.
+A missing or empty token aborts onboarding before bandwidth is spent on a 401.

As per coding guidelines: use one sentence per line in source to make diffs readable.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/inference/use-local-inference.md` at line 334, Split the long sentence
in docs/inference/use-local-inference.md into two separate sentences each on its
own line: change "The token check runs on the host before any docker pull, so a
missing or empty token aborts onboarding before bandwidth is spent on a 401."
into two sentences (e.g., "The token check runs on the host before any docker
pull." and "A missing or empty token aborts onboarding before bandwidth is spent
on a 401.") so each sentence occupies its own source line for clearer diffs.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/inference/vllm-models.test.ts`:
- Around line 6-12: The test is importing compiled output; change the import in
vllm-models.test.ts to point at the source module so tests reflect source edits
— replace the import from "../../../dist/lib/inference/vllm-models" with the
corresponding source path (e.g. "../../../src/lib/inference/vllm-models") so
DEFAULT_VLLM_MODEL, VLLM_MODELS, assertGatedModelAccess, buildVllmServeCommand,
and selectVllmModelFromEnv are loaded from the source files.

In `@src/lib/inference/vllm-models.ts`:
- Line 79: The Nemotron model entry in src/lib/inference/vllm-models.ts has
maxModelLen set to 262000 (a typo); update the property maxModelLen on the
Nemotron model object to 262144 to match vLLM documentation and the Qwen
registry entry so the model config uses the correct 256K token length.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e9140b1a-2d1c-422d-b8f7-25a170058be6

📥 Commits

Reviewing files that changed from the base of the PR and between 8e2687b and 8f1d51d.

📒 Files selected for processing (6)

docs/inference/use-local-inference.md
docs/reference/commands.md
src/lib/inference/vllm-models.test.ts
src/lib/inference/vllm-models.ts
src/lib/inference/vllm.ts
test/detect-vllm-profile.test.ts

github-actions · 2026-05-16T06:36:08Z

Selective E2E Results — ⚠️ No requested jobs ran

Run: 25955151084
Target ref: d13db343249e33c97afac05c117a0d1ad32e0a1e
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job	Result
gpu-e2e	⏭️ skipped

github-actions · 2026-05-16T06:58:54Z

Selective E2E Results — ⚠️ No requested jobs ran

Run: 25955584981
Target ref: 11fef1992daac79b91eb331937a2dceffa61372e
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job	Result
gpu-e2e	⏭️ skipped

## Summary

Updates the NemoClaw documentation for the v0.0.45 release by summarizing the user-facing changes merged since v0.0.44 and bumping the docs version metadata. Refreshes generated user skills so agent-facing references match the source docs.

## Changes

- Added v0.0.45 release notes covering onboarding recovery, local inference, channel cleanup, share mount diagnostics, uninstall cleanup, and security redaction updates.
- Updated command and troubleshooting docs for sandbox name limits, GPU gateway reuse, DNS preflight behavior, channel removal cleanup, and share mount path validation.
- Bumped docs version metadata to 0.0.45 and regenerated NemoClaw user skills from the docs.
- Source summary: #3672 -> `docs/reference/commands.md`: documented channel removal detaching bridge providers and un-applying channel policy presets.
- Source summary: #3678 -> `docs/about/release-notes.md`: documented Ollama streamed usage accounting in the release notes.
- Source summary: #3670 -> `docs/reference/commands.md`, `docs/reference/troubleshooting.md`: documented safe GPU gateway replacement behavior.
- Source summary: #3664 -> `docs/about/release-notes.md`: documented blueprint permission normalization in the release notes.
- Source summary: #3181 -> `docs/reference/troubleshooting.md`: documented GPU toolkit guidance when host drivers work but passthrough is disabled.
- Source summary: #3554 -> `docs/about/release-notes.md`: documented host `openshell-gateway` cleanup during uninstall.
- Source summary: #3651 -> `docs/reference/troubleshooting.md`: documented the uncached `.invalid` DNS preflight probe.
- Source summary: #3643 -> `docs/reference/commands.md`: included existing `NEMOCLAW_PROVIDER` interactive-mode behavior in generated docs.
- Source summary: #3647 -> `docs/reference/commands.md`: documented remote sandbox path verification for `share mount`.
- Source summary: #3646 -> `docs/reference/commands.md`: included existing local writable mount target guidance in generated docs.
- Source summary: #3642 -> `docs/inference/use-local-inference.md`, `docs/reference/commands.md`: documented managed-vLLM model override and gated-model token checks.
- Source summary: #3639 -> `docs/reference/commands.md`: documented the 63-character sandbox name limit.

## Type of Change

- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification

- [ ] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [x] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

Commit hooks passed for the staged files. A standalone `npx prek run --all-files` attempt was blocked by sandbox access to `/Users/miyoungc/.cache/prek/prek.log`, so that checkbox is left unchecked.

---

Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

## Summary by CodeRabbit

* **Documentation**
  * Enhanced CLI command reference documentation with clearer guidance on onboarding, GPU passthrough, inference configuration, channel removal, and shared mounts.
  * Improved troubleshooting sections with better DNS resolution and GPU passthrough remediation steps.
  * Added documentation for overriding managed vLLM model selection.
  * Updated release notes for v0.0.45 reflecting infrastructure and workflow improvements.
* **Version Bump**
  * Released v0.0.45.
feat(vllm): add NEMOCLAW_VLLM_MODEL override + gated-model check (#3566)

7329977

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

laitingsheng added the enhancement: inference label May 16, 2026

fix(vllm): forward HF_TOKEN to docker, default to profile model, docu…

8f1d51d

…ment NEMOCLAW_VLLM_MODEL Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

laitingsheng marked this pull request as ready for review May 16, 2026 05:56

laitingsheng added the v0.0.45 label May 16, 2026

coderabbitai Bot reviewed May 16, 2026

Comment thread src/lib/inference/vllm-models.test.ts

Comment thread src/lib/inference/vllm-models.ts Outdated

laitingsheng removed the v0.0.45 label May 16, 2026

laitingsheng marked this pull request as draft May 16, 2026 06:07

fix(vllm): keep HF token out of docker argv and bump Nemotron max_mod…

d13db34

…el_len to 262144 Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

laitingsheng marked this pull request as ready for review May 16, 2026 06:33

cv approved these changes May 16, 2026

Merge branch 'main' into feat/vllm-express-model-registry

11fef19

laitingsheng added the v0.0.45 label May 16, 2026

cv merged commit 2bec0ef into main May 16, 2026
28 checks passed

miyoungc mentioned this pull request May 18, 2026

docs: update release notes for v0.0.45 #3755

Merged

12 tasks

wscurran added the area: inference (Inference routing, serving, model selection, or outputs) and feature (PR adds or expands user-visible functionality) labels and removed the enhancement: inference label Jun 3, 2026
