feat(vllm): add NEMOCLAW_VLLM_MODEL override + gated-model check#3642
Conversation
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughThis PR implements the ChangesvLLM Model Registry and Selection
vLLM Container and Orchestration Integration
User-Facing Documentation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Auto-dispatched E2E: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
…ment NEMOCLAW_VLLM_MODEL Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results —
|
| Job | Result |
|---|---|
| gpu-e2e | ⏭️ skipped |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (6)
docs/reference/commands.md (1)
1156-1156: ⚡ Quick winUse US English spelling.
"Recognised" uses British spelling.
Prefer "Recognized" for consistency with US English conventions.Suggested revision
-| `NEMOCLAW_VLLM_MODEL` | registry slug or Hugging Face model id | Selects the model the managed-vLLM install path serves. Recognised slugs: `qwen3.6-27b`, `nemotron-3-nano-4b`, `deepseek-r1-distill-70b`. Unset uses the per-platform profile default. Gated models (e.g. `deepseek-r1-distill-70b`) require `HF_TOKEN` or `HUGGING_FACE_HUB_TOKEN`. | +| `NEMOCLAW_VLLM_MODEL` | registry slug or Hugging Face model id | Selects the model the managed-vLLM install path serves. Recognized slugs: `qwen3.6-27b`, `nemotron-3-nano-4b`, `deepseek-r1-distill-70b`. Unset uses the per-platform profile default. Gated models (e.g. `deepseek-r1-distill-70b`) require `HF_TOKEN` or `HUGGING_FACE_HUB_TOKEN`. |🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/reference/commands.md` at line 1156, Update the US English spelling in the table cell describing the NEMOCLAW_VLLM_MODEL environment variable: change the word "Recognised" to "Recognized" in the sentence that lists recognised slugs (e.g., `qwen3.6-27b`, `nemotron-3-nano-4b`, `deepseek-r1-distill-70b`) so the description reads "Recognized slugs: ...". Keep the rest of the text (gating note about `HF_TOKEN`/`HUGGING_FACE_HUB_TOKEN` and default behavior) unchanged.docs/inference/use-local-inference.md (5)
312-312: ⚡ Quick winUse US English spelling.
"Recognised" uses British spelling.
Prefer "Recognized" for consistency with US English conventions.Suggested revision
-Recognised slugs: +Recognized slugs:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/inference/use-local-inference.md` at line 312, Change the British spelling "Recognised slugs:" to US English "Recognized slugs:" in the docs string (look for the literal "Recognised slugs:" in use-local-inference.md) so the documentation uses consistent US English spelling.
310-310: ⚡ Quick winRemove informal conversational tone.
The phrase "by, well, default" includes an unnecessary conversational interjection.
Use direct technical language instead.Suggested revision
-Managed vLLM serves the profile default by, well, default. +Managed vLLM serves the profile default model.As per coding guidelines: documentation voice should be direct and professional, avoiding conversational asides.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/inference/use-local-inference.md` at line 310, Replace the informal sentence "Managed vLLM serves the profile default by, well, default." with a direct, professional version such as "Managed vLLM serves the profile default." — edit the sentence text in the docs to remove the conversational interjection ("by, well,") and ensure the tone is direct and technical in the surrounding paragraph.
320-320: ⚡ Quick winSplit into one sentence per line.
Multiple sentences appear on the same line.
Suggested revision
-The slug is case-insensitive; the full Hugging Face id is also accepted. +The slug is case-insensitive. +The full Hugging Face id is also accepted.As per coding guidelines: use one sentence per line in source to make diffs readable.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/inference/use-local-inference.md` at line 320, The line "The slug is case-insensitive; the full Hugging Face id is also accepted." contains two sentences on one line; break them into separate sentences each on its own line (e.g., "The slug is case-insensitive." on one line and "The full Hugging Face id is also accepted." on the next) and replace the semicolon with a period so the source follows the one-sentence-per-line guideline.
311-311: ⚡ Quick winSplit into one sentence per line.
Multiple sentences appear on the same line.
This makes diffs harder to read.Suggested revision
-Export `NEMOCLAW_VLLM_MODEL=<slug>` before invoking the installer to swap in a different model from the registry; NemoClaw uses the matching `vllm serve` flags (reasoning parser, tool-call parser, `--max-model-len`). +Export `NEMOCLAW_VLLM_MODEL=<slug>` before invoking the installer to swap in a different model from the registry. +NemoClaw uses the matching `vllm serve` flags (reasoning parser, tool-call parser, `--max-model-len`).As per coding guidelines: use one sentence per line in source to make diffs readable.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/inference/use-local-inference.md` at line 311, The line containing "Export `NEMOCLAW_VLLM_MODEL=<slug>` before invoking the installer to swap in a different model from the registry; NemoClaw uses the matching `vllm serve` flags (reasoning parser, tool-call parser, `--max-model-len`)." has multiple sentences on one line—split it so each sentence is its own line, preserving the inline code spans (`NEMOCLAW_VLLM_MODEL=<slug>`, `vllm serve`, and `--max-model-len`) and punctuation, e.g., one line for the export instruction, one line explaining NemoClaw uses matching vllm serve flags, and one line listing the specific flags.
334-334: ⚡ Quick winSplit into one sentence per line.
Multiple sentences appear on the same line.
Suggested revision
-The token check runs on the host before any docker pull, so a missing or empty token aborts onboarding before bandwidth is spent on a 401. +The token check runs on the host before any docker pull. +A missing or empty token aborts onboarding before bandwidth is spent on a 401.As per coding guidelines: use one sentence per line in source to make diffs readable.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/inference/use-local-inference.md` at line 334, Split the long sentence in docs/inference/use-local-inference.md into two separate sentences each on its own line: change "The token check runs on the host before any docker pull, so a missing or empty token aborts onboarding before bandwidth is spent on a 401." into two sentences (e.g., "The token check runs on the host before any docker pull." and "A missing or empty token aborts onboarding before bandwidth is spent on a 401.") so each sentence occupies its own source line for clearer diffs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/inference/vllm-models.test.ts`:
- Around line 6-12: The test is importing compiled output; change the import in
vllm-models.test.ts to point at the source module so tests reflect source edits
— replace the import from "../../../dist/lib/inference/vllm-models" with the
corresponding source path (e.g. "../../../src/lib/inference/vllm-models") so
DEFAULT_VLLM_MODEL, VLLM_MODELS, assertGatedModelAccess, buildVllmServeCommand,
and selectVllmModelFromEnv are loaded from the source files.
In `@src/lib/inference/vllm-models.ts`:
- Line 79: The Nemotron model entry in src/lib/inference/vllm-models.ts has
maxModelLen set to 262000 (a typo); update the property maxModelLen on the
Nemotron model object to 262144 to match vLLM documentation and the Qwen
registry entry so the model config uses the correct 256K token length.
---
Nitpick comments:
In `@docs/inference/use-local-inference.md`:
- Line 312: Change the British spelling "Recognised slugs:" to US English
"Recognized slugs:" in the docs string (look for the literal "Recognised slugs:"
in use-local-inference.md) so the documentation uses consistent US English
spelling.
- Line 310: Replace the informal sentence "Managed vLLM serves the profile
default by, well, default." with a direct, professional version such as "Managed
vLLM serves the profile default." — edit the sentence text in the docs to remove
the conversational interjection ("by, well,") and ensure the tone is direct and
technical in the surrounding paragraph.
- Line 320: The line "The slug is case-insensitive; the full Hugging Face id is
also accepted." contains two sentences on one line; break them into separate
sentences each on its own line (e.g., "The slug is case-insensitive." on one
line and "The full Hugging Face id is also accepted." on the next) and replace
the semicolon with a period so the source follows the one-sentence-per-line
guideline.
- Line 311: The line containing "Export `NEMOCLAW_VLLM_MODEL=<slug>` before
invoking the installer to swap in a different model from the registry; NemoClaw
uses the matching `vllm serve` flags (reasoning parser, tool-call parser,
`--max-model-len`)." has multiple sentences on one line—split it so each
sentence is its own line, preserving the inline code spans
(`NEMOCLAW_VLLM_MODEL=<slug>`, `vllm serve`, and `--max-model-len`) and
punctuation, e.g., one line for the export instruction, one line explaining
NemoClaw uses matching vllm serve flags, and one line listing the specific
flags.
- Line 334: Split the long sentence in docs/inference/use-local-inference.md
into two separate sentences each on its own line: change "The token check runs
on the host before any docker pull, so a missing or empty token aborts
onboarding before bandwidth is spent on a 401." into two sentences (e.g., "The
token check runs on the host before any docker pull." and "A missing or empty
token aborts onboarding before bandwidth is spent on a 401.") so each sentence
occupies its own source line for clearer diffs.
In `@docs/reference/commands.md`:
- Line 1156: Update the US English spelling in the table cell describing the
NEMOCLAW_VLLM_MODEL environment variable: change the word "Recognised" to
"Recognized" in the sentence that lists recognised slugs (e.g., `qwen3.6-27b`,
`nemotron-3-nano-4b`, `deepseek-r1-distill-70b`) so the description reads
"Recognized slugs: ...". Keep the rest of the text (gating note about
`HF_TOKEN`/`HUGGING_FACE_HUB_TOKEN` and default behavior) unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e9140b1a-2d1c-422d-b8f7-25a170058be6
📒 Files selected for processing (6)
docs/inference/use-local-inference.mddocs/reference/commands.mdsrc/lib/inference/vllm-models.test.tssrc/lib/inference/vllm-models.tssrc/lib/inference/vllm.tstest/detect-vllm-profile.test.ts
…el_len to 262144 Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results —
|
| Job | Result |
|---|---|
| gpu-e2e | ⏭️ skipped |
Selective E2E Results —
|
| Job | Result |
|---|---|
| gpu-e2e | ⏭️ skipped |
## Summary Updates the NemoClaw documentation for the v0.0.45 release by summarizing the user-facing changes merged since v0.0.44 and bumping the docs version metadata. Refreshes generated user skills so agent-facing references match the source docs. ## Changes - Added v0.0.45 release notes covering onboarding recovery, local inference, channel cleanup, share mount diagnostics, uninstall cleanup, and security redaction updates. - Updated command and troubleshooting docs for sandbox name limits, GPU gateway reuse, DNS preflight behavior, channel removal cleanup, and share mount path validation. - Bumped docs version metadata to 0.0.45 and regenerated NemoClaw user skills from the docs. - Source summary: #3672 -> `docs/reference/commands.md`: documented channel removal detaching bridge providers and un-applying channel policy presets. - Source summary: #3678 -> `docs/about/release-notes.md`: documented Ollama streamed usage accounting in the release notes. - Source summary: #3670 -> `docs/reference/commands.md`, `docs/reference/troubleshooting.md`: documented safe GPU gateway replacement behavior. - Source summary: #3664 -> `docs/about/release-notes.md`: documented blueprint permission normalization in the release notes. - Source summary: #3181 -> `docs/reference/troubleshooting.md`: documented GPU toolkit guidance when host drivers work but passthrough is disabled. - Source summary: #3554 -> `docs/about/release-notes.md`: documented host `openshell-gateway` cleanup during uninstall. - Source summary: #3651 -> `docs/reference/troubleshooting.md`: documented the uncached `.invalid` DNS preflight probe. - Source summary: #3643 -> `docs/reference/commands.md`: included existing `NEMOCLAW_PROVIDER` interactive-mode behavior in generated docs. - Source summary: #3647 -> `docs/reference/commands.md`: documented remote sandbox path verification for `share mount`. - Source summary: #3646 -> `docs/reference/commands.md`: included existing local writable mount target guidance in generated docs. - Source summary: #3642 -> `docs/inference/use-local-inference.md`, `docs/reference/commands.md`: documented managed-vLLM model override and gated-model token checks. - Source summary: #3639 -> `docs/reference/commands.md`: documented the 63-character sandbox name limit. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [x] Doc only (includes code sample changes) ## Verification - [ ] `npx prek run --all-files` passes - [ ] `npm test` passes - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [x] `make docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) Commit hooks passed for the staged files. A standalone `npx prek run --all-files` attempt was blocked by sandbox access to `/Users/miyoungc/.cache/prek/prek.log`, so that checkbox is left unchecked. --- <!-- DCO sign-off required by CI. Run: git config user.name && git config user.email --> Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Enhanced CLI command reference documentation with clearer guidance on onboarding, GPU passthrough, inference configuration, channel removal, and shared mounts. * Improved troubleshooting sections with better DNS resolution and GPU passthrough remediation steps. * Added documentation for overriding managed vLLM model selection. * Updated release notes for v0.0.45 reflecting infrastructure and workflow improvements. * **Version Bump** * Released v0.0.45. <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3755?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
Adds a scoped escape hatch on the express vLLM install path: setting
NEMOCLAW_VLLM_MODEL=<slug>picks a different entry from a small new registry, swaps in the rightvllm serveflags, and fails fast when a gated model is requested without a Hugging Face token (which is now forwarded into the managed-vLLM container sohf downloadandvllm servecan authenticate). An interactive model picker on the express path is not part of this PR — that remains a separate design item; the env-var path covers the gated-model coverage QA was trying to exercise.Related Issue
Fixes #3566
Fixes #3572
Changes
src/lib/inference/vllm-models.tswith a typed registry, env-var resolver, gated-access check, and a sharedbuildVllmServeCommandthat merges common flags with the model-specific args.nullwhenNEMOCLAW_VLLM_MODELis unset so the caller can keep the per-platform profile default (the generic-Linux profile must stay on Nemotron-Nano-4B for VRAM headroom).VllmProfileto carry adefaultModel: VllmModelDefinstead of staticmodel/commandstrings;installVllmresolves the model per call, validates gated access, builds the serve command on the fly, and forwardsHF_TOKEN/HUGGING_FACE_HUB_TOKENinto both thehf downloadone-shot and the long-livedvllm servecontainer using the-e KEY(key-only) form so the secret stays out of the docker argv; the value is added back via the runner'senvoption so docker can inherit it without breaking the subprocess-env allowlist.use-local-inference.md; add it to the env-var reference table incommands.mdto satisfy the env-var doc gate.detect-vllm-profile.test.tsto assert againstprofile.defaultModel.idand add coverage forbuildHfTokenDockerArgs+buildHfTokenForwardEnv(HF_TOKEN preferred, HUGGING_FACE_HUB_TOKEN fallback, empty/whitespace handled).vllm-models.test.tscovering the env-var resolver (slug + HF id forms, unset-returns-null), unknown-value error, gated-access check, and command-builder output.Type of Change
Verification
Signed-off-by: Tinson Lai tinsonl@nvidia.com
Summary by CodeRabbit
Release Notes
New Features
Documentation