
fix(vllm): raise Qwen3.6-35B-A3B-NVFP4 max-model-len to 131072 on Spark #4779

Merged
cv merged 1 commit into main from fix/spark-vllm-context-window
Jun 4, 2026

Conversation

@zyang-dev zyang-dev commented Jun 4, 2026

Contributor

Summary

Updates the DGX Spark managed vLLM profile for the Qwen3.6 35B-A3B NVFP4 model to use a 128K context window.
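
As a quick check of the arithmetic behind "128K": assuming K means 1024 tokens, the new limit is 128 × 1024 = 131072, exactly double the previous 65536. The constant names below are illustrative and are not taken from the repository.

```typescript
// Illustrative constants; the names are assumptions, not repository identifiers.
const TOKENS_PER_K = 1024;
const previousMaxModelLen = 65536; // the old --max-model-len (64K)
const newMaxModelLen = 128 * TOKENS_PER_K; // the new --max-model-len (128K)

console.log(newMaxModelLen); // 131072
console.log(newMaxModelLen / previousMaxModelLen); // 2
```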

Changes

  • Changed the Spark NVFP4 vLLM --max-model-len from 65536 to 131072.
  • Updated the registry test to assert --max-model-len 131072.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • npm run docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>

Summary by CodeRabbit

  • Bug Fixes
    • Updated the maximum model length configuration for the Qwen3.6-35B-A3B-NVFP4 model from 65536 to 131072.


coderabbitai Bot commented Jun 4, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 955372cd-d39e-4fc1-96e3-a4b1b9f565f2

📥 Commits

Reviewing files that changed from the base of the PR and between af7105c and 8f7c9c4.

📒 Files selected for processing (2)
  • src/lib/inference/vllm-models.test.ts
  • src/lib/inference/vllm-models.ts

Walkthrough

The vLLM model registry entry for nvidia/Qwen3.6-35B-A3B-NVFP4 doubles its maxModelLen from 65536 to 131072, which raises the value passed to the --max-model-len flag of the serve command. The test is updated to verify the new expected value.
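
To make the registry-to-flag relationship concrete, here is a minimal sketch. It assumes a simplified entry shape and a hypothetical `buildServeArgs` helper; the real `VLLM_MODELS` entry in src/lib/inference/vllm-models.ts likely has more fields and a different builder, and using `qwen3.6-35b-a3b-nvfp4` as the registry key is also an assumption.

```typescript
// Assumed, simplified entry shape; the real registry entry may differ.
interface VllmModelEntry {
  model: string;
  maxModelLen: number;
}

const qwenNvfp4Entry: VllmModelEntry = {
  model: "nvidia/Qwen3.6-35B-A3B-NVFP4",
  maxModelLen: 131072, // was 65536 before this PR
};

// Hypothetical registry keyed by the short model name.
const vllmModelsSketch = {
  "qwen3.6-35b-a3b-nvfp4": qwenNvfp4Entry,
};

// Hypothetical helper: turn a registry entry into `vllm serve` arguments.
function buildServeArgs(entry: VllmModelEntry): string[] {
  return ["serve", entry.model, "--max-model-len", String(entry.maxModelLen)];
}

const serveArgs = buildServeArgs(vllmModelsSketch["qwen3.6-35b-a3b-nvfp4"]);
console.log(["vllm", ...serveArgs].join(" "));
```

A unit test in the spirit of the updated one could then assert that `serveArgs` contains `--max-model-len` immediately followed by `131072`.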

Changes

Model Configuration and Test Update

| Layer / File(s) | Summary |
| --- | --- |
| Increase max model length and update test: src/lib/inference/vllm-models.ts, src/lib/inference/vllm-models.test.ts | The VLLM_MODELS registry entry for nvidia/Qwen3.6-35B-A3B-NVFP4 updates maxModelLen from 65536 to 131072, and the corresponding test assertion is updated to expect the new --max-model-len flag value. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#4619: Both PRs update the vLLM model registry and its test for the same qwen3.6-35b-a3b-nvfp4 model (in src/lib/inference/vllm-models.ts and vllm-models.test.ts); this PR specifically changes its maxModelLen (i.e., --max-model-len).
  • NVIDIA/NemoClaw#4689: This PR increases the vLLM --max-model-len for qwen3.6-35b-a3b-nvfp4, which changes the max_model_len that /v1/models reports. #4689 reads that runtime max_model_len and applies it to NEMOCLAW_CONTEXT_WINDOW, so the two PRs are connected through the same value.
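
The connection to #4689 can be sketched as follows. This is not the code from that PR: the response shape is a minimal assumption about an OpenAI-compatible /v1/models payload that carries `max_model_len`, and both helper names are hypothetical.

```typescript
// Assumed minimal shape of a /v1/models response that reports max_model_len.
interface ModelCard {
  id: string;
  max_model_len?: number;
}
interface ModelsResponse {
  data: ModelCard[];
}

// Hypothetical helper: find the served model's max_model_len, if reported.
function readMaxModelLen(resp: ModelsResponse, modelId: string): number | undefined {
  const card = resp.data.find((m) => m.id === modelId);
  return card?.max_model_len;
}

// Hypothetical helper: derive the NEMOCLAW_CONTEXT_WINDOW setting from it.
function contextWindowEnv(maxModelLen: number | undefined): Record<string, string> {
  return maxModelLen === undefined ? {} : { NEMOCLAW_CONTEXT_WINDOW: String(maxModelLen) };
}

// Sample payload written by hand for illustration, not captured from a server.
const sampleModels: ModelsResponse = {
  data: [{ id: "nvidia/Qwen3.6-35B-A3B-NVFP4", max_model_len: 131072 }],
};
const derivedEnv = contextWindowEnv(
  readMaxModelLen(sampleModels, "nvidia/Qwen3.6-35B-A3B-NVFP4"),
);
console.log(derivedEnv);
```

Returning an empty record when the field is missing keeps the sketch from inventing a context window that the server never reported.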

Suggested labels

Platform: DGX Spark, Provider: vLLM

Suggested reviewers

  • cv

Poem

🐰 A model grows twice as tall,
From 65K to twice that, all!
The test now cheers, 131072 bright,
Qwen's context expanded to new height. ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly describes the main change: updating the max-model-len value for the Qwen3.6-35B-A3B-NVFP4 model from 65536 to 131072, which aligns with modifications in both the vllm-models.ts and vllm-models.test.ts files. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


github-actions Bot commented Jun 4, 2026

Contributor

PR Review Advisor

Findings: 0 needs attention, 0 worth checking, 0 nice ideas
Top item: No actionable findings

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

github-actions Bot commented Jun 4, 2026

Contributor

E2E Advisor Recommendation

Required E2E: inference-routing-e2e
Optional E2E: gpu-e2e, onboard-inference-smoke-e2e

Dispatch hint: inference-routing-e2e

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: medium

Required E2E

  • inference-routing-e2e (medium): The PR changes inference-serving configuration. This is the closest existing merge-blocking E2E signal for assistant inference paths: it validates onboarding-created inference routes, inference.local access from the sandbox, OpenAI-compatible chat completion routing, and error classification.

Optional E2E

  • gpu-e2e (high): Useful adjacent confidence for local GPU inference onboarding and sandbox inference behavior, although it exercises Ollama rather than managed vLLM and will not verify the changed --max-model-len value.
  • onboard-inference-smoke-e2e (low): Adjacent regression coverage that setupInference fails unless the configured route serves a real chat completion. It does not exercise managed vLLM startup but is relevant to onboarding/inference success semantics.

New E2E recommendations

  • managed vLLM DGX Spark install (high): No existing E2E appears to start the managed vLLM container with NEMOCLAW_VLLM_MODEL=qwen3.6-35b-a3b-nvfp4 and assert the generated vllm serve command includes --max-model-len 131072, reaches /v1/models, and serves a sandbox chat completion through inference.local. The current change is only directly covered by unit tests.
    • Suggested test: Add a DGX Spark/NVIDIA GPU managed-vLLM E2E job that runs non-interactive onboarding with NEMOCLAW_PROVIDER=install-vllm and NEMOCLAW_VLLM_MODEL=qwen3.6-35b-a3b-nvfp4, then verifies container readiness, command flags, and a real OpenAI-compatible chat completion from inside the sandbox.
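
The "command flags" part of the suggested E2E check could look roughly like the sketch below. The `flagValue` helper is hypothetical, and the command string is an illustrative stand-in: the command the managed vLLM container actually runs likely carries additional flags.

```typescript
// Hypothetical helper: return the value of a CLI flag given as
// `--flag value` or `--flag=value`, or undefined when the flag is absent.
function flagValue(argv: string[], flag: string): string | undefined {
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === undefined) continue;
    if (arg === flag) return argv[i + 1];
    if (arg.startsWith(`${flag}=`)) return arg.slice(flag.length + 1);
  }
  return undefined;
}

// Illustrative stand-in for the generated command, not captured output.
const generatedCommand =
  "vllm serve nvidia/Qwen3.6-35B-A3B-NVFP4 --max-model-len 131072";

const observedMaxModelLen = flagValue(generatedCommand.split(/\s+/), "--max-model-len");
console.log(observedMaxModelLen === "131072");
```

Checking the flag's value rather than searching for the substring `131072` avoids a false pass when the number appears in some other argument.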

Dispatch hint

  • Workflow: .github/workflows/nightly-e2e.yaml
  • jobs input: inference-routing-e2e

github-actions Bot commented Jun 4, 2026

Contributor

E2E Scenario Advisor Recommendation

Required scenario E2E: None
Optional scenario E2E: None

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

  • None. The PR changes the vLLM model registry and its unit tests outside test/e2e-scenario/. The current dispatchable scenario ROUTES do not include a managed-vLLM or DGX Spark scenario that would exercise NEMOCLAW_VLLM_MODEL or the updated max-model-len for the NVFP4 checkpoint; existing scenario inference coverage is cloud NVIDIA/OpenAI-compatible or local Ollama.

Optional scenario E2E

  • None.

Relevant changed files

  • None.

zyang-dev added the v0.0.59 (Release target) label Jun 4, 2026
cv merged commit 88f8d66 into main Jun 4, 2026
32 checks passed
cv deleted the fix/spark-vllm-context-window branch June 4, 2026 20:50
cv pushed a commit that referenced this pull request Jun 5, 2026
## Summary
- Add the v0.0.59 release notes from the GitHub announcement discussion.
- Refresh local inference and credential-storage guidance for the
current release behavior.
- Regenerate the user skills from the updated Fern docs.
- Tighten release-prep and docs review guidance for generated skills, PR
labels, and shared `$$nemoclaw` command placeholders.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx`
- `rg "permissive mode|shields down|shields up|shields status|config rotate-token|rotate-token" --glob '*.{md,mdx}'`
- `git diff --check`
- `npm run docs` (rerun outside sandbox after sandbox-only `tsx` IPC
permission failure)
- `npm run typecheck:cli`
- Pre-commit hooks during commit passed, including markdownlint,
docs-to-skills verification, gitleaks, commitlint, and skills YAML
tests.

## Source Summary
- #3679, #4437, #4681, #4766, #4772, #4775, #4786 ->
`docs/about/release-notes.mdx`, `docs/reference/commands.mdx`,
`docs/reference/troubleshooting.mdx`: Summarize OpenClaw 2026.5.27
compatibility, runtime path pinning, plugin registry recovery, live
gateway reconciliation, and clearer host-alias/startup diagnostics.
- #4332, #4402, #4769, #4776, #4779 -> `docs/about/release-notes.mdx`,
`docs/inference/inference-options.mdx`,
`docs/inference/use-local-inference.mdx`,
`docs/inference/switch-inference-providers.mdx`: Document the release
inference changes covering Local NIM waits, Hermes Anthropic routing,
Nemotron 3 Ultra, the current Ollama starter fallback, and Spark
managed-vLLM context length.
- #4628, #4652, #4733, #4745 -> `docs/about/release-notes.mdx`,
`docs/security/credential-storage.mdx`,
`docs/manage-sandboxes/messaging-channels.mdx`,
`docs/reference/troubleshooting.mdx`: Capture permission healing,
gateway-stored credential reuse, cross-sandbox messaging credential
conflict checks, and CDI preflight diagnostics.
- #4728, #4737, #4743, #4744, #4782 -> `.agents/skills/nemoclaw-user-*`:
Regenerate the user skill references from the updated source docs.
- Follow-up maintenance ->
`.agents/skills/nemoclaw-contributor-update-docs/SKILL.md`,
`.coderabbit.yaml`: Add release-prep area labels for docs and skills
PRs, and teach docs review guidance that `$$nemoclaw` is the correct
shared command placeholder for examples that work across agent aliases.

Note: the `documentation` label was not present in the repository, so
this PR is labeled with `v0.0.59` only.

## Summary by CodeRabbit

* **Documentation**
  * Updated default model for local Ollama inference setup to qwen3.5:9b
  * Added Nemotron 3 Ultra 550B as an NVIDIA Endpoints model option
  * Clarified credential storage and reuse behavior for post-deployment (day-two) operations
  * Added v0.0.59 release notes covering OpenClaw compatibility, inference options, Hermes messaging sync, and troubleshooting
  * Clarified CLI selection guidance and updated OpenClaw version example in status output
  * Revised release-prep instructions and docs review guidance for CLI alias usage
wscurran added the following labels Jun 5, 2026:
  • area: inference (Inference routing, serving, model selection, or outputs)
  • bug-fix (PR fixes a bug or regression)
  • platform: dgx-spark (Affects DGX Spark hardware or workflows)
  • provider: vllm (vLLM local or hosted provider behavior)

3 participants