feat(inference): switches the DGX Spark managed-vLLM profile to Qwen3.6-35B-A3B-NVFP4 #4619

Merged
cv merged 2 commits into main from feat/spark-vllm-qwen3-6-35b-nvfp4
Jun 1, 2026

Conversation

@zyang-dev

@zyang-dev zyang-dev commented Jun 1, 2026

Contributor

Summary

Switches the DGX Spark managed-vLLM profile to NVIDIA's Qwen3.6-35B-A3B-NVFP4 checkpoint on the upstream vllm/vllm-openai nightly image, with the FlashInfer/MoE env vars and serve flags that checkpoint needs.

Changes

  • Add nvidia/Qwen3.6-35B-A3B-NVFP4 (qwen3.6-35b-a3b-nvfp4) to the vLLM model registry with its modelopt/NVFP4/MoE/speculative + tool-call/reasoning flags and --load-format fastsafetensors.
  • Extend VllmModelDef with serveEnv; buildVllmServeCommand now prepends export K=V && … for those vars (FlashInfer MoE/FP8 backend, version-check skip, CUTE_DSL_ARCH=sm_121a).
  • Point the DGX Spark profile at UPSTREAM_VLLM_IMAGE (vllm/vllm-openai:nightly-1fc2cee50…, vllm 0.21.1rc1.dev323) + the new model; pin DGX Station and generic-Linux to NGC_VLLM_IMAGE (nvcr.io/nvidia/vllm:26.03.post1-py3) so their behavior is unchanged.
  • Update tests (vllm-models.test.ts, detect-vllm-profile.test.ts) and docs (use-local-inference.mdx, commands.mdx) for the new Spark default and recognised slug.
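
The `serveEnv` prefix described above can be sketched roughly as follows. This is a minimal illustration, not the code in `vllm-models.ts`: the names `ModelDefSketch` and `buildServeCommandSketch` and the `serveArgs` field are hypothetical, and the pip install step that the real command also contains is left out.

```typescript
// Simplified, hypothetical shape: the real VllmModelDef has more fields,
// and `serveArgs` is an assumed field name.
interface ModelDefSketch {
  id: string;
  serveArgs: string[];
  serveEnv?: Record<string, string>;
}

// Render serveEnv as `export K=V` fragments and chain them in front of
// `vllm serve` with `&&`, so the variables are set before vLLM starts.
function buildServeCommandSketch(model: ModelDefSketch): string {
  const exportsPrefix = Object.entries(model.serveEnv ?? {}).map(
    ([key, value]) => `export ${key}=${value}`,
  );
  const serve = ["vllm", "serve", model.id, ...model.serveArgs].join(" ");
  return [...exportsPrefix, serve].join(" && ");
}

// Example values taken from this PR description; the full flag set is omitted.
const sparkModel: ModelDefSketch = {
  id: "nvidia/Qwen3.6-35B-A3B-NVFP4",
  serveArgs: ["--load-format", "fastsafetensors"],
  serveEnv: { CUTE_DSL_ARCH: "sm_121a" },
};

const serveCommand = buildServeCommandSketch(sparkModel);
console.log(serveCommand);
```

With these inputs the sketch prints `export CUTE_DSL_ARCH=sm_121a && vllm serve nvidia/Qwen3.6-35B-A3B-NVFP4 --load-format fastsafetensors`. A model without `serveEnv` yields the plain `vllm serve` line, which is consistent with existing registry entries keeping their behavior.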

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • npm run docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>

Summary by CodeRabbit

  • New Features

    • Added support for Qwen3.6 35B-A3B NVFP4 as a managed vLLM model and made it the default for the Spark profile.
  • Documentation

    • Updated model defaults, recognized model slugs, and environment-variable mapping guidance.
  • Tests

    • Added/updated tests to cover the new model entry, default selections, and serving behavior.

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
@zyang-dev zyang-dev self-assigned this Jun 1, 2026
@coderabbitai

coderabbitai Bot commented Jun 1, 2026

Contributor

Review Change Stack

📝 Walkthrough


This PR adds the Qwen3.6 35B-A3B NVFP4 model to the vLLM model registry, introduces per-model serve environment exports, updates platform-specific vLLM container images and defaults, adjusts Docker download/start invocations, and updates tests and docs to reflect the new Spark default.

Changes

Qwen3.6 35B-A3B NVFP4 Model Integration

| Layer | File(s) | Summary |
| --- | --- | --- |
| Model definition contract and registry entry | src/lib/inference/vllm-models.ts | VllmModelDef interface adds optional serveEnv field for per-model environment exports; VLLM_MODELS registry gains Qwen3.6-35B-A3B-NVFP4 entry with model-specific serving flags and serveEnv configuration for FlashInfer/MoE backends; platform default comment updated to reference new Spark model. |
| Serve command generation with environment exports | src/lib/inference/vllm-models.ts | buildVllmServeCommand now prepends export key=value && ... prefix derived from model.serveEnv before the pip install and vllm serve invocation, enabling per-model environment setup at serve time. |
| Platform-specific images and model defaults | src/lib/inference/vllm.ts | Explicit vLLM image constants (upstream nightly and NGC) introduced; Spark profile configured with upstream image and new Qwen3.6 35B NVFP4 resolver; Station and generic Linux profiles switched to NGC image; platform-mapping comment updated. |
| Docker container invocation adjustments | src/lib/inference/vllm.ts | downloadModel() now uses explicit --entrypoint hf for one-shot model download; startContainer() changed to invoke serve command via /bin/bash -lc with explicit entrypoint override instead of prior bash -c form. |
| Model registry and command generation tests | src/lib/inference/vllm-models.test.ts | Test verifies qwen3.6-35b-a3b-nvfp4 registry entry exists and is not gated; separate test asserts buildVllmServeCommand includes required serveEnv exports, fastsafetensors dependency, NVFP4-specific flags, and shared GPU defaults without utilization override. |
| Profile detection integration tests | test/detect-vllm-profile.test.ts | Spark profile test assertions updated for new defaultModel.id and vLLM image tag; Station and generic Linux tests add assertions for NGC image usage. |
| Documentation updates for new model and slug | docs/inference/use-local-inference.mdx, docs/reference/commands.mdx | Managed vLLM profile table updated to reflect Spark's new default model; recognised-slugs table and reference docs expanded to include qwen3.6-35b-a3b-nvfp4 slug mapping; per-platform model defaults clarified. |
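
The two Docker invocations summarized in the table can be pictured as argv arrays. The sketch below is an assumption-laden illustration: the helper names are hypothetical, the argument order is a guess, and the GPU, IPC, port, and cache-mount flags that the real profiles pass are omitted. Only `--entrypoint hf`, `download`, and `--entrypoint /bin/bash ... -lc` come from the review text.

```typescript
// Hypothetical helpers: the real code also passes GPU, IPC, port, and
// cache-mount flags, which are omitted here.
function downloadArgvSketch(image: string, modelId: string): string[] {
  // Override the image entrypoint so the container runs `hf download` once.
  return ["docker", "run", "--rm", "--entrypoint", "hf", image, "download", modelId];
}

function serveArgvSketch(image: string, serveCommand: string): string[] {
  // Run the composed command line through a login shell inside the container.
  return ["docker", "run", "--entrypoint", "/bin/bash", image, "-lc", serveCommand];
}

// Image tag as quoted in this PR for the Station and generic-Linux profiles.
const ngcImage = "nvcr.io/nvidia/vllm:26.03.post1-py3";

const downloadArgv = downloadArgvSketch(ngcImage, "nvidia/Qwen3.6-35B-A3B-NVFP4");
const serveArgv = serveArgvSketch(ngcImage, "vllm serve nvidia/Qwen3.6-35B-A3B-NVFP4");
console.log(downloadArgv.join(" "));
console.log(serveArgv.length);
```

Keeping the serve command as a single argv element means the host side needs no extra quoting layer; only the shell inside the container parses it.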

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers:

  • cv

🐰 A rabbit hops through the models with grace,
Adding Qwen's strength to the serving space,
Environments exported, containers aligned,
Platform defaults refined,
Tests passing, docs bright—the stage is set right! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 42.86% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: switching DGX Spark's managed-vLLM profile to use Qwen3.6-35B-A3B-NVFP4, which is the primary objective of this PR. |


@github-actions

github-actions Bot commented Jun 1, 2026

Contributor

@zyang-dev zyang-dev added the v0.0.57 (Release target), Platform: DGX Spark, and provider: vllm (vLLM local or hosted provider behavior) labels Jun 1, 2026
@github-actions

github-actions Bot commented Jun 1, 2026

Contributor

E2E Advisor Recommendation

Required E2E: onboard-inference-smoke-e2e
Optional E2E: gpu-e2e, inference-routing-e2e, docs-validation-e2e

Dispatch hint: onboard-inference-smoke-e2e

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: medium

Required E2E

  • onboard-inference-smoke-e2e (low): This PR changes local inference onboarding behavior. This regression E2E verifies onboarding does not report success until the configured inference route can serve a real chat completion and surfaces actionable diagnostics on failure. It is the closest existing merge-blocking inference-onboard guard, though it does not exercise the managed vLLM container itself.

Optional E2E

  • gpu-e2e (high): Useful live confidence for local GPU inference onboarding, sandbox reachability, and in-sandbox inference on a GPU runner. It uses Ollama rather than vLLM, so it is adjacent coverage only for this PR's managed-vLLM changes.
  • inference-routing-e2e (medium): Validates gateway inference routing, credential isolation, and error classification around configured providers. It is useful because managed vLLM ultimately configures an inference route, but it does not validate the vLLM install/start command.
  • docs-validation-e2e (low): Checks documentation and CLI/reference drift after updates to local inference and command environment-variable docs.

New E2E recommendations

  • managed-vLLM DGX Spark install (high): No existing E2E appears to run NEMOCLAW_PROVIDER=install-vllm on a Spark-class host and verify the new nvidia/Qwen3.6-35B-A3B-NVFP4 default, upstream nightly image, hf download entrypoint, serve environment exports, vLLM readiness, /v1/models detection, and sandbox chat/completions through the configured local route.
    • Suggested test: Add a DGX Spark managed-vLLM E2E job that runs non-interactive onboarding with NEMOCLAW_PROVIDER=install-vllm, asserts the selected default model/image, waits for vLLM readiness, and sends a real OpenClaw chat completion through the sandbox.
  • managed-vLLM profile matrix (medium): This PR intentionally diverges Spark from Station and generic Linux defaults. Existing unit tests cover profile selection, but there is no E2E that validates the profile-specific Docker image/model defaults on real or emulated hosts.
    • Suggested test: Add an E2E or scenario job that exercises managed-vLLM profile selection for Spark, Station, and generic NVIDIA Linux, at least in dry-run/plan mode with assertions for image, default model, Docker flags, and command line.
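
A dry-run check for the profile matrix suggested above might look like the sketch below. The platform keys, the `PlanSketch` shape, and `diffPlan` are hypothetical names introduced for illustration; the image strings and the Spark model id are the ones quoted in this PR. Station and generic-Linux model defaults are not asserted because this PR does not name them.

```typescript
const UPSTREAM_VLLM_IMAGE =
  "vllm/vllm-openai:nightly-1fc2cee50a09a094b9f2bbdfcb0ab0cadb536712";
const NGC_VLLM_IMAGE = "nvcr.io/nvidia/vllm:26.03.post1-py3";

// Platform keys and the plan shape are hypothetical.
type PlatformSketch = "spark" | "station" | "linux";

interface PlanSketch {
  image: string;
  defaultModel?: string;
}

// Only values stated in this PR are listed here.
const expectedPlans: Record<PlatformSketch, PlanSketch> = {
  spark: { image: UPSTREAM_VLLM_IMAGE, defaultModel: "nvidia/Qwen3.6-35B-A3B-NVFP4" },
  station: { image: NGC_VLLM_IMAGE },
  linux: { image: NGC_VLLM_IMAGE },
};

// Return human-readable mismatches between a dry-run plan and the matrix.
function diffPlan(platform: PlatformSketch, actual: PlanSketch): string[] {
  const want = expectedPlans[platform];
  const problems: string[] = [];
  if (actual.image !== want.image) {
    problems.push(`${platform}: image ${actual.image} != ${want.image}`);
  }
  if (want.defaultModel !== undefined && actual.defaultModel !== want.defaultModel) {
    problems.push(`${platform}: model ${actual.defaultModel} != ${want.defaultModel}`);
  }
  return problems;
}

const okProblems = diffPlan("spark", expectedPlans.spark);
const driftProblems = diffPlan("station", { image: UPSTREAM_VLLM_IMAGE });
console.log(okProblems.length, driftProblems.length);
```

A matching plan produces no problems, while a Station plan that drifted onto the upstream nightly image produces one, which is the kind of divergence the recommendation is meant to catch.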

Dispatch hint

  • Workflow: regression-e2e.yaml
  • jobs input: onboard-inference-smoke-e2e

@github-actions

github-actions Bot commented Jun 1, 2026

Contributor

E2E Scenario Advisor Recommendation

Required scenario E2E: None
Optional scenario E2E: None

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

  • None. The behavioral changes are in the managed vLLM installer/model registry, but the current dispatchable scenario ROUTES table has no vLLM/install-vLLM scenario. Existing scenario routes cover cloud, OpenAI-compatible, Ollama GPU, platform, messaging, and lifecycle coverage, so no scenario E2E job would directly exercise this changed surface.

Optional scenario E2E

  • None.

Relevant changed files

  • src/lib/inference/vllm-models.ts
  • src/lib/inference/vllm.ts

@zyang-dev zyang-dev changed the title from "feat(inference): Switches the DGX Spark managed-vLLM profile to Qwen3.6-35B-A3B-NVFP4" to "feat(inference): switches the DGX Spark managed-vLLM profile to Qwen3.6-35B-A3B-NVFP4" Jun 1, 2026
@github-actions

github-actions Bot commented Jun 1, 2026

Contributor

PR Review Advisor

Findings: 0 needs attention, 3 worth checking, 0 nice ideas
Top item: Digest-pin or otherwise verify the new Spark vLLM image

Review findings

🛠️ Needs attention

  • None.

🔎 Worth checking

  • Digest-pin or verify the new Spark vLLM nightly image (src/lib/inference/vllm.ts:57): The DGX Spark managed-vLLM path now pulls Docker Hub image `vllm/vllm-openai:nightly-1fc2cee50a09a094b9f2bbdfcb0ab0cadb536712`. Docker tags are mutable, and this image is then run with GPU access, `--ipc=host`, a mounted Hugging Face cache, and vLLM `--trust-remote-code`. That makes the image reference an installer trust boundary for a high-risk inference/network path.
    • Recommendation: Use an immutable digest reference for the Spark image, or add an explicit image verification/update process and document why a mutable nightly tag is acceptable for this managed installer path.
    • Evidence: `UPSTREAM_VLLM_IMAGE` is a tag string at `src/lib/inference/vllm.ts:57`; `SPARK_PROFILE.image` uses it at line 119. The profile retains `--gpus all`, `--ipc=host`, and the HF cache mount, while shared vLLM args include `--trust-remote-code` in `src/lib/inference/vllm-models.ts`.
  • Quote or validate registry values before composing the shell serve command (src/lib/inference/vllm-models.ts:253): `buildVllmServeCommand()` now prepends `serveEnv` as raw `export ${key}=${value}` fragments and then joins model ids and args into a command executed through `bash -lc`. The current registry values are code-controlled literals, so this is not a user-controlled injection path today, but it creates a fragile local trust boundary where a future model entry with shell metacharacters could alter the command inside a GPU-enabled container.
    • Recommendation: Validate `serveEnv` keys against a strict environment-variable-name pattern and shell-quote values, or build the container command from argv-safe components. Add a regression test with metacharacters to prove registry additions cannot accidentally change shell structure.
    • Evidence: `serveEnv` is rendered with `.map(([key, value]) => `export ${key}=${value}`)` at line 253, and the resulting command is passed to Docker via `--entrypoint /bin/bash ... -lc ${JSON.stringify(buildVllmServeCommand(model))}` in `src/lib/inference/vllm.ts:353`.
  • Add runtime validation for the changed vLLM image and entrypoint behavior (src/lib/inference/vllm.ts:260): The tests assert profile selection and generated command strings, but the changed behavior depends on external image/runtime properties: the new upstream image must contain `hf` and `/bin/bash`, `--entrypoint hf ... download` must work for both the upstream and NGC images, and the Spark NVFP4 flags/env must boot to the ready marker on DGX Spark. This is an infrastructure path where string-level unit tests are not enough to catch image or entrypoint drift.
    • Recommendation: Add or reference targeted runtime validation for managed Spark vLLM startup, including the pre-download command and the `--entrypoint /bin/bash -lc` serve command. At minimum, add a unit-level test for the complete Docker argv/string composition and keep a DGX Spark smoke validation in the release checklist.
    • Evidence: New tests in `vllm-models.test.ts` and `detect-vllm-profile.test.ts` cover registry/profile strings. Runtime-sensitive changes are at `src/lib/inference/vllm.ts:260` (`--entrypoint hf`) and `src/lib/inference/vllm.ts:353` (`--entrypoint /bin/bash ... -lc`).
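
The first two recommendations above could be approached along these lines. This is a sketch under stated assumptions, not the project's code: `isDigestPinned`, `shellQuote`, and `renderServeEnvSketch` are hypothetical helpers, and the quoting shown is standard POSIX single-quote escaping rather than anything the repository is known to use.

```typescript
// An image reference is immutable only when it ends in an @sha256 digest.
function isDigestPinned(imageRef: string): boolean {
  return /@sha256:[0-9a-f]{64}$/.test(imageRef);
}

// Strict environment-variable-name pattern for serveEnv keys.
const ENV_NAME = /^[A-Za-z_][A-Za-z0-9_]*$/;

// POSIX single-quote escaping: close the quote, emit \', reopen the quote.
function shellQuote(value: string): string {
  return "'" + value.replace(/'/g, "'\\''") + "'";
}

// Validate keys and quote values before composing `export` fragments.
function renderServeEnvSketch(env: Record<string, string>): string[] {
  return Object.entries(env).map(([key, value]) => {
    if (!ENV_NAME.test(key)) {
      throw new Error(`invalid environment variable name: ${key}`);
    }
    return `export ${key}=${shellQuote(value)}`;
  });
}

// Regression-style inputs: one real value from this PR, one with metacharacters.
const rendered = renderServeEnvSketch({
  CUTE_DSL_ARCH: "sm_121a",
  DEMO: "x; rm -rf /",
});
const tagPinned = isDigestPinned(
  "vllm/vllm-openai:nightly-1fc2cee50a09a094b9f2bbdfcb0ab0cadb536712",
);
console.log(rendered.join(" && "), tagPinned);
```

In this sketch the metacharacter value stays inside single quotes, so it cannot change the shell structure, and the nightly tag quoted in the finding is reported as not digest-pinned.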

🌱 Nice ideas

  • None.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

@cv cv merged commit 9b40234 into main Jun 1, 2026
36 of 39 checks passed
@cv cv deleted the feat/spark-vllm-qwen3-6-35b-nvfp4 branch June 1, 2026 22:16
@wscurran wscurran added the platform: dgx-spark (Affects DGX Spark hardware or workflows) label Jun 3, 2026
cv pushed a commit that referenced this pull request Jun 3, 2026
## Summary
- Add the missing `v0.0.57` release-notes section with links to the
detailed docs pages for command, inference, onboarding, messaging,
status, installer, and policy changes.
- Remove public references to docs-skip terms from source docs and
regenerate the NemoClaw user skills from the current Fern MDX docs.
- Carry forward generated references for the per-agent documentation
split, including Hermes-specific reference files.

## Source summary
- #4615 and #4653 -> `docs/about/release-notes.mdx`,
`docs/reference/commands.mdx`: Release notes now cover host-side
`sessions` and `agents` commands plus `NEMOCLAW_EXTRA_AGENTS_JSON`
secondary-agent baking.
- #4163, #4204, #4611, #4619, and #4676 ->
`docs/about/release-notes.mdx`,
`docs/inference/use-local-inference.mdx`: Release notes now cover
managed vLLM progress/readiness, DGX Spark model default changes, local
Ollama streaming usage, and inference route divergence warnings.
- #4267, #4601, #4609, #4642, #4645, and #4661 ->
`docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Release
notes now cover UFW auto-remediation, local-inference reachability
gates, gateway reuse/binding, cancel rollback, and policy selection
persistence.
- #4577, #4582, #4607, and #4660 -> `docs/about/release-notes.mdx`,
`docs/manage-sandboxes/messaging-channels.mdx`: Release notes now cover
Slack validation, atomic `channels add`, WhatsApp QR diagnostics, and
Slack placeholder normalization.
- #4388, #4600, #4646, and #4647 -> `docs/about/release-notes.mdx`,
`docs/reference/commands.mdx`: Release notes now cover status failure
layers, paused-container hints, Docker-driver doctor behavior, and
non-destructive stale-registry recovery.
- #4569, #4579, and #4678 -> `docs/about/release-notes.mdx`,
`docs/manage-sandboxes/lifecycle.mdx`,
`docs/network-policy/integration-policy-examples.mdx`: Release notes now
cover installer tag pinning, PyPI `uv` policy access, and observable
Jira validation.
- #4632 -> `.agents/skills/`: Regenerated user skills from the current
per-agent docs source, including newly generated Hermes reference files.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `rg "permissive mode|shields down|shields up|shields status|config
rotate-token|rotate-token" docs --glob "*.mdx"`
- `rg "permissive mode|shields down|shields up|shields status|config
rotate-token|rotate-token" .agents/skills --glob "*.md"`
- `npm run docs`
- `npm run build:cli`
- Commit hooks: markdownlint, docs-to-skills verification, gitleaks,
skills YAML, commitlint

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Documentation**
  * Restructured documentation to clearly distinguish OpenClaw and Hermes agent variants throughout user guides.
  * Enhanced security, credential storage, and deployment guidance with clearer setup flows.
  * Added Hermes plugin installation and ecosystem documentation.
  * Improved workspace, messaging, and policy management references with variant-specific command examples.
  * Refined troubleshooting and CLI reference sections for clarity.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@wscurran wscurran added the feature (PR adds or expands user-visible functionality) label Jun 8, 2026

Labels

  • feature: PR adds or expands user-visible functionality
  • platform: dgx-spark: Affects DGX Spark hardware or workflows
  • provider: vllm: vLLM local or hosted provider behavior
  • v0.0.57: Release target


3 participants