fix(inference): use vLLM runtime max_model_len for context window by zyang-dev · Pull Request #4689 · NVIDIA/NemoClaw

zyang-dev · 2026-06-02T23:54:50Z

Summary

Detect vLLM's runtime context window from /v1/models.max_model_len during onboarding instead of leaving local vLLM sandboxes on NemoClaw's default context size. This lets generated OpenClaw config reflect the actual vLLM server limit.

Changes

Added vLLM runtime context-window parsing from /v1/models.
Applied detected max_model_len to NEMOCLAW_CONTEXT_WINDOW before sandbox config generation.
Preserved explicit user-provided context-window overrides.
Added focused coverage for valid, missing, malformed, over-ceiling, and model-id-matched vLLM responses.
Updated onboarding coverage for a running vLLM server reporting max_model_len.

Type of Change

Code change (feature, bug fix, or refactor)
Code change with doc updates
Doc only (prose changes, no code sample modifications)
Doc only (includes code sample changes)

Verification

npx prek run --all-files passes
npm test passes
Tests added or updated for new or changed behavior
No secrets, API keys, or credentials committed
Docs updated for user-facing behavior changes
npm run docs builds without warnings (doc changes only)
Doc pages follow the style guide (doc changes only)
New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: zyang-dev 267119621+zyang-dev@users.noreply.github.com

Summary by CodeRabbit

New Features
- Auto-detect and apply vLLM context window from model responses during setup, respecting explicit overrides and a safe maximum ceiling.
Bug Fixes
- Ignore missing, malformed, non‑positive, or implausibly large reported context sizes; prefer matching modelId or fall back to the first model.
Tests
- Added tests covering detection, validation, selection logic, logging, and onboarding propagation/rollback behavior.

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>

coderabbitai · 2026-06-02T23:55:01Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0a54cbd0-258d-4ee3-98c4-c1e37ecffb2e

📥 Commits

Reviewing files that changed from the base of the PR and between efcbe47 and 2ce7ecf.

📒 Files selected for processing (2)

src/lib/onboard.ts
test/onboard-selection.test.ts

🚧 Files skipped from review as they are similar to previous changes (1)

src/lib/onboard.ts

📝 Walkthrough

Walkthrough

This PR adds vLLM runtime context window auto-detection: a helper reads max_model_len from vLLM /v1/models, validates it, and sets NEMOCLAW_CONTEXT_WINDOW when appropriate. The helper is exported via local.ts, invoked during onboarding, and covered by unit and integration tests.

Changes

vLLM Runtime Context Window Auto-Detection

Layer / File(s)	Summary
vLLM context window detection core `src/lib/inference/vllm-runtime-context.ts`	New module exports `applyVllmRuntimeContextWindow` that auto-detects token limits from vLLM `/v1/models` `max_model_len`, validates values against a ceiling of `4_194_304`, logs decisions, and updates `NEMOCLAW_CONTEXT_WINDOW` only when no explicit override exists.
Unit tests for context detection `src/lib/inference/vllm-runtime-context.test.ts`	Vitest suite validates correct `max_model_len` application, no-op behavior for omitted values, validation warnings for malformed/non-positive/implausibly large values, model-id matching with fallback, and preservation of explicit overrides; tests capture logger output and mutated env.
Wrapper export and onboarding integration `src/lib/inference/local.ts`, `src/lib/onboard.ts`	Wrapper exported from `local.ts` forwards to the new helper and onboarding calls it after vLLM model detection; a comment clarifies the `--tool-call-parser` behavior.
End-to-end onboarding integration tests `test/onboard-selection.test.ts`	Integration tests mock vLLM `/v1/models` with `max_model_len: 65536`, verify onboarding propagates it to `NEMOCLAW_CONTEXT_WINDOW` and logs the used value, and add a test ensuring the detected value is not applied when validation returns to provider selection.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

Local Models, fix, enhancement: inference

Suggested reviewers

cv

Poem

🐇 I found a model list in a vLLM stream,
max_model_len shining like a dream,
I parsed with care, warned when a value was odd,
Set the window when sensible — nod, nod, nod,
Hoppity hops, the context fits the team.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'fix(inference): use vLLM runtime max_model_len for context window' directly matches the main change—auto-detecting and applying vLLM's max_model_len to configure the context window during onboarding.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/vllm-runtime-context-window

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-06-02T23:57:10Z

E2E Advisor Recommendation

Required E2E: cloud-onboard-e2e, inference-routing-e2e
Optional E2E: onboard-inference-smoke-e2e, gpu-e2e

Dispatch hint: cloud-onboard-e2e,inference-routing-e2e

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: medium

Required E2E

cloud-onboard-e2e (medium): Runs the real install/onboard flow on a clean runner. Although it uses cloud inference rather than vLLM, it is the closest existing merge-blocking E2E for src/lib/onboard.ts changes and guards against regressions in provider selection/onboarding state.
inference-routing-e2e (medium): Validates end-to-end inference routing through OpenShell gateway/proxy paths and provider configuration. The changed local inference/context-window plumbing can affect runtime inference behavior, so this provides the best existing route-level E2E signal.

Optional E2E

onboard-inference-smoke-e2e (low): Hermetic regression E2E for onboard inference validation and diagnostics. Useful adjacent confidence for onboarding/inference changes, but it does not exercise the local vLLM /v1/models max_model_len path directly.
gpu-e2e (high): Exercises real local inference onboarding and sandbox inference on a GPU runner, but it uses Ollama rather than vLLM. Run if extra confidence is desired for local-provider wiring and context/environment interactions.

New E2E recommendations

local vLLM onboarding (high): No existing E2E appears to start or mock a vLLM-compatible /v1/models endpoint during real onboarding and assert that max_model_len becomes NEMOCLAW_CONTEXT_WINDOW only after validation succeeds and is preserved through sandbox inference.
- Suggested test: Add a local-vllm-context-window-e2e job that runs onboarding with NEMOCLAW_PROVIDER=vllm against a hermetic vLLM/OpenAI-compatible mock returning /v1/models with max_model_len, verifies the configured context window, and performs an inference.local chat completion from the sandbox.

Dispatch hint

Workflow: nightly-e2e.yaml
jobs input: cloud-onboard-e2e,inference-routing-e2e

github-actions · 2026-06-02T23:57:11Z

E2E Scenario Advisor Recommendation

Required scenario E2E: None
Optional scenario E2E: ubuntu-repo-cloud-openclaw

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: medium

Required scenario E2E

None.

Optional scenario E2E

ubuntu-repo-cloud-openclaw: The PR changes onboarding and local inference helper code, but the modified path is specific to local vLLM runtime context detection and there is no vLLM scenario ID in the dispatchable ROUTES table. This Ubuntu repo cloud OpenClaw scenario is the closest low-cost adjacent coverage for onboarding, gateway, sandbox, and inference-local wiring.
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Relevant changed files

src/lib/inference/local.ts
src/lib/inference/vllm-runtime-context.ts
src/lib/onboard.ts

github-actions · 2026-06-02T23:58:46Z

PR Review Advisor

Findings: 0 needs attention, 2 worth checking, 0 nice ideas
Since last review: 2 prior items resolved, 2 still apply, 0 new items found

Review findings

🛠️ Needs attention

None.

🔎 Worth checking

Source-of-truth review needed: src/lib/inference/vllm-runtime-context.ts tolerant parsing of vLLM /v1/models max_model_len: The advisor marked localized patch analysis as needs_followup.
- Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
- Evidence: The helper implements compatibility no-ops and warnings, while the analogous Ollama helper includes a comment documenting source boundary, tolerated states, regression coverage, and removal condition.
Document the source boundary for tolerant vLLM max_model_len parsing (src/lib/inference/vllm-runtime-context.ts:26): The helper intentionally tolerates missing, empty, malformed, non-positive, unsafe, and over-ceiling max_model_len values from the vLLM /v1/models response. The invalid states are tested, but the code does not record the source-of-truth rationale or removal condition. That leaves future maintainers without the same context already present in the analogous Ollama runtime-context helper.
- Recommendation: Add a short comment near the parsing logic that identifies the external vLLM-compatible /v1/models boundary, the tolerated invalid states, why NemoClaw cannot enforce that producer contract in this PR, the regression coverage, and when the fallback/no-op behavior can be removed.
- Evidence: src/lib/inference/vllm-runtime-context.ts silently returns for absent/empty max_model_len and warns for malformed or over-ceiling values; src/lib/inference/ollama-runtime-context.ts documents the same class of compatibility tolerance with source boundary, regression coverage, and removal condition.

🌱 Nice ideas

None.

Since last review details

Current findings:

Source-of-truth review needed: src/lib/inference/vllm-runtime-context.ts tolerant parsing of vLLM /v1/models max_model_len: The advisor marked localized patch analysis as needs_followup.
- Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
- Evidence: The helper implements compatibility no-ops and warnings, while the analogous Ollama helper includes a comment documenting source boundary, tolerated states, regression coverage, and removal condition.
Document the source boundary for tolerant vLLM max_model_len parsing (src/lib/inference/vllm-runtime-context.ts:26): The helper intentionally tolerates missing, empty, malformed, non-positive, unsafe, and over-ceiling max_model_len values from the vLLM /v1/models response. The invalid states are tested, but the code does not record the source-of-truth rationale or removal condition. That leaves future maintainers without the same context already present in the analogous Ollama runtime-context helper.
- Recommendation: Add a short comment near the parsing logic that identifies the external vLLM-compatible /v1/models boundary, the tolerated invalid states, why NemoClaw cannot enforce that producer contract in this PR, the regression coverage, and when the fallback/no-op behavior can be removed.
- Evidence: src/lib/inference/vllm-runtime-context.ts silently returns for absent/empty max_model_len and warns for malformed or over-ceiling values; src/lib/inference/ollama-runtime-context.ts documents the same class of compatibility tolerance with source boundary, regression coverage, and removal condition.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

test/onboard-selection.test.ts (1)

570-577: ⚡ Quick win

Unset NEMOCLAW_CONTEXT_WINDOW instead of passing an empty string.

This test is meant to cover the "no explicit override" path, but NEMOCLAW_CONTEXT_WINDOW: "" still presents the variable to the child process. That can mask regressions if the production code ever distinguishes unset from empty. Build the child env first and delete the key so the fixture matches the contract you're asserting.

♻️ Proposed fix

-    const result = spawnSync(process.execPath, [scriptPath], {
+    const childEnv = {
+      ...process.env,
+      HOME: tmpDir,
+      PATH: `${fakeBin}:${process.env.PATH || ""}`,
+      NEMOCLAW_EXPERIMENTAL: "",
+      NEMOCLAW_PROVIDER: "",
+    };
+    delete childEnv.NEMOCLAW_CONTEXT_WINDOW;
+
+    const result = spawnSync(process.execPath, [scriptPath], {
       cwd: repoRoot,
       encoding: "utf-8",
-      env: {
-        ...process.env,
-        HOME: tmpDir,
-        PATH: `${fakeBin}:${process.env.PATH || ""}`,
-        NEMOCLAW_EXPERIMENTAL: "",
-        NEMOCLAW_PROVIDER: "",
-        NEMOCLAW_CONTEXT_WINDOW: "",
-      },
+      env: childEnv,
     });

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/onboard-selection.test.ts` around lines 570 - 577, The test currently
sets NEMOCLAW_CONTEXT_WINDOW to an empty string in the child process env, which
differs from an unset variable; modify the test to construct the env object
(spreading process.env, HOME: tmpDir, PATH: `${fakeBin}:${process.env.PATH ||
""}`, etc.), then explicitly remove the key (delete env.NEMOCLAW_CONTEXT_WINDOW)
before passing it to the child process so the child sees the variable as unset;
update the code around the env construction in onboard-selection.test.ts (look
for the env object creation referencing tmpDir and fakeBin) to perform the
delete rather than assigning "".

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Line 5082: The call to
localInference.applyVllmRuntimeContextWindow(vllmModels, detectedModel) mutates
global onboarding state (NEMOCLAW_CONTEXT_WINDOW) before
validateOpenAiLikeSelection(...) completes, allowing the vLLM context window to
leak if validation does a continue in selectionLoop; move or delay the mutation
so it only runs after validateOpenAiLikeSelection returns success. Specifically,
remove or defer the applyVllmRuntimeContextWindow invocation at the current spot
and instead invoke it immediately after validateOpenAiLikeSelection(...)
confirms the selection (inside the same control path where selectionLoop would
not continue), ensuring the mutation occurs only for confirmed vLLM selections
(reference localInference.applyVllmRuntimeContextWindow,
validateOpenAiLikeSelection, NEMOCLAW_CONTEXT_WINDOW, and selectionLoop).

---

Nitpick comments:
In `@test/onboard-selection.test.ts`:
- Around line 570-577: The test currently sets NEMOCLAW_CONTEXT_WINDOW to an
empty string in the child process env, which differs from an unset variable;
modify the test to construct the env object (spreading process.env, HOME:
tmpDir, PATH: `${fakeBin}:${process.env.PATH || ""}`, etc.), then explicitly
remove the key (delete env.NEMOCLAW_CONTEXT_WINDOW) before passing it to the
child process so the child sees the variable as unset; update the code around
the env construction in onboard-selection.test.ts (look for the env object
creation referencing tmpDir and fakeBin) to perform the delete rather than
assigning "".

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b5f4aeca-8945-41a4-b526-9d59a52e174e

📥 Commits

Reviewing files that changed from the base of the PR and between f17a19a and efcbe47.

📒 Files selected for processing (5)

src/lib/inference/local.ts
src/lib/inference/vllm-runtime-context.test.ts
src/lib/inference/vllm-runtime-context.ts
src/lib/onboard.ts
test/onboard-selection.test.ts

cv · 2026-06-03T00:03:24Z

@zyang-dev can you verify the coderabbit/pr advisor findings, please?

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>

zyang-dev · 2026-06-03T00:21:35Z

@cv addressed coderabbit/pr advisor findings.

fix(inference): use vLLM runtime max_model_len for context window

efcbe47

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>

cv approved these changes Jun 2, 2026

View reviewed changes

cv added the v0.0.57 Release target label Jun 2, 2026

coderabbitai Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread src/lib/onboard.ts Outdated

cv added v0.0.58 Release target and removed v0.0.57 Release target labels Jun 3, 2026

zyang-dev added the provider: vllm vLLM local or hosted provider behavior label Jun 3, 2026

fix(onboard): apply vLLM context window after validation

2ce7ecf

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>

cv merged commit ac5b144 into main Jun 3, 2026
31 checks passed

cv deleted the fix/vllm-runtime-context-window branch June 3, 2026 00:59

miyoungc mentioned this pull request Jun 4, 2026

docs: refresh 0.0.58 release docs and refresh skills #4743

Merged

coderabbitai Bot mentioned this pull request Jun 4, 2026

fix(vllm): raise Qwen3.6-35B-A3B-NVFP4 max-model-len to 131072 on Spark #4779

Merged

12 tasks

wscurran added the bug-fix PR fixes a bug or regression label Jun 8, 2026

Conversation

zyang-dev commented Jun 2, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Type of Change

Verification

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

E2E Advisor Recommendation

E2E Recommendation Advisor

Required E2E

Optional E2E

New E2E recommendations

Dispatch hint

Uh oh!

github-actions Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

E2E Scenario Advisor Recommendation

E2E Scenario Advisor

Required scenario E2E

Optional scenario E2E

Relevant changed files

Uh oh!

github-actions Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review Advisor

🛠️ Needs attention

🔎 Worth checking

🌱 Nice ideas

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cv commented Jun 3, 2026

Uh oh!

zyang-dev commented Jun 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

zyang-dev commented Jun 2, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 2, 2026 •

edited

Loading

github-actions Bot commented Jun 2, 2026 •

edited

Loading

github-actions Bot commented Jun 2, 2026 •

edited

Loading

github-actions Bot commented Jun 2, 2026 •

edited

Loading