Skip to content

fix(inference): use vLLM runtime max_model_len for context window#4689

Merged
cv merged 2 commits into
mainfrom
fix/vllm-runtime-context-window
Jun 3, 2026
Merged

fix(inference): use vLLM runtime max_model_len for context window#4689
cv merged 2 commits into
mainfrom
fix/vllm-runtime-context-window

Conversation

@zyang-dev

@zyang-dev zyang-dev commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Summary

Detect vLLM's runtime context window from /v1/models.max_model_len during onboarding instead of leaving local vLLM sandboxes on NemoClaw's default context size. This lets generated OpenClaw config reflect the actual vLLM server limit.

Changes

  • Added vLLM runtime context-window parsing from /v1/models.
  • Applied detected max_model_len to NEMOCLAW_CONTEXT_WINDOW before sandbox config generation.
  • Preserved explicit user-provided context-window overrides.
  • Added focused coverage for valid, missing, malformed, over-ceiling, and model-id-matched vLLM responses.
  • Updated onboarding coverage for a running vLLM server reporting max_model_len.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • npm run docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: zyang-dev 267119621+zyang-dev@users.noreply.github.com

Summary by CodeRabbit

  • New Features

    • Auto-detect and apply vLLM context window from model responses during setup, respecting explicit overrides and a safe maximum ceiling.
  • Bug Fixes

    • Ignore missing, malformed, non‑positive, or implausibly large reported context sizes; prefer matching modelId or fall back to the first model.
  • Tests

    • Added tests covering detection, validation, selection logic, logging, and onboarding propagation/rollback behavior.

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
@coderabbitai

coderabbitai Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0a54cbd0-258d-4ee3-98c4-c1e37ecffb2e

📥 Commits

Reviewing files that changed from the base of the PR and between efcbe47 and 2ce7ecf.

📒 Files selected for processing (2)
  • src/lib/onboard.ts
  • test/onboard-selection.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/onboard.ts

📝 Walkthrough

Walkthrough

This PR adds vLLM runtime context window auto-detection: a helper reads max_model_len from vLLM /v1/models, validates it, and sets NEMOCLAW_CONTEXT_WINDOW when appropriate. The helper is exported via local.ts, invoked during onboarding, and covered by unit and integration tests.

Changes

vLLM Runtime Context Window Auto-Detection

Layer / File(s) Summary
vLLM context window detection core
src/lib/inference/vllm-runtime-context.ts
New module exports applyVllmRuntimeContextWindow that auto-detects token limits from vLLM /v1/models max_model_len, validates values against a ceiling of 4_194_304, logs decisions, and updates NEMOCLAW_CONTEXT_WINDOW only when no explicit override exists.
Unit tests for context detection
src/lib/inference/vllm-runtime-context.test.ts
Vitest suite validates correct max_model_len application, no-op behavior for omitted values, validation warnings for malformed/non-positive/implausibly large values, model-id matching with fallback, and preservation of explicit overrides; tests capture logger output and mutated env.
Wrapper export and onboarding integration
src/lib/inference/local.ts, src/lib/onboard.ts
Wrapper exported from local.ts forwards to the new helper and onboarding calls it after vLLM model detection; a comment clarifies the --tool-call-parser behavior.
End-to-end onboarding integration tests
test/onboard-selection.test.ts
Integration tests mock vLLM /v1/models with max_model_len: 65536, verify onboarding propagates it to NEMOCLAW_CONTEXT_WINDOW and logs the used value, and add a test ensuring the detected value is not applied when validation returns to provider selection.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

Local Models, fix, enhancement: inference

Suggested reviewers

  • cv

Poem

🐇 I found a model list in a vLLM stream,
max_model_len shining like a dream,
I parsed with care, warned when a value was odd,
Set the window when sensible — nod, nod, nod,
Hoppity hops, the context fits the team.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix(inference): use vLLM runtime max_model_len for context window' directly matches the main change—auto-detecting and applying vLLM's max_model_len to configure the context window during onboarding.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/vllm-runtime-context-window

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: cloud-onboard-e2e, inference-routing-e2e
Optional E2E: onboard-inference-smoke-e2e, gpu-e2e

Dispatch hint: cloud-onboard-e2e,inference-routing-e2e

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: medium

Required E2E

  • cloud-onboard-e2e (medium): Runs the real install/onboard flow on a clean runner. Although it uses cloud inference rather than vLLM, it is the closest existing merge-blocking E2E for src/lib/onboard.ts changes and guards against regressions in provider selection/onboarding state.
  • inference-routing-e2e (medium): Validates end-to-end inference routing through OpenShell gateway/proxy paths and provider configuration. The changed local inference/context-window plumbing can affect runtime inference behavior, so this provides the best existing route-level E2E signal.

Optional E2E

  • onboard-inference-smoke-e2e (low): Hermetic regression E2E for onboard inference validation and diagnostics. Useful adjacent confidence for onboarding/inference changes, but it does not exercise the local vLLM /v1/models max_model_len path directly.
  • gpu-e2e (high): Exercises real local inference onboarding and sandbox inference on a GPU runner, but it uses Ollama rather than vLLM. Run if extra confidence is desired for local-provider wiring and context/environment interactions.

New E2E recommendations

  • local vLLM onboarding (high): No existing E2E appears to start or mock a vLLM-compatible /v1/models endpoint during real onboarding and assert that max_model_len becomes NEMOCLAW_CONTEXT_WINDOW only after validation succeeds and is preserved through sandbox inference.
    • Suggested test: Add a local-vllm-context-window-e2e job that runs onboarding with NEMOCLAW_PROVIDER=vllm against a hermetic vLLM/OpenAI-compatible mock returning /v1/models with max_model_len, verifies the configured context window, and performs an inference.local chat completion from the sandbox.

Dispatch hint

  • Workflow: nightly-e2e.yaml
  • jobs input: cloud-onboard-e2e,inference-routing-e2e

@github-actions

github-actions Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

E2E Scenario Advisor Recommendation

Required scenario E2E: None
Optional scenario E2E: ubuntu-repo-cloud-openclaw

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: medium

Required scenario E2E

  • None.

Optional scenario E2E

  • ubuntu-repo-cloud-openclaw: The PR changes onboarding and local inference helper code, but the modified path is specific to local vLLM runtime context detection and there is no vLLM scenario ID in the dispatchable ROUTES table. This Ubuntu repo cloud OpenClaw scenario is the closest low-cost adjacent coverage for onboarding, gateway, sandbox, and inference-local wiring.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Relevant changed files

  • src/lib/inference/local.ts
  • src/lib/inference/vllm-runtime-context.ts
  • src/lib/onboard.ts

@github-actions

github-actions Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

PR Review Advisor

Findings: 0 needs attention, 2 worth checking, 0 nice ideas
Since last review: 2 prior items resolved, 2 still apply, 0 new items found

Review findings

🛠️ Needs attention

  • None.

🔎 Worth checking

  • Source-of-truth review needed: src/lib/inference/vllm-runtime-context.ts tolerant parsing of vLLM /v1/models max_model_len: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: The helper implements compatibility no-ops and warnings, while the analogous Ollama helper includes a comment documenting source boundary, tolerated states, regression coverage, and removal condition.
  • Document the source boundary for tolerant vLLM max_model_len parsing (src/lib/inference/vllm-runtime-context.ts:26): The helper intentionally tolerates missing, empty, malformed, non-positive, unsafe, and over-ceiling max_model_len values from the vLLM /v1/models response. The invalid states are tested, but the code does not record the source-of-truth rationale or removal condition. That leaves future maintainers without the same context already present in the analogous Ollama runtime-context helper.
    • Recommendation: Add a short comment near the parsing logic that identifies the external vLLM-compatible /v1/models boundary, the tolerated invalid states, why NemoClaw cannot enforce that producer contract in this PR, the regression coverage, and when the fallback/no-op behavior can be removed.
    • Evidence: src/lib/inference/vllm-runtime-context.ts silently returns for absent/empty max_model_len and warns for malformed or over-ceiling values; src/lib/inference/ollama-runtime-context.ts documents the same class of compatibility tolerance with source boundary, regression coverage, and removal condition.

🌱 Nice ideas

  • None.
Since last review details

Current findings:

  • Source-of-truth review needed: src/lib/inference/vllm-runtime-context.ts tolerant parsing of vLLM /v1/models max_model_len: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: The helper implements compatibility no-ops and warnings, while the analogous Ollama helper includes a comment documenting source boundary, tolerated states, regression coverage, and removal condition.
  • Document the source boundary for tolerant vLLM max_model_len parsing (src/lib/inference/vllm-runtime-context.ts:26): The helper intentionally tolerates missing, empty, malformed, non-positive, unsafe, and over-ceiling max_model_len values from the vLLM /v1/models response. The invalid states are tested, but the code does not record the source-of-truth rationale or removal condition. That leaves future maintainers without the same context already present in the analogous Ollama runtime-context helper.
    • Recommendation: Add a short comment near the parsing logic that identifies the external vLLM-compatible /v1/models boundary, the tolerated invalid states, why NemoClaw cannot enforce that producer contract in this PR, the regression coverage, and when the fallback/no-op behavior can be removed.
    • Evidence: src/lib/inference/vllm-runtime-context.ts silently returns for absent/empty max_model_len and warns for malformed or over-ceiling values; src/lib/inference/ollama-runtime-context.ts documents the same class of compatibility tolerance with source boundary, regression coverage, and removal condition.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

@cv cv added the v0.0.57 Release target label Jun 2, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
test/onboard-selection.test.ts (1)

570-577: ⚡ Quick win

Unset NEMOCLAW_CONTEXT_WINDOW instead of passing an empty string.

This test is meant to cover the "no explicit override" path, but NEMOCLAW_CONTEXT_WINDOW: "" still presents the variable to the child process. That can mask regressions if the production code ever distinguishes unset from empty. Build the child env first and delete the key so the fixture matches the contract you're asserting.

♻️ Proposed fix
-    const result = spawnSync(process.execPath, [scriptPath], {
+    const childEnv = {
+      ...process.env,
+      HOME: tmpDir,
+      PATH: `${fakeBin}:${process.env.PATH || ""}`,
+      NEMOCLAW_EXPERIMENTAL: "",
+      NEMOCLAW_PROVIDER: "",
+    };
+    delete childEnv.NEMOCLAW_CONTEXT_WINDOW;
+
+    const result = spawnSync(process.execPath, [scriptPath], {
       cwd: repoRoot,
       encoding: "utf-8",
-      env: {
-        ...process.env,
-        HOME: tmpDir,
-        PATH: `${fakeBin}:${process.env.PATH || ""}`,
-        NEMOCLAW_EXPERIMENTAL: "",
-        NEMOCLAW_PROVIDER: "",
-        NEMOCLAW_CONTEXT_WINDOW: "",
-      },
+      env: childEnv,
     });
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/onboard-selection.test.ts` around lines 570 - 577, The test currently
sets NEMOCLAW_CONTEXT_WINDOW to an empty string in the child process env, which
differs from an unset variable; modify the test to construct the env object
(spreading process.env, HOME: tmpDir, PATH: `${fakeBin}:${process.env.PATH ||
""}`, etc.), then explicitly remove the key (delete env.NEMOCLAW_CONTEXT_WINDOW)
before passing it to the child process so the child sees the variable as unset;
update the code around the env construction in onboard-selection.test.ts (look
for the env object creation referencing tmpDir and fakeBin) to perform the
delete rather than assigning "".
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Line 5082: The call to
localInference.applyVllmRuntimeContextWindow(vllmModels, detectedModel) mutates
global onboarding state (NEMOCLAW_CONTEXT_WINDOW) before
validateOpenAiLikeSelection(...) completes, allowing the vLLM context window to
leak if validation does a continue in selectionLoop; move or delay the mutation
so it only runs after validateOpenAiLikeSelection returns success. Specifically,
remove or defer the applyVllmRuntimeContextWindow invocation at the current spot
and instead invoke it immediately after validateOpenAiLikeSelection(...)
confirms the selection (inside the same control path where selectionLoop would
not continue), ensuring the mutation occurs only for confirmed vLLM selections
(reference localInference.applyVllmRuntimeContextWindow,
validateOpenAiLikeSelection, NEMOCLAW_CONTEXT_WINDOW, and selectionLoop).

---

Nitpick comments:
In `@test/onboard-selection.test.ts`:
- Around line 570-577: The test currently sets NEMOCLAW_CONTEXT_WINDOW to an
empty string in the child process env, which differs from an unset variable;
modify the test to construct the env object (spreading process.env, HOME:
tmpDir, PATH: `${fakeBin}:${process.env.PATH || ""}`, etc.), then explicitly
remove the key (delete env.NEMOCLAW_CONTEXT_WINDOW) before passing it to the
child process so the child sees the variable as unset; update the code around
the env construction in onboard-selection.test.ts (look for the env object
creation referencing tmpDir and fakeBin) to perform the delete rather than
assigning "".
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b5f4aeca-8945-41a4-b526-9d59a52e174e

📥 Commits

Reviewing files that changed from the base of the PR and between f17a19a and efcbe47.

📒 Files selected for processing (5)
  • src/lib/inference/local.ts
  • src/lib/inference/vllm-runtime-context.test.ts
  • src/lib/inference/vllm-runtime-context.ts
  • src/lib/onboard.ts
  • test/onboard-selection.test.ts

Comment thread src/lib/onboard.ts Outdated
@cv

cv commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator

@zyang-dev can you verify the coderabbit/pr advisor findings, please?

@cv cv added v0.0.58 Release target and removed v0.0.57 Release target labels Jun 3, 2026
@zyang-dev zyang-dev added the provider: vllm vLLM local or hosted provider behavior label Jun 3, 2026
Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
@zyang-dev

Copy link
Copy Markdown
Contributor Author

@cv addressed coderabbit/pr advisor findings.

@cv cv merged commit ac5b144 into main Jun 3, 2026
31 checks passed
@cv cv deleted the fix/vllm-runtime-context-window branch June 3, 2026 00:59
@wscurran wscurran added the bug-fix PR fixes a bug or regression label Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix PR fixes a bug or regression provider: vllm vLLM local or hosted provider behavior v0.0.58 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants