fix(inference): enable streamed usage accounting for Ollama-local (#2747) by nvshaxie · Pull Request #3678 · NVIDIA/NemoClaw

nvshaxie · 2026-05-18T06:09:40Z

Summary

model.compat.supportsUsageInStreaming is now set when the build's NEMOCLAW_PROVIDER_KEY is ollama or ollama-local, so OpenClaw sends stream_options.include_usage: true and the TUI token counter updates after each turn instead of staying ?.
Existing explicit supportsUsageInStreaming values in the incoming inference compat blob take precedence, leaving room for a future model-specific-setup manifest to opt models out.
Regression test added in test/generate-openclaw-config.test.ts.
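
The precedence rule described above can be sketched as follows. This is a minimal illustration, not the actual contents of `scripts/generate-openclaw-config.py`; the function name `apply_ollama_usage_compat` and the constant `OLLAMA_PROVIDER_KEYS` are hypothetical names introduced here.

```python
from typing import Any, Dict, Optional

# Hypothetical constant: provider keys that get the flag by default.
OLLAMA_PROVIDER_KEYS = {"ollama", "ollama-local"}


def apply_ollama_usage_compat(
    provider_key: str, inference_compat: Optional[Dict[str, Any]]
) -> Optional[Dict[str, Any]]:
    """Default supportsUsageInStreaming to True for Ollama provider keys."""
    if provider_key not in OLLAMA_PROVIDER_KEYS:
        # Other providers: pass the incoming compat blob through untouched.
        return inference_compat
    compat = dict(inference_compat or {})
    # setdefault preserves an explicit True/False already present in the blob.
    compat.setdefault("supportsUsageInStreaming", True)
    return compat
```

Using `setdefault` rather than plain assignment is what lets an explicit `supportsUsageInStreaming: false` in the incoming blob survive.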

Root cause

OpenClaw's Ollama extension already enables streamed usage accounting if its detector (isOllamaCompatProvider in extensions/ollama/src/stream.ts upstream) recognises the endpoint. The detector requires one of: model.provider === "ollama" (exact match after normalization), or baseUrl pointing at localhost:11434, or a provider id that contains ollama reaching port 11434 on a /v1 path.
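
To make the three positive paths concrete, here is a rough Python approximation of the detector as described above. The upstream implementation is TypeScript and may differ in detail; the helper name `looks_like_ollama` is hypothetical, and treating "normalization" as trim plus lowercase is an assumption.

```python
from urllib.parse import urlparse


def looks_like_ollama(provider: str, base_url: str) -> bool:
    """Approximate the three positive detector paths (hypothetical helper)."""
    # Assumption: normalization means trimming whitespace and lowercasing.
    normalized = provider.strip().lower()
    if normalized == "ollama":
        return True
    parsed = urlparse(base_url)
    host = parsed.hostname or ""
    port = parsed.port  # None when the URL carries no explicit port
    if host == "localhost" and port == 11434:
        return True
    path = parsed.path.rstrip("/")
    # Provider id containing "ollama", reaching port 11434 on a /v1 path.
    return "ollama" in normalized and port == 11434 and path.endswith("/v1")
```

Under these assumptions, `looks_like_ollama("ollama-local", "https://inference.local/v1")` returns `False`: the provider id is not an exact match, the host is not `localhost`, and the URL has no port 11434.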

NemoClaw routes ollama-local traffic via the standardised https://inference.local/v1 URL through the OpenShell gateway. The base URL is loopback-equivalent inside the sandbox but the hostname is inference.local and the port is 443, so none of the detector's positive paths match. Without that opt-in, OpenClaw omits stream_options.include_usage, Ollama's /v1/chat/completions stream omits the trailing usage chunk, OpenClaw's parseTransportChunkUsage never runs, and model.compat.supportsUsageInStreaming stays false. End result: the TUI's tokens X/Y field never moves off ?.
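
The gating step in that chain can be illustrated with a small sketch. It is not OpenClaw's code; `build_chat_request` is a hypothetical helper that only shows how a compat flag might decide whether `stream_options.include_usage` is added to an OpenAI-compatible streaming request body.

```python
from typing import Any, Dict, List, Optional


def build_chat_request(
    model_id: str,
    messages: List[Dict[str, str]],
    compat: Optional[Dict[str, Any]],
) -> Dict[str, Any]:
    """Build a streaming chat-completions body, gated on the compat flag."""
    body: Dict[str, Any] = {"model": model_id, "messages": messages, "stream": True}
    # Only ask for the trailing usage chunk when the model opts in.
    if (compat or {}).get("supportsUsageInStreaming") is True:
        body["stream_options"] = {"include_usage": True}
    return body
```

With `compat=None` the body carries no `stream_options` key, which corresponds to the case where the server never sends a usage chunk.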

Setting supportsUsageInStreaming: true directly in the compat block bypasses the detector entirely and forces the request to ask for the usage chunk. This mirrors the LM Studio extension's withLmstudioUsageCompat workaround (extensions/lmstudio/src/stream.ts upstream), which addresses the same class of bug for LM Studio's OpenAI-compatible streaming.

Test plan

npx vitest run test/generate-openclaw-config.test.ts -t "supportsUsageInStreaming" — 3 new cases pass:
- flag set for ollama and ollama-local provider keys
- flag absent for openai, anthropic, vllm, nim-local
- explicit supportsUsageInStreaming: false in the incoming compat blob is preserved
npx vitest run test/generate-openclaw-config.test.ts — 70/70 pass, no regressions

Fixes #2747.

Signed-off-by: Shawn Xie <shaxie@nvidia.com>

Summary by CodeRabbit

New Features
- OpenClaw streaming usage support is now automatically enabled when using Ollama and Ollama-local providers, so the TUI token counter can report usage after each turn.
Tests
- Added comprehensive tests validating that streaming usage support is properly enabled for Ollama providers, remains unaffected for non-Ollama providers, and respects explicit configurations.

Ollama's OpenAI-compatible /v1/chat/completions stream omits the `usage` chunk by default; OpenAI clients have to send `stream_options.include_usage: true` to receive it. OpenClaw already gates that request flag on `model.compat.supportsUsageInStreaming` (see `src/agents/openai-transport-stream.ts` upstream), and the LM Studio extension force-enables it via `withLmstudioUsageCompat` to work around the same usage-chunk omission.

OpenClaw's own Ollama extension only opts in when its detector recognises the endpoint as Ollama (`extensions/ollama/src/stream.ts` -> `isOllamaCompatProvider`). NemoClaw routes ollama-local traffic via the standardised `https://inference.local/v1` URL through the OpenShell gateway, so the upstream detector misses it and the OpenClaw TUI token counter stays `?` for every turn.

Set `supportsUsageInStreaming` directly on the model compat block when the build's `NEMOCLAW_PROVIDER_KEY` is one of `ollama` or `ollama-local` (the same set already tracked by `_bundled_provider_plugins["ollama"]`). Cloud providers and other local backends are unaffected. Existing explicit `supportsUsageInStreaming` values in the incoming inference compat blob (e.g. from a future model-specific-setup manifest) take precedence.

Test plan
- `npx vitest run test/generate-openclaw-config.test.ts`

copy-pr-bot · 2026-05-18T06:09:43Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-05-18T06:09:52Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 03a134fa-972d-48dd-b0ec-8fb409e289e1

📥 Commits

Reviewing files that changed from the base of the PR and between 32dbdc5 and d32d8eb.

📒 Files selected for processing (2)

scripts/generate-openclaw-config.py
test/generate-openclaw-config.test.ts

📝 Walkthrough

The PR adds provider-specific logic to force-enable streaming usage support for Ollama providers in the configuration generator and validates this behavior with three test cases covering default enablement, non-Ollama exclusion, and explicit override precedence.

Changes

OpenClaw Ollama Streaming Usage Flag

| Layer / File(s) | Summary |
| --- | --- |
| Force-enable streaming usage for Ollama providers `scripts/generate-openclaw-config.py` | After model-specific setup effects, `inference_compat["supportsUsageInStreaming"]` is set to `True` for `ollama` and `ollama-local` provider keys, enabling OpenClaw to request streaming usage information from Ollama servers. |
| Validate streaming usage flag behavior `test/generate-openclaw-config.test.ts` | Three test cases assert the flag is enabled for ollama/ollama-local, remains unset for non-Ollama providers, and respects explicit `supportsUsageInStreaming: false` configuration to prevent default override. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested labels

fix

Suggested reviewers

cv

Poem

🐰 A bunny hops through Ollama's stream,
Token counters no longer dream!
With flags set true and tests in place,
The TUI shows numbers—not a question-mark face! ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title 'fix(inference): enable streamed usage accounting for Ollama-local (`#2747`)' directly and specifically summarizes the main change: enabling streamed usage accounting for Ollama-local inference. |
| Linked Issues check | ✅ Passed | The PR fully addresses issue `#2747`'s requirements: enables streamed usage accounting for ollama-local via inference.local gateway, preserves explicit opt-out values, maintains cloud provider behavior, and provides comprehensive regression tests. |
| Out of Scope Changes check | ✅ Passed | All changes directly support the PR objective: script modifications implement the streaming usage flag for Ollama providers, and test additions validate the specific behavior required by issue `#2747`. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.

github-actions · 2026-05-18T06:11:23Z

E2E Advisor Recommendation

Required E2E: gpu-e2e
Optional E2E: kimi-inference-compat-e2e, inference-routing-e2e

Dispatch hint: gpu-e2e

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

gpu-e2e (high; requires GPU runner and pulls/runs a local Ollama model): Best existing end-to-end coverage for the impacted user flow: installs NemoClaw with NEMOCLAW_PROVIDER=ollama, onboards OpenClaw with local Ollama, creates a sandbox, and verifies real direct and sandbox-routed Ollama inference. This is the closest existing E2E to the changed OpenClaw Ollama config path.

Optional E2E

kimi-inference-compat-e2e (medium): Adjacent confidence for the same config-generation compatibility machinery: it exercises model compat mutation plus OpenClaw trajectory behavior through a hermetic OpenAI-compatible endpoint. The PR should not affect Kimi, so this is useful but not merge-blocking.
inference-routing-e2e (medium): Adjacent check for inference.local routing and provider error classification. It does not cover Ollama streaming usage, but can catch unintended regressions in generated inference provider routing for non-Ollama providers.

New E2E recommendations

ollama-streaming-usage-compat (high): Existing E2E coverage validates local Ollama inference but does not assert that OpenClaw sends stream_options.include_usage=true or that streaming usage/token counts are observed for ollama-local routed through inference.local. Add coverage that inspects the generated OpenClaw config and performs a streaming OpenAI-compatible Ollama request, or drives the OpenClaw/TUI path far enough to prove usage is no longer stuck as unknown.
- Suggested test: Add an Ollama streaming usage regression to test/e2e/test-gpu-e2e.sh or a new local-ollama-inference scenario suite that verifies models.providers..models[0].compat.supportsUsageInStreaming is true and that a streaming chat completion includes usage.
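
One way the "streaming chat completion includes usage" assertion could be checked is by scanning the server-sent-event lines of an OpenAI-compatible stream for a chunk that carries a `usage` object. The sketch below is an illustration only, not part of the existing E2E suite; `find_stream_usage` is a hypothetical helper and the sample lines are made up for the example.

```python
import json
from typing import Any, Dict, Iterable, Optional


def find_stream_usage(sse_lines: Iterable[str]) -> Optional[Dict[str, Any]]:
    """Return the last non-empty usage object seen in an SSE chat stream."""
    usage: Optional[Dict[str, Any]] = None
    for raw in sse_lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]
    return usage


# Made-up sample stream: one content chunk, then a trailing usage chunk.
sample = [
    'data: {"choices":[{"delta":{"content":"hi"}}]}',
    "",
    'data: {"choices":[],"usage":{"prompt_tokens":3,"completion_tokens":1,"total_tokens":4}}',
    "data: [DONE]",
]
assert find_stream_usage(sample) is not None
```

A regression test built on this idea would fail when the request omits `stream_options.include_usage`, because no chunk in the stream would then carry `usage`.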

Dispatch hint

Workflow: nightly-e2e.yaml
jobs input: gpu-e2e

## Summary

Updates the NemoClaw documentation for the v0.0.45 release by summarizing the user-facing changes merged since v0.0.44 and bumping the docs version metadata. Refreshes generated user skills so agent-facing references match the source docs.

## Changes

- Added v0.0.45 release notes covering onboarding recovery, local inference, channel cleanup, share mount diagnostics, uninstall cleanup, and security redaction updates.
- Updated command and troubleshooting docs for sandbox name limits, GPU gateway reuse, DNS preflight behavior, channel removal cleanup, and share mount path validation.
- Bumped docs version metadata to 0.0.45 and regenerated NemoClaw user skills from the docs.
- Source summary: #3672 -> `docs/reference/commands.md`: documented channel removal detaching bridge providers and un-applying channel policy presets.
- Source summary: #3678 -> `docs/about/release-notes.md`: documented Ollama streamed usage accounting in the release notes.
- Source summary: #3670 -> `docs/reference/commands.md`, `docs/reference/troubleshooting.md`: documented safe GPU gateway replacement behavior.
- Source summary: #3664 -> `docs/about/release-notes.md`: documented blueprint permission normalization in the release notes.
- Source summary: #3181 -> `docs/reference/troubleshooting.md`: documented GPU toolkit guidance when host drivers work but passthrough is disabled.
- Source summary: #3554 -> `docs/about/release-notes.md`: documented host `openshell-gateway` cleanup during uninstall.
- Source summary: #3651 -> `docs/reference/troubleshooting.md`: documented the uncached `.invalid` DNS preflight probe.
- Source summary: #3643 -> `docs/reference/commands.md`: included existing `NEMOCLAW_PROVIDER` interactive-mode behavior in generated docs.
- Source summary: #3647 -> `docs/reference/commands.md`: documented remote sandbox path verification for `share mount`.
- Source summary: #3646 -> `docs/reference/commands.md`: included existing local writable mount target guidance in generated docs.
- Source summary: #3642 -> `docs/inference/use-local-inference.md`, `docs/reference/commands.md`: documented managed-vLLM model override and gated-model token checks.
- Source summary: #3639 -> `docs/reference/commands.md`: documented the 63-character sandbox name limit.

## Type of Change

- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification

- [ ] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [x] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

Commit hooks passed for the staged files. A standalone `npx prek run --all-files` attempt was blocked by sandbox access to `/Users/miyoungc/.cache/prek/prek.log`, so that checkbox is left unchecked.

---

Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

## Summary by CodeRabbit

* **Documentation**
  * Enhanced CLI command reference documentation with clearer guidance on onboarding, GPU passthrough, inference configuration, channel removal, and shared mounts.
  * Improved troubleshooting sections with better DNS resolution and GPU passthrough remediation steps.
  * Added documentation for overriding managed vLLM model selection.
  * Updated release notes for v0.0.45 reflecting infrastructure and workflow improvements.
* **Version Bump**
  * Released v0.0.45.

…2747) (#3683)

## Summary

- Follow-up to #3678: that PR's Python-side conditional was dead code in production. `getSandboxInferenceConfig("model", "ollama-local")` falls through to the default switch arm and returns `providerKey = "inference"` (the same value all managed-inference providers use), so the build-time `NEMOCLAW_PROVIDER_KEY` is never `"ollama"` / `"ollama-local"` and the conditional never fired. The earlier vitest cases only passed because they set `NEMOCLAW_PROVIDER_KEY=ollama` directly, an env shape no real onboard produces.
- Move the wiring to the proper layer: add an explicit `case "ollama-local":` in `getSandboxInferenceConfig` that sets `inferenceCompat = { supportsUsageInStreaming: true }`. The existing `NEMOCLAW_INFERENCE_COMPAT_B64` round-trip then carries the flag into `model.compat` exactly as it does for `kimi-k2.6-managed-inference.json` and `compatible-endpoint`.
- Drop the now-redundant Python conditional in `scripts/generate-openclaw-config.py`, drop the three unrealistic-fixture vitest cases in `test/generate-openclaw-config.test.ts`, and add two new ones in `src/lib/inference/config.test.ts` that exercise the real provider name.

## Evidence the previous fix was inert

Direct run of `scripts/generate-openclaw-config.py` with the env values NemoClaw actually passes at docker build time for an ollama-local onboard (post-#3678 on main):

```
NEMOCLAW_PROVIDER_KEY=inference
NEMOCLAW_PRIMARY_MODEL_REF=inference/qwen2.5:0.5b
NEMOCLAW_INFERENCE_BASE_URL=https://inference.local/v1
NEMOCLAW_INFERENCE_API=openai-completions
NEMOCLAW_INFERENCE_COMPAT_B64=$(echo -n '{}' | base64)
```

generates an `openclaw.json` with `provider.models[0].compat = null`. The `if provider_key in {"ollama", "ollama-local"}:` conditional never matched, so the `supportsUsageInStreaming` flag was not set. TUI token counter would stay `?` on `nemoclaw onboard --provider ollama` builds in main today.

After this PR (same env, but TypeScript now also encodes `inferenceCompat = {"supportsUsageInStreaming": true}` into `NEMOCLAW_INFERENCE_COMPAT_B64` for ollama-local):

```
NEMOCLAW_INFERENCE_COMPAT_B64=$(echo -n '{"supportsUsageInStreaming":true}' | base64)
```

`openclaw.json` now has `provider.models[0].compat = {"supportsUsageInStreaming": true}` ✓.

## Test plan

- `npx vitest run src/lib/inference/config.test.ts`: 24/24 pass, including two new cases:
  - `forces supportsUsageInStreaming for ollama-local (#2747)`
  - `does not force supportsUsageInStreaming for non-ollama providers`
- `npx vitest run test/generate-openclaw-config.test.ts`: 67/67 pass (was 70/70 with the three deleted unrealistic cases)
- Manual script run with the prod-shaped env values listed above

Re-opens, and properly fixes, #2747.

Signed-off-by: Shawn Xie <shaxie@nvidia.com>

## Summary by CodeRabbit

* **Refactor**
  * Adjusted provider configuration handling: explicit support for streaming usage reporting for the local Ollama provider and removal of a legacy compatibility override.
* **Tests**
  * Updated tests to cover provider-specific streaming compatibility and to verify model-setup plugin activation only when relevant environment overrides are present.

---------

Signed-off-by: Shawn Xie <shaxie@nvidia.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
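
The base64 round-trip that carries the compat blob can be sketched in a few lines. This is an illustration under assumptions, not the NemoClaw code: `encode_compat` and `decode_compat` are hypothetical helpers, and mapping an empty blob to `None` is inferred from the `compat = null` result reported above.

```python
import base64
import json
from typing import Any, Dict, Optional


def encode_compat(compat: Dict[str, Any]) -> str:
    """Encode a compat dict the way an env var value might carry it."""
    raw = json.dumps(compat, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def decode_compat(env_value: str) -> Optional[Dict[str, Any]]:
    """Decode the env var value; an empty blob yields None (assumption)."""
    if not env_value:
        return None
    decoded = json.loads(base64.b64decode(env_value).decode("utf-8"))
    return decoded or None
```

Under these assumptions, an empty `{}` blob decodes to `None`, while the blob produced for ollama-local decodes to a dict with `supportsUsageInStreaming` set to `True`.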

nvshaxie requested review from cv and ericksoa May 18, 2026 06:09

nvshaxie mentioned this pull request May 18, 2026

[Station][Inference] TUI token counter shows ? after successful ollama-local inference instead of actual token usage #2747

Closed

cv approved these changes May 18, 2026

View reviewed changes

cv merged commit 50ad764 into main May 18, 2026
21 checks passed

nvshaxie mentioned this pull request May 18, 2026

fix(inference): set compat.supportsUsageInStreaming for ollama-local (#2747) #3683

Merged

miyoungc mentioned this pull request May 18, 2026

docs: update release notes for v0.0.45 #3755

Merged

12 tasks

wscurran added the bug-fix PR fixes a bug or regression label Jun 8, 2026

cv merged 1 commit into main from fix/ollama-stream-usage-2747
