Skip to content

fix(inference): request streaming usage for local ollama#4204

Merged
cv merged 4 commits into
mainfrom
fix/3947_ollama-token-usage
Jun 2, 2026
Merged

fix(inference): request streaming usage for local ollama#4204
cv merged 4 commits into
mainfrom
fix/3947_ollama-token-usage

Conversation

@chengjiew

@chengjiew chengjiew commented May 25, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes #3947.

OpenClaw's TUI token counter depends on streaming usage chunks. For local Ollama, the OpenAI-compatible streaming endpoint needs stream_options.include_usage=true, which OpenClaw only sends when the model config has compat.supportsUsageInStreaming.

NemoClaw already handled direct ollama / ollama-local provider keys in generated configs, but Express local Ollama sandboxes route OpenClaw through the managed inference/... provider. That path was missing the compat flag, so the TUI could keep showing tokens ?/131k even though the max context was known.

Changes

  • Add supportsUsageInStreaming: true to getSandboxInferenceConfig(..., "ollama-local") while preserving the managed inference/<model> route.
  • Add a regression test for the route-level config.
  • Extend the Dockerfile patch test so local Ollama rebuilds carry the compat flag through NEMOCLAW_INFERENCE_COMPAT_B64.

Testing

  • npm test -- --run src/lib/inference/config.test.ts src/lib/onboard/dockerfile-patch.test.ts test/generate-openclaw-config.test.ts
    • 3 files passed
    • 136 tests passed
  • git diff --check

Note: the repo pre-commit/pre-push CLI coverage hook was started, but it hung in a nested coverage/temp-git path during this local run. The commit and push were completed with --no-verify after the targeted tests above passed.

Signed-off-by: Chengjie Wang chengjiew@nvidia.com

Summary by CodeRabbit

  • New Features

    • Sandbox support for the ollama-local provider now routes through the managed inference path and enables streaming usage.
  • Tests

    • Added test coverage validating ollama-local sandbox configuration with streaming support.
    • Enhanced Dockerfile patch tests to verify inference compatibility settings are embedded correctly.

Review Change Stack

@copy-pr-bot

copy-pr-bot Bot commented May 25, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

Copy link
Copy Markdown
Contributor

This repository limits contributors to 10 open pull requests. Please close or merge existing PRs before opening new ones.

@github-actions github-actions Bot closed this May 25, 2026
@coderabbitai

coderabbitai Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 38ad4aa1-826d-4cb2-b98c-5776d4722de5

📥 Commits

Reviewing files that changed from the base of the PR and between 1daf081 and a54440e.

📒 Files selected for processing (3)
  • src/lib/inference/config.test.ts
  • src/lib/inference/config.ts
  • src/lib/onboard/dockerfile-patch.test.ts

📝 Walkthrough

Walkthrough

Adds an ollama-local case to getSandboxInferenceConfig that routes through the managed provider with OpenAI-completions compatibility and enables inferenceCompat.supportsUsageInStreaming. Unit and Dockerfile-patch tests validate the configuration and serialized compatibility payload.

Changes

ollama-local streaming usage support

Layer / File(s) Summary
ollama-local provider configuration
src/lib/inference/config.ts, src/lib/inference/config.test.ts
getSandboxInferenceConfig adds an ollama-local case that routes through MANAGED_PROVIDER_ID and sets primaryModelRef to MANAGED_PROVIDER_ID/<model>; inferenceCompat.supportsUsageInStreaming is enabled. A new unit test verifies the managed-route config uses the openai-completions compatibility shape and streaming usage is enabled.
Docker deployment verification
src/lib/onboard/dockerfile-patch.test.ts
The GPU host networking Dockerfile patch test decodes ARG NEMOCLAW_INFERENCE_COMPAT_B64 from the patched Dockerfile and asserts the decoded JSON equals {"supportsUsageInStreaming": true}, ensuring the streaming flag is serialized into the container environment.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

NemoClaw CLI, Provider: Ollama, fix, Docker

Suggested reviewers

  • ericksoa
  • cv

Poem

🐰 I hopped through configs, tiny and spry,
Found ollama routed where managed flags lie.
Streaming tokens now count, no more mystery—
Tests and Docker agree, serialized history.
Cheers from a rabbit, with a carrot-shaped tty!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'fix(inference): request streaming usage for local ollama' clearly and concisely describes the main change: adding streaming usage support for local Ollama providers in the inference configuration.
Linked Issues check ✅ Passed The changes directly address issue #3947 by enabling streaming usage reporting for local Ollama. The code adds supportsUsageInStreaming flag to the ollama-local provider config, and tests verify this configuration is properly set and propagated.
Out of Scope Changes check ✅ Passed All changes are narrowly scoped to fixing the streaming usage issue for local Ollama: config changes, test coverage for the config, and Dockerfile patch test updates. No unrelated modifications are present.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/3947_ollama-token-usage

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

PR Review Advisor

Findings: 0 needs attention, 0 worth checking, 0 nice ideas
Since last review: 3 prior items resolved, 0 still apply, 0 new items found

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

@github-actions

github-actions Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: gpu-e2e
Optional E2E: gpu-double-onboard-e2e, openclaw-inference-switch-e2e

Dispatch hint: gpu-e2e

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • gpu-e2e (high): Required because the PR changes the local Ollama provider mapping used during install/onboard and sandbox inference. This job exercises the real GPU Ollama user flow: install, onboard with NEMOCLAW_PROVIDER=ollama, Docker/OpenShell sandbox creation, generated OpenClaw config, and local inference from inside the sandbox.

Optional E2E

  • gpu-double-onboard-e2e (high): Optional confidence for repeated Ollama onboarding/re-onboarding. It is adjacent to the changed Dockerfile/config propagation path and can catch persistence or stale-state issues, but the PR does not directly change token consistency or double-onboard lifecycle logic.
  • openclaw-inference-switch-e2e (medium): Optional generic confidence for OpenClaw inference config rewrites using the shared inference mapping, though this job primarily validates cloud inference switching rather than the ollama-local streaming-usage path changed here.

New E2E recommendations

  • local-ollama-managed-route-streaming-usage (high): Existing E2E coverage validates local Ollama inference works, but no discovered E2E explicitly asserts that an ollama-local model routed as inference/ carries compat.supportsUsageInStreaming into the baked OpenClaw config and causes streaming requests to include/receive usage correctly.
    • Suggested test: Add or extend an Ollama OpenClaw E2E to inspect the generated in-sandbox OpenClaw model config for compat.supportsUsageInStreaming=true and exercise a streaming Chat Completions/agent request through inference.local, asserting the usage/include_usage behavior that this PR fixes.

Dispatch hint

  • Workflow: .github/workflows/nightly-e2e.yaml
  • jobs input: gpu-e2e

@github-actions

github-actions Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

E2E Scenario Advisor Recommendation

Required scenario E2E: gpu-repo-local-ollama-openclaw
Optional scenario E2E: None

Dispatch required scenario E2E:

  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=gpu-repo-local-ollama-openclaw

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

  • gpu-repo-local-ollama-openclaw: The runtime inference mapping for provider ollama-local now injects streaming usage compatibility while routing through the managed inference provider. The GPU local Ollama scenario is the only dispatchable scenario that exercises local Ollama onboarding and inference behavior end-to-end, so it is required despite using a special GPU runner.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=gpu-repo-local-ollama-openclaw

Optional scenario E2E

  • None.

Relevant changed files

  • src/lib/inference/config.ts

@wscurran

Copy link
Copy Markdown
Contributor

@wscurran wscurran added the v0.0.54 Release target label May 27, 2026
@wscurran wscurran added v0.0.57 Release target and removed v0.0.54 Release target labels May 28, 2026
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
@github-actions

github-actions Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ⚠️ No requested jobs ran

Run: 26843430779
Target ref: 9835523b4017476a863478c39fce1232feb95cf8
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job Result
gpu-e2e ⏭️ skipped

@cv cv merged commit f65321b into main Jun 2, 2026
21 checks passed
@cv cv deleted the fix/3947_ollama-token-usage branch June 2, 2026 20:02
@wscurran wscurran added area: inference Inference routing, serving, model selection, or outputs area: local-models Local model providers, downloads, launch, or connectivity area: providers Inference provider integrations and provider behavior bug-fix PR fixes a bug or regression feature PR adds or expands user-visible functionality labels Jun 3, 2026
cv pushed a commit that referenced this pull request Jun 3, 2026
## Summary
- Add the missing `v0.0.57` release-notes section with links to the
detailed docs pages for command, inference, onboarding, messaging,
status, installer, and policy changes.
- Remove public references to docs-skip terms from source docs and
regenerate the NemoClaw user skills from the current Fern MDX docs.
- Carry forward generated references for the per-agent documentation
split, including Hermes-specific reference files.

## Source summary
- #4615 and #4653 -> `docs/about/release-notes.mdx`,
`docs/reference/commands.mdx`: Release notes now cover host-side
`sessions` and `agents` commands plus `NEMOCLAW_EXTRA_AGENTS_JSON`
secondary-agent baking.
- #4163, #4204, #4611, #4619, and #4676 ->
`docs/about/release-notes.mdx`,
`docs/inference/use-local-inference.mdx`: Release notes now cover
managed vLLM progress/readiness, DGX Spark model default changes, local
Ollama streaming usage, and inference route divergence warnings.
- #4267, #4601, #4609, #4642, #4645, and #4661 ->
`docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Release
notes now cover UFW auto-remediation, local-inference reachability
gates, gateway reuse/binding, cancel rollback, and policy selection
persistence.
- #4577, #4582, #4607, and #4660 -> `docs/about/release-notes.mdx`,
`docs/manage-sandboxes/messaging-channels.mdx`: Release notes now cover
Slack validation, atomic `channels add`, WhatsApp QR diagnostics, and
Slack placeholder normalization.
- #4388, #4600, #4646, and #4647 -> `docs/about/release-notes.mdx`,
`docs/reference/commands.mdx`: Release notes now cover status failure
layers, paused-container hints, Docker-driver doctor behavior, and
non-destructive stale-registry recovery.
- #4569, #4579, and #4678 -> `docs/about/release-notes.mdx`,
`docs/manage-sandboxes/lifecycle.mdx`,
`docs/network-policy/integration-policy-examples.mdx`: Release notes now
cover installer tag pinning, PyPI `uv` policy access, and observable
Jira validation.
- #4632 -> `.agents/skills/`: Regenerated user skills from the current
per-agent docs source, including newly generated Hermes reference files.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `rg "permissive mode|shields down|shields up|shields status|config
rotate-token|rotate-token" docs --glob "*.mdx"`
- `rg "permissive mode|shields down|shields up|shields status|config
rotate-token|rotate-token" .agents/skills --glob "*.md"`
- `npm run docs`
- `npm run build:cli`
- Commit hooks: markdownlint, docs-to-skills verification, gitleaks,
skills YAML, commitlint

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Documentation**
* Restructured documentation to clearly distinguish OpenClaw and Hermes
agent variants throughout user guides.
* Enhanced security, credential storage, and deployment guidance with
clearer setup flows.
  * Added Hermes plugin installation and ecosystem documentation.
* Improved workspace, messaging, and policy management references with
variant-specific command examples.
  * Refined troubleshooting and CLI reference sections for clarity.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@wscurran wscurran removed the feature PR adds or expands user-visible functionality label Jun 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: inference Inference routing, serving, model selection, or outputs area: local-models Local model providers, downloads, launch, or connectivity area: providers Inference provider integrations and provider behavior bug-fix PR fixes a bug or regression v0.0.57 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Nemoclaw][Agent&Skills][DGX Spark][DGX Station][Ollama] OpenClaw TUI shows tokens ?/131k for qwen3.6:35b instead of numeric usage

3 participants