Skip to content

fix(agents): count streamed model deltas incrementally#87856

Merged
vincentkoc merged 4 commits into
mainfrom
codex/model-stream-diagnostics-delta
Jun 6, 2026
Merged

fix(agents): count streamed model deltas incrementally#87856
vincentkoc merged 4 commits into
mainfrom
codex/model-stream-diagnostics-delta

Conversation

@vincentkoc

@vincentkoc vincentkoc commented May 29, 2026

Copy link
Copy Markdown
Member

Summary

  • Count streamed text_delta/thinking_delta/toolcall_delta diagnostics from their incremental delta bytes instead of serializing the full event.
  • Drop growing partial snapshots from fallback stream byte accounting so diagnostics do not repeatedly walk the accumulated assistant message.
  • Add regression coverage proving text-delta accounting does not serialize large partial snapshots and keeps streams alive if byte inspection throws.
  • Document the bounded responseStreamBytes semantics for high-frequency streamed deltas and align OTEL docs/metric descriptions with that bounded payload contract.

Refs #86599.

Verification

  • Rebasing onto current origin/main completed on head c1783a4997e.
  • node scripts/check-docs-mdx.mjs docs/gateway/opentelemetry.md docs/logging.md: passed.
  • node_modules/.bin/oxfmt --check --threads=1 docs/gateway/opentelemetry.md docs/logging.md extensions/diagnostics-otel/src/service.ts src/agents/embedded-agent-runner/run/attempt.model-diagnostic-events.test.ts src/agents/embedded-agent-runner/run/attempt.model-diagnostic-events.ts: passed.
  • node scripts/run-oxlint.mjs extensions/diagnostics-otel/src/service.ts src/agents/embedded-agent-runner/run/attempt.model-diagnostic-events.test.ts src/agents/embedded-agent-runner/run/attempt.model-diagnostic-events.ts: passed.
  • git diff --check origin/main...HEAD: passed.
  • node scripts/run-vitest.mjs src/agents/embedded-agent-runner/run/attempt.model-diagnostic-events.test.ts extensions/diagnostics-otel/src/service.test.ts --reporter=dot: passed, 2 shards / 86 tests.
  • AWS Crabbox pnpm check:changed: passed, provider=aws, lease=cbx_fc87d734de03, run=run_dafa0367419e, exit 0; selected lanes core, coreTests, extensions, extensionTests, docs.

Real behavior proof

Behavior addressed: dense model streams no longer make model-call diagnostics serialize the full growing partial assistant message on every delta.

Real environment tested: focused local Vitest wrapper in a linked OpenClaw gwt worktree, plus AWS Crabbox Linux changed-gate proof.

Exact steps or command run after this patch: node scripts/run-vitest.mjs src/agents/embedded-agent-runner/run/attempt.model-diagnostic-events.test.ts extensions/diagnostics-otel/src/service.test.ts --reporter=dot; docs mdx, oxfmt, oxlint, diff-check; node scripts/crabbox-wrapper.mjs run --provider aws --label diagnostic-delta-accounting-check-changed --shell -- "pnpm check:changed".

Evidence after fix: focused tests passed, including assertions that text_delta accounting does not call partial.toJSON, counts incremental delta bytes, and keeps stream delivery alive when byte inspection cannot read a chunk. AWS check:changed passed the selected core, core test, extension, extension test, and docs lanes.

Observed result after fix: responseStreamBytes is computed from incremental delta bytes for high-frequency stream deltas and from snapshotless chunks for other partial-bearing stream events, while OTEL metadata and docs describe the exported histogram as bounded streamed response payload bytes.

What was not tested: a live long-running local-model session.

@vincentkoc vincentkoc self-assigned this May 29, 2026
@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: S maintainer Maintainer-authored PR labels May 29, 2026
@clawsweeper

clawsweeper Bot commented May 29, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs maintainer review before merge. Reviewed June 6, 2026, 10:08 AM ET / 14:08 UTC.

Summary
The PR changes embedded-agent model-call diagnostics to count text, thinking, and tool-call stream deltas from incremental delta bytes, updates focused tests, and aligns logging/OTEL docs with the bounded byte metric semantics.

PR surface: Source +18, Tests -11, Docs +2. Total +9 across 5 files.

Reproducibility: yes. from source inspection: current main serializes the snapshotless delta event shape for response byte accounting, while the PR head counts the incremental delta string. I did not run a live long-running local-model reproduction in this read-only review.

Review metrics: 1 noteworthy metric.

  • Exported byte metric semantics: 1 exported response-byte metric baseline changed. The runtime field feeds OTEL histograms and plugin hooks, so maintainers should accept the lower dense-delta values before merge.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🌊 off-meta tidepool
Patch quality: 🦞 diamond lobster
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • [P2] Get maintainer acceptance that the bounded response-byte metric semantics are the desired upgrade behavior before merge.

Risk before merge

  • [P1] responseStreamBytes and openclaw.model_call.response_bytes will drop for dense delta streams, so dashboards, alert thresholds, snapshots, or plugin hooks that treated the previous event-shape estimate as their baseline need maintainer acceptance.
  • [P1] The PR body explicitly notes that a live long-running local-model session was not tested; focused source coverage and the AWS changed gate support the accounting change, but they do not prove the full Windows local-provider starvation repro.

Maintainer options:

  1. Accept Bounded Metric Semantics (recommended)
    Land the PR if maintainers are comfortable making responseStreamBytes and openclaw.model_call.response_bytes measure incremental streamed payload bytes for dense deltas.
  2. Preserve Historical Metric Baseline
    If existing dashboards or plugin hooks need the old event-shape estimate, revise before merge by keeping that value separately or documenting a migration path for observability consumers.

Next step before merge

  • [P2] Protected maintainer PR with a compatibility-sensitive exported metric semantic change; the next action is maintainer acceptance rather than an automated repair.

Security
Cleared: No security or supply-chain concern found; the patch touches diagnostic byte accounting, tests, and docs without changing dependencies, workflows, secrets, permissions, or raw content export.

Review details

Best possible solution:

Land this after maintainers explicitly accept the bounded response-byte metric semantics, or preserve the old estimate under a separate metric if existing observability consumers need it.

Do we have a high-confidence way to reproduce the issue?

Yes from source inspection: current main serializes the snapshotless delta event shape for response byte accounting, while the PR head counts the incremental delta string. I did not run a live long-running local-model reproduction in this read-only review.

Is this the best way to solve the issue?

Yes, this is the best scoped fix for the diagnostic accounting path: it changes the central embedded-stream diagnostic wrapper and updates the exported docs/OTEL description using the existing stream event contract. The remaining question is compatibility acceptance for the changed metric baseline, not a better code location.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 5b84ebfc56c9.

Label changes

Label changes:

  • add rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🌊 off-meta tidepool and patch quality is 🦞 diamond lobster.
  • remove rating: 🦞 diamond lobster: Current PR rating is rating: 🐚 platinum hermit, so this older rating label is no longer current.

Label justifications:

  • P1: This PR targets the P1 dense local-provider stream responsiveness work by removing repeated diagnostic serialization from hot model stream deltas.
  • merge-risk: 🚨 compatibility: Merging changes the observable byte values for responseStreamBytes and openclaw.model_call.response_bytes, which can affect existing dashboards, thresholds, and plugin hook consumers.
  • rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🌊 off-meta tidepool and patch quality is 🦞 diamond lobster.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Not applicable: The external-contributor real-proof gate is not applicable to this maintainer PR; the PR body still records focused tests, docs/lint checks, and an AWS Crabbox changed gate, with a live long-running local-model session left untested.
Evidence reviewed

PR surface:

Source +18, Tests -11, Docs +2. Total +9 across 5 files.

View PR surface stats
Area Files Added Removed Net
Source 2 21 3 +18
Tests 1 2 13 -11
Docs 2 4 2 +2
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 5 27 18 +9

What I checked:

Likely related people:

  • vincentkoc: Authored the PR commits for this diagnostic accounting change and also carried the related merged Ollama dense-stream yield fix referenced from the same root issue. (role: recent area contributor and current PR implementer; confidence: high; commits: 311f49ad5adb, 347f5430ac5c, c7b190beec73; files: src/agents/embedded-agent-runner/run/attempt.model-diagnostic-events.ts, src/agents/embedded-agent-runner/run/attempt.model-diagnostic-events.test.ts, docs/logging.md)
  • joshavant: Current-main blame in this checkout points the existing diagnostic accounting helper, OTEL metric description, and related docs to a broad commit by this author, so they are a useful routing signal for the current baseline. (role: current-main source provenance; confidence: medium; commits: fbaa5a6f0a57; files: src/agents/embedded-agent-runner/run/attempt.model-diagnostic-events.ts, extensions/diagnostics-otel/src/service.ts, docs/logging.md)
  • osolmaz: Provided maintainer context that this PR is split out from the broader dense-stream work and should preserve the reviewer and attribution history from the closed broader branch. (role: related split-stack contributor; confidence: medium; commits: aac174b329d4; files: src/agents/embedded-agent-runner/run/attempt.model-diagnostic-events.ts, src/agents/runtime/proxy.ts, src/plugin-sdk/provider-stream-shared.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added the rating: 🌊 off-meta tidepool PR readiness rating does not apply to this item. label May 29, 2026
@vincentkoc vincentkoc force-pushed the codex/model-stream-diagnostics-delta branch from e5c4273 to 79c1a48 Compare May 29, 2026 03:37
@vincentkoc vincentkoc marked this pull request as ready for review May 29, 2026 03:38
@clawsweeper clawsweeper Bot added rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. P1 High-priority user-facing bug, regression, or broken workflow. merge-risk: 🚨 compatibility 🚨 May break existing users, config, migrations, defaults, or upgrade paths. and removed rating: 🌊 off-meta tidepool PR readiness rating does not apply to this item. labels May 29, 2026
@clawsweeper clawsweeper Bot added rating: 🌊 off-meta tidepool PR readiness rating does not apply to this item. and removed rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. labels May 29, 2026
@openclaw-barnacle openclaw-barnacle Bot added the docs Improvements or additions to documentation label May 29, 2026
@vincentkoc vincentkoc force-pushed the codex/model-stream-diagnostics-delta branch 2 times, most recently from 8e88916 to 93683b4 Compare May 29, 2026 19:23
@clawsweeper clawsweeper Bot added rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. proof: sufficient ClawSweeper judged the real behavior proof convincing. and removed rating: 🌊 off-meta tidepool PR readiness rating does not apply to this item. labels May 29, 2026
@vincentkoc vincentkoc force-pushed the codex/model-stream-diagnostics-delta branch 4 times, most recently from d025738 to 8e7c6b7 Compare May 29, 2026 20:27
@clawsweeper clawsweeper Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 31, 2026
@openclaw-barnacle openclaw-barnacle Bot added the gateway Gateway runtime label May 31, 2026
@clawsweeper clawsweeper Bot removed the status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. label May 31, 2026
@clawsweeper clawsweeper Bot added the status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. label May 31, 2026
@vincentkoc

Copy link
Copy Markdown
Member Author

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented May 31, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. and removed rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. labels May 31, 2026
@vincentkoc vincentkoc force-pushed the codex/model-stream-diagnostics-delta branch 2 times, most recently from 1888fdf to c1783a4 Compare June 1, 2026 02:05
@clawsweeper clawsweeper Bot added rating: 🦞 diamond lobster Very strong PR readiness with only minor maintainer review expected. and removed rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. labels Jun 1, 2026
@osolmaz

osolmaz commented Jun 1, 2026

Copy link
Copy Markdown
Member

Attribution note: this diagnostics PR is one of the split-out pieces from the broader #87558 / #86599 dense-stream work. #87558 carried the same production direction: keep hot-path stream accounting proportional to new deltas instead of repeatedly walking full growing partial snapshots.

#87558 is being closed in favor of the smaller split PR stack. Please keep the #87558 context linked for attribution and reviewer history.

@vincentkoc vincentkoc force-pushed the codex/model-stream-diagnostics-delta branch from 9fd5ad3 to 347f543 Compare June 6, 2026 14:01
@clawsweeper clawsweeper Bot added rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. and removed rating: 🦞 diamond lobster Very strong PR readiness with only minor maintainer review expected. labels Jun 6, 2026
@vincentkoc

Copy link
Copy Markdown
Member Author

Land-ready verification for head 347f5430ac5c26c1de11c902b4db252660d9df92.

  • Local focused proof: node scripts/run-vitest.mjs run src/agents/embedded-agent-runner/run/attempt.model-diagnostic-events.test.ts -> 1 file passed, 13 tests passed.
  • CI proof: GitHub Actions run 27064304709 is green; CodeQL run 27064304710 is green; OpenGrep PR diff run 27064304716 is green.
  • Maintainer decision: accepting the intentional diagnostics baseline change noted by ClawSweeper. responseStreamBytes / openclaw.model_call.response_bytes should count incremental streamed deltas instead of the prior event-shape estimate for dense streams.
  • Proof gap: no live long-running local-model session was rerun in this landing pass; the regression is covered by focused diagnostics tests plus the green PR matrix.

@vincentkoc vincentkoc merged commit 4af444a into main Jun 6, 2026
168 of 170 checks passed
@vincentkoc vincentkoc deleted the codex/model-stream-diagnostics-delta branch June 6, 2026 14:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling docs Improvements or additions to documentation extensions: diagnostics-otel Extension: diagnostics-otel gateway Gateway runtime maintainer Maintainer-authored PR merge-risk: 🚨 compatibility 🚨 May break existing users, config, migrations, defaults, or upgrade paths. P1 High-priority user-facing bug, regression, or broken workflow. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. size: XS status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants