Skip to content

fix(gateway): clear stale chat stream buffers#75089

Merged
osolmaz merged 2 commits into
openclaw:mainfrom
litang9:codex/fix-gateway-buffer-cleanup
May 29, 2026
Merged

fix(gateway): clear stale chat stream buffers#75089
osolmaz merged 2 commits into
openclaw:mainfrom
litang9:codex/fix-gateway-buffer-cleanup

Conversation

@litang9

@litang9 litang9 commented Apr 30, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Clear gateway chat raw/projected/delta buffers through one clearRun path on normal finish, abort, timeout, and stale maintenance cleanup.
  • Track bufferUpdatedAt so raw-only streamed buffers that never emitted deltas can be swept after their run is no longer active.
  • Include current-main broadcast text and agent text throttle state in the same per-run cleanup path.

Root cause

Gateway rawBuffers can retain full streamed assistant text. Some cleanup paths removed only projected buffers and delta metadata, and suppressed/raw-only output could avoid deltaSentAt entirely, so maintenance could miss it.

Impact

When long or frequent streamed chat runs hit abort, timeout, suppressed-output, or missing terminal-event paths, the retained raw buffers can accumulate in the gateway process and cause gateway memory to grow over time. This PR ensures those per-run buffers are cleaned consistently so stale chat state does not become a gateway memory leak.

Related Issues

Related to #51821, which reported stale chatRunState.buffers cleanup gaps for stuck runs.

Contributes to the broader Gateway RSS/session accumulation tracker #54155, but does not fully close that tracker.

Review follow-up

  • Refreshed the branch onto current main so the PR no longer relies on the stale Apr 30 base.
  • Included current-main deltaLastBroadcastText, agentDeltaSentAt, and bufferedAgentEvents in ChatRunState.clearRun() and full-state cleanup.
  • Addressed the original ClawSweeper P2 by removing the unrelated age-based toolStartData sweep from this PR. The final diff stays focused on Gateway chat buffer cleanup and does not delete active tool-start records before their matching end event.
  • Added the required CHANGELOG.md entry for the user-facing Gateway chat buffer memory-growth fix.
  • Marked the secrets reload system event as trusted: false to clear the OpenGrep explicit-trust finding on server.impl.ts.

Real behavior proof

  • Behavior or issue addressed: stale Gateway chat raw stream buffers can remain after abort/timeout/suppressed-output paths and grow gateway memory over time.
  • Real environment tested: local OpenClaw Gateway runtime from commit 0efb7685cbac7de522fb6fc604643993dfa9dc4c on macOS, using the real startGatewayMaintenanceTimers() source path.
  • Exact steps or command run after this patch: ran node --import tsx --input-type=module, created createChatRunState(), seeded a stale raw-only run in rawBuffers, bufferUpdatedAt, and deltaLastBroadcastText, started startGatewayMaintenanceTimers({ ...gatewayMaintenanceDeps, chatRunState }), then waited for the real maintenance interval with await new Promise((resolve) => setTimeout(resolve, 61_000)).
  • Evidence after fix: terminal output from the local runtime command:
{
  "label": "before maintenance",
  "runId": "runtime-proof-75089",
  "rawBuffers": 1,
  "bufferUpdatedAt": 1,
  "buffers": 0,
  "deltaSentAt": 0,
  "deltaLastBroadcastLen": 0,
  "deltaLastBroadcastText": 1,
  "agentDeltaSentAt": 0,
  "bufferedAgentEvents": 0
}
{
  "label": "after maintenance",
  "runId": "runtime-proof-75089",
  "rawBuffers": 0,
  "bufferUpdatedAt": 0,
  "buffers": 0,
  "deltaSentAt": 0,
  "deltaLastBroadcastLen": 0,
  "deltaLastBroadcastText": 0,
  "agentDeltaSentAt": 0,
  "bufferedAgentEvents": 0
}
  • Observed result after fix: after the real maintenance sweep, the stale raw-only run was removed from rawBuffers, bufferUpdatedAt, and deltaLastBroadcastText; projected/delta/agent maps were also empty.
  • What was not tested: production RSS graph was not captured; this proof targets the actual Gateway maintenance cleanup path responsible for sweeping stale per-run chat state.

Validation

  • pnpm test src/gateway/server.reload.test.ts src/gateway/server-maintenance.test.ts src/gateway/chat-abort.test.ts (3 files, 32 tests passed)
  • pnpm test src/gateway/server-methods/chat.abort-authorization.test.ts (1 file, 5 tests passed)
  • pnpm test:gateway (299 files, 3781 tests passed)
  • pnpm tsgo:core
  • pnpm tsgo:core:test
  • pnpm lint --threads=8 (previous local PR validation)
  • pnpm exec vitest run --config test/vitest/vitest.agents-core.config.ts src/agents/pi-embedded-subscribe.handlers.tools.test.ts
  • pnpm exec vitest run --config test/vitest/vitest.agents-core.config.ts src/agents/model-fallback.test.ts -t "does not classify successful results when the optional classifier returns null" (from earlier investigation)

Note: full agents-core had one flaky timeout in src/agents/model-fallback.test.ts; the timed-out test passed when rerun directly.

@openclaw-barnacle openclaw-barnacle Bot added app: web-ui App: web-ui gateway Gateway runtime agents Agent runtime and tooling triage: refactor-only Candidate: refactor/cleanup-only PR without maintainer context. size: M labels Apr 30, 2026
@clawsweeper

clawsweeper Bot commented Apr 30, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs maintainer review before merge. Reviewed May 29, 2026, 4:48 AM ET / 08:48 UTC.

Summary
The PR centralizes Gateway chat run cleanup through clearRun, tracks raw-buffer update times, and extends stale maintenance cleanup to raw-only streamed chat buffers.

PR surface: Source +9, Tests +97. Total +106 across 21 files.

Reproducibility: yes. for source-level reproduction: current main writes raw assistant stream text into rawBuffers, while stale maintenance and abort cleanup do not clear that map. I did not run a live long-uptime RSS reproduction in this read-only review.

Review metrics: none identified.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🦞 diamond lobster
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • none.

Risk before merge

  • [P1] The PR is currently draft with an explicit member hold because the merge wrapper damaged the PR state; it should not be landed until that maintainer hold is cleared.
  • [P1] This patch addresses one Gateway chat stream buffer retention path and does not fully resolve the broader Gateway RSS/session accumulation tracked at Gateway memory leak: 389MB → 14.7GB over 4 days with session accumulation #54155.
  • [P1] The supplied proof targets the actual maintenance cleanup path, but it does not include a production RSS graph over long uptime.

Maintainer options:

  1. Decide the mitigation before merge
    Keep the focused cleanup, have a maintainer resolve the draft/hold state and refresh checks on the current head, then land it as a narrow Gateway memory-cleanup fix if the hold is cleared.
  2. Pause or close
    Do not merge this PR until maintainers decide whether the risk is worth taking.

Next step before merge

  • [P2] A member explicitly put the draft PR on hold after the merge wrapper damaged the branch state, so the next action is maintainer review and hold clearance rather than automated repair.

Security
Cleared: The final diff changes Gateway TypeScript runtime and tests only, with no dependency, lockfile, workflow, package-script, secret-handling, or downloaded-code surface added.

Review details

Best possible solution:

Keep the focused cleanup, have a maintainer resolve the draft/hold state and refresh checks on the current head, then land it as a narrow Gateway memory-cleanup fix if the hold is cleared.

Do we have a high-confidence way to reproduce the issue?

Yes for source-level reproduction: current main writes raw assistant stream text into rawBuffers, while stale maintenance and abort cleanup do not clear that map. I did not run a live long-uptime RSS reproduction in this read-only review.

Is this the best way to solve the issue?

Yes for the narrow leak path: centralizing per-run cleanup in ChatRunState.clearRun is a maintainable fix and the added tests cover the raw-only stale-buffer case. The merge path still needs maintainer clearance because this PR is draft with an explicit hold.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 00ca654c7405.

Label changes

Label changes:

  • add proof: sufficient: Contributor real behavior proof is sufficient. The PR body provides after-fix terminal output from a local Gateway runtime using the real maintenance timer path, showing seeded stale raw buffer state cleared after the sweep.
  • add rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
  • add status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body provides after-fix terminal output from a local Gateway runtime using the real maintenance timer path, showing seeded stale raw buffer state cleared after the sweep.
  • remove rating: 🌊 off-meta tidepool: Current PR rating is rating: 🐚 platinum hermit, so this older rating label is no longer current.

Label justifications:

  • P2: The PR addresses a real but bounded Gateway memory-cleanup bug with focused code and tests, without evidence of an emergency production-wide outage.
  • rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body provides after-fix terminal output from a local Gateway runtime using the real maintenance timer path, showing seeded stale raw buffer state cleared after the sweep.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body provides after-fix terminal output from a local Gateway runtime using the real maintenance timer path, showing seeded stale raw buffer state cleared after the sweep.
Evidence reviewed

PR surface:

Source +9, Tests +97. Total +106 across 21 files.

View PR surface stats
Area Files Added Removed Net
Source 12 95 86 +9
Tests 9 149 52 +97
Docs 0 0 0 0
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 21 244 138 +106

What I checked:

  • Repository policy read: Read the full root AGENTS.md plus scoped Gateway guidance; the review applied the Gateway hot-path, proof, changelog, and maintainer-review rules. (AGENTS.md:1, 00ca654c7405)
  • Current main source gap: Current main keeps rawBuffers separate from projected buffers, writes raw streamed assistant text, and final cleanup removes raw buffers only on clean lifecycle finalization. (src/gateway/server-chat.ts:528, 00ca654c7405)
  • Current main stale cleanup misses raw buffers: The maintenance stale-buffer paths delete projected buffers and delta maps but do not delete chatRunState.rawBuffers, which matches the reported leak shape for orphaned raw-only streams. (src/gateway/server-maintenance.ts:242, 00ca654c7405)
  • PR implementation centralizes cleanup: The PR adds bufferUpdatedAt and a clearRun helper that deletes raw buffers, projected buffers, delta maps, broadcast text, and agent throttle buffers for a run. (src/gateway/server-chat-state.ts:102, 05ca0e30aca0)
  • PR maintenance coverage: The PR updates the maintenance sweep to use clearRun and adds a raw-buffer timestamp sweep for orphaned runs without active abort controllers. (src/gateway/server-maintenance.ts:232, 05ca0e30aca0)
  • Regression coverage: The PR adds focused maintenance tests proving active runs are preserved, orphaned stale raw/projected buffers are swept, aborted-run cleanup clears raw state, and raw-only stale buffers are removed. (src/gateway/server-maintenance.test.ts:311, 05ca0e30aca0)

Likely related people:

  • @steipete: Authored and committed several relevant Gateway chat lifecycle and stale-buffer fixes, including the guarded stale chat buffer sweep and chat abort hardening history. (role: feature-history owner; confidence: high; commits: f76e653776d9, 43b36bfe8cc7, a1533a17f703; files: src/gateway/server-maintenance.ts, src/gateway/server-chat.ts, src/gateway/chat-abort.ts)
  • @karanuppal: Authored the earlier stale chatRunState.buffers maintenance sweep that this PR extends to raw buffers. (role: original stale-buffer sweep contributor; confidence: medium; commits: 550deb8c1890; files: src/gateway/server-maintenance.ts)
  • @osolmaz: Authored the latest PR-head refactor commit and left the explicit member hold comment after force-pushing the branch. (role: recent PR follow-up owner; confidence: high; commits: 05ca0e30aca0; files: src/gateway/chat-abort.ts, src/gateway/server-maintenance.ts, src/gateway/server-close.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@litang9 litang9 changed the title [codex] fix gateway chat buffer cleanup fix(gateway): clear stale chat stream buffers Apr 30, 2026
@openclaw-barnacle openclaw-barnacle Bot added size: S and removed agents Agent runtime and tooling size: M labels Apr 30, 2026
@litang9 litang9 marked this pull request as ready for review April 30, 2026 13:35
@litang9

litang9 commented Apr 30, 2026

Copy link
Copy Markdown
Contributor Author

@clawsweeper please re-review. I addressed the P2 by removing the toolStartData TTL/max-entry sweep from this PR. The final diff now only changes Gateway chat buffer cleanup and no longer deletes active tool-start records before their matching end event.

@litang9

litang9 commented Apr 30, 2026

Copy link
Copy Markdown
Contributor Author

@clawsweeper please re-review.

Follow-ups from the last review are addressed:

  • Added the required CHANGELOG.md entry for the Gateway chat buffer memory-growth fix.
  • Marked the secrets reload system event as trusted: false to clear the OpenGrep explicit-trust finding.

Local checks passed: pnpm lint --threads=8, pnpm tsgo:core, pnpm tsgo:core:test, and targeted Gateway tests for reload, maintenance, and chat abort.

@openclaw-barnacle openclaw-barnacle Bot added the triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. label May 15, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 15, 2026
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 15, 2026
@litang9 litang9 force-pushed the codex/fix-gateway-buffer-cleanup branch from b146377 to 0efb768 Compare May 15, 2026 03:14
@openclaw-barnacle openclaw-barnacle Bot added size: M proof: supplied External PR includes structured after-fix real behavior proof. and removed size: S triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 15, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 15, 2026
@osolmaz osolmaz self-assigned this May 28, 2026
@openclaw-barnacle

Copy link
Copy Markdown

This assigned pull request has been automatically marked as stale after being open for 27 days.
Please add updates or it will be closed.

@openclaw-barnacle openclaw-barnacle Bot added the stale Marked as stale due to inactivity label May 29, 2026
@clawsweeper clawsweeper Bot added rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. labels May 29, 2026
@osolmaz

osolmaz commented May 29, 2026

Copy link
Copy Markdown
Member

merge-pr wrapper script fucked up the PR, hold merge

@osolmaz osolmaz force-pushed the codex/fix-gateway-buffer-cleanup branch from 5d02ce0 to 05ca0e3 Compare May 29, 2026 08:30
@openclaw-barnacle openclaw-barnacle Bot removed docs Improvements or additions to documentation channel: msteams Channel integration: msteams channel: voice-call Channel integration: voice-call cli CLI command changes scripts Repository scripts agents Agent runtime and tooling extensions: phone-control extensions: together extensions: codex plugin: google-meet plugin: file-transfer extensions: brave labels May 29, 2026
@osolmaz

osolmaz commented May 29, 2026

Copy link
Copy Markdown
Member

Merged via squash.

  • Prepared head SHA: 05ca0e3
  • Merge commit: 18f9310
  • Diff after repair: +244/-138 across 21 gateway files

Verification:

  • gh pr checks 75089 --required --watch --fail-fast passed (dependency-guard).
  • node scripts/run-vitest.mjs src/gateway/chat-abort.test.ts src/gateway/server-maintenance.test.ts src/gateway/server-close.test.ts src/gateway/server-methods/models-auth-status.test.ts src/gateway/server-methods/chat.abort-authorization.test.ts passed: 9 files, 136 tests.
  • pnpm build and pnpm check passed after the rebase.
  • Local live-style Gateway WebSocket RPC proof passed with a mock OpenAI Responses SSE provider.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

app: web-ui App: web-ui gateway Gateway runtime P2 Normal backlog priority with limited blast radius. proof: sufficient ClawSweeper judged the real behavior proof convincing. proof: supplied External PR includes structured after-fix real behavior proof. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. size: M stale Marked as stale due to inactivity status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. triage: refactor-only Candidate: refactor/cleanup-only PR without maintainer context.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants