fix(gateway): clear stale chat stream buffers by litang9 · Pull Request #75089 · openclaw/openclaw

litang9 · 2026-04-30T13:14:54Z

Summary

Clear gateway chat raw/projected/delta buffers through one clearRun path on normal finish, abort, timeout, and stale maintenance cleanup.
Track bufferUpdatedAt so raw-only streamed buffers that never emitted deltas can be swept after their run is no longer active.
Include current-main broadcast text and agent text throttle state in the same per-run cleanup path.

Root cause

Gateway rawBuffers can retain full streamed assistant text. Some cleanup paths removed only projected buffers and delta metadata, and suppressed/raw-only output could avoid deltaSentAt entirely, so maintenance could miss it.

Impact

When long or frequent streamed chat runs hit abort, timeout, suppressed-output, or missing terminal-event paths, the retained raw buffers can accumulate in the gateway process and cause gateway memory to grow over time. This PR ensures those per-run buffers are cleaned consistently so stale chat state does not become a gateway memory leak.

Related Issues

Related to #51821, which reported stale chatRunState.buffers cleanup gaps for stuck runs.

Contributes to the broader Gateway RSS/session accumulation tracker #54155, but does not fully close that tracker.

Review follow-up

Refreshed the branch onto current main so the PR no longer relies on the stale Apr 30 base.
Included current-main deltaLastBroadcastText, agentDeltaSentAt, and bufferedAgentEvents in ChatRunState.clearRun() and full-state cleanup.
Addressed the original ClawSweeper P2 by removing the unrelated age-based toolStartData sweep from this PR. The final diff stays focused on Gateway chat buffer cleanup and does not delete active tool-start records before their matching end event.
Added the required CHANGELOG.md entry for the user-facing Gateway chat buffer memory-growth fix.
Marked the secrets reload system event as trusted: false to clear the OpenGrep explicit-trust finding on server.impl.ts.

Real behavior proof

Behavior or issue addressed: stale Gateway chat raw stream buffers can remain after abort/timeout/suppressed-output paths and grow gateway memory over time.
Real environment tested: local OpenClaw Gateway runtime from commit 0efb7685cbac7de522fb6fc604643993dfa9dc4c on macOS, using the real startGatewayMaintenanceTimers() source path.
Exact steps or command run after this patch: ran node --import tsx --input-type=module, created createChatRunState(), seeded a stale raw-only run in rawBuffers, bufferUpdatedAt, and deltaLastBroadcastText, started startGatewayMaintenanceTimers({ ...gatewayMaintenanceDeps, chatRunState }), then waited for the real maintenance interval with await new Promise((resolve) => setTimeout(resolve, 61_000)).
Evidence after fix: terminal output from the local runtime command:

{
  "label": "before maintenance",
  "runId": "runtime-proof-75089",
  "rawBuffers": 1,
  "bufferUpdatedAt": 1,
  "buffers": 0,
  "deltaSentAt": 0,
  "deltaLastBroadcastLen": 0,
  "deltaLastBroadcastText": 1,
  "agentDeltaSentAt": 0,
  "bufferedAgentEvents": 0
}

{
  "label": "after maintenance",
  "runId": "runtime-proof-75089",
  "rawBuffers": 0,
  "bufferUpdatedAt": 0,
  "buffers": 0,
  "deltaSentAt": 0,
  "deltaLastBroadcastLen": 0,
  "deltaLastBroadcastText": 0,
  "agentDeltaSentAt": 0,
  "bufferedAgentEvents": 0
}

Observed result after fix: after the real maintenance sweep, the stale raw-only run was removed from rawBuffers, bufferUpdatedAt, and deltaLastBroadcastText; projected/delta/agent maps were also empty.
What was not tested: production RSS graph was not captured; this proof targets the actual Gateway maintenance cleanup path responsible for sweeping stale per-run chat state.

Validation

pnpm test src/gateway/server.reload.test.ts src/gateway/server-maintenance.test.ts src/gateway/chat-abort.test.ts (3 files, 32 tests passed)
pnpm test src/gateway/server-methods/chat.abort-authorization.test.ts (1 file, 5 tests passed)
pnpm test:gateway (299 files, 3781 tests passed)
pnpm tsgo:core
pnpm tsgo:core:test
pnpm lint --threads=8 (previous local PR validation)
pnpm exec vitest run --config test/vitest/vitest.agents-core.config.ts src/agents/pi-embedded-subscribe.handlers.tools.test.ts
pnpm exec vitest run --config test/vitest/vitest.agents-core.config.ts src/agents/model-fallback.test.ts -t "does not classify successful results when the optional classifier returns null" (from earlier investigation)

Note: full agents-core had one flaky timeout in src/agents/model-fallback.test.ts; the timed-out test passed when rerun directly.

clawsweeper · 2026-04-30T13:20:35Z

Codex review: needs maintainer review before merge. Reviewed May 29, 2026, 4:48 AM ET / 08:48 UTC.

Summary
The PR centralizes Gateway chat run cleanup through clearRun, tracks raw-buffer update times, and extends stale maintenance cleanup to raw-only streamed chat buffers.

PR surface: Source +9, Tests +97. Total +106 across 21 files.

Reproducibility: yes. for source-level reproduction: current main writes raw assistant stream text into rawBuffers, while stale maintenance and abort cleanup do not clear that map. I did not run a live long-uptime RSS reproduction in this read-only review.

Review metrics: none identified.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🦞 diamond lobster
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

none.

Risk before merge

[P1] The PR is currently draft with an explicit member hold because the merge wrapper damaged the PR state; it should not be landed until that maintainer hold is cleared.
[P1] This patch addresses one Gateway chat stream buffer retention path and does not fully resolve the broader Gateway RSS/session accumulation tracked at Gateway memory leak: 389MB → 14.7GB over 4 days with session accumulation #54155.
[P1] The supplied proof targets the actual maintenance cleanup path, but it does not include a production RSS graph over long uptime.

Maintainer options:

Decide the mitigation before merge
Keep the focused cleanup, have a maintainer resolve the draft/hold state and refresh checks on the current head, then land it as a narrow Gateway memory-cleanup fix if the hold is cleared.
Pause or close
Do not merge this PR until maintainers decide whether the risk is worth taking.

Next step before merge

[P2] A member explicitly put the draft PR on hold after the merge wrapper damaged the branch state, so the next action is maintainer review and hold clearance rather than automated repair.

Security
Cleared: The final diff changes Gateway TypeScript runtime and tests only, with no dependency, lockfile, workflow, package-script, secret-handling, or downloaded-code surface added.

Review details

Best possible solution:

Keep the focused cleanup, have a maintainer resolve the draft/hold state and refresh checks on the current head, then land it as a narrow Gateway memory-cleanup fix if the hold is cleared.

Do we have a high-confidence way to reproduce the issue?

Yes for source-level reproduction: current main writes raw assistant stream text into rawBuffers, while stale maintenance and abort cleanup do not clear that map. I did not run a live long-uptime RSS reproduction in this read-only review.

Is this the best way to solve the issue?

Yes for the narrow leak path: centralizing per-run cleanup in ChatRunState.clearRun is a maintainable fix and the added tests cover the raw-only stale-buffer case. The merge path still needs maintainer clearance because this PR is draft with an explicit hold.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 00ca654c7405.

Label changes

Label changes:

add proof: sufficient: Contributor real behavior proof is sufficient. The PR body provides after-fix terminal output from a local Gateway runtime using the real maintenance timer path, showing seeded stale raw buffer state cleared after the sweep.
add rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
add status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body provides after-fix terminal output from a local Gateway runtime using the real maintenance timer path, showing seeded stale raw buffer state cleared after the sweep.
remove rating: 🌊 off-meta tidepool: Current PR rating is rating: 🐚 platinum hermit, so this older rating label is no longer current.

Label justifications:

P2: The PR addresses a real but bounded Gateway memory-cleanup bug with focused code and tests, without evidence of an emergency production-wide outage.
rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body provides after-fix terminal output from a local Gateway runtime using the real maintenance timer path, showing seeded stale raw buffer state cleared after the sweep.
proof: sufficient: Contributor real behavior proof is sufficient. The PR body provides after-fix terminal output from a local Gateway runtime using the real maintenance timer path, showing seeded stale raw buffer state cleared after the sweep.

Evidence reviewed

PR surface:

Source +9, Tests +97. Total +106 across 21 files.

View PR surface stats

Area	Files	Added	Removed	Net
Source	12	95	86	+9
Tests	9	149	52	+97
Docs	0	0	0	0
Config	0	0	0	0
Generated	0	0	0	0
Other	0	0	0	0
Total	21	244	138	+106

What I checked:

Repository policy read: Read the full root AGENTS.md plus scoped Gateway guidance; the review applied the Gateway hot-path, proof, changelog, and maintainer-review rules. (AGENTS.md:1, 00ca654c7405)
Current main source gap: Current main keeps rawBuffers separate from projected buffers, writes raw streamed assistant text, and final cleanup removes raw buffers only on clean lifecycle finalization. (src/gateway/server-chat.ts:528, 00ca654c7405)
Current main stale cleanup misses raw buffers: The maintenance stale-buffer paths delete projected buffers and delta maps but do not delete chatRunState.rawBuffers, which matches the reported leak shape for orphaned raw-only streams. (src/gateway/server-maintenance.ts:242, 00ca654c7405)
PR implementation centralizes cleanup: The PR adds bufferUpdatedAt and a clearRun helper that deletes raw buffers, projected buffers, delta maps, broadcast text, and agent throttle buffers for a run. (src/gateway/server-chat-state.ts:102, 05ca0e30aca0)
PR maintenance coverage: The PR updates the maintenance sweep to use clearRun and adds a raw-buffer timestamp sweep for orphaned runs without active abort controllers. (src/gateway/server-maintenance.ts:232, 05ca0e30aca0)
Regression coverage: The PR adds focused maintenance tests proving active runs are preserved, orphaned stale raw/projected buffers are swept, aborted-run cleanup clears raw state, and raw-only stale buffers are removed. (src/gateway/server-maintenance.test.ts:311, 05ca0e30aca0)

Likely related people:

@steipete: Authored and committed several relevant Gateway chat lifecycle and stale-buffer fixes, including the guarded stale chat buffer sweep and chat abort hardening history. (role: feature-history owner; confidence: high; commits: f76e653776d9, 43b36bfe8cc7, a1533a17f703; files: src/gateway/server-maintenance.ts, src/gateway/server-chat.ts, src/gateway/chat-abort.ts)
@karanuppal: Authored the earlier stale chatRunState.buffers maintenance sweep that this PR extends to raw buffers. (role: original stale-buffer sweep contributor; confidence: medium; commits: 550deb8c1890; files: src/gateway/server-maintenance.ts)
@osolmaz: Authored the latest PR-head refactor commit and left the explicit member hold comment after force-pushing the branch. (role: recent PR follow-up owner; confidence: high; commits: 05ca0e30aca0; files: src/gateway/chat-abort.ts, src/gateway/server-maintenance.ts, src/gateway/server-close.ts)

What the crustacean ranks mean

🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works

ClawSweeper keeps one durable marker-backed review comment per issue or PR.
Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
Maintainers can also comment @clawsweeper review to request a fresh review only.
Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

litang9 · 2026-04-30T13:37:55Z

@clawsweeper please re-review. I addressed the P2 by removing the toolStartData TTL/max-entry sweep from this PR. The final diff now only changes Gateway chat buffer cleanup and no longer deletes active tool-start records before their matching end event.

litang9 · 2026-04-30T23:25:27Z

@clawsweeper please re-review.

Follow-ups from the last review are addressed:

Added the required CHANGELOG.md entry for the Gateway chat buffer memory-growth fix.
Marked the secrets reload system event as trusted: false to clear the OpenGrep explicit-trust finding.

Local checks passed: pnpm lint --threads=8, pnpm tsgo:core, pnpm tsgo:core:test, and targeted Gateway tests for reload, maintenance, and chat abort.

openclaw-barnacle · 2026-05-29T04:52:55Z

This assigned pull request has been automatically marked as stale after being open for 27 days.
Please add updates or it will be closed.

osolmaz · 2026-05-29T08:29:58Z

merge-pr wrapper script fucked up the PR, hold merge

osolmaz · 2026-05-29T09:02:19Z

Merged via squash.

Prepared head SHA: 05ca0e3
Merge commit: 18f9310
Diff after repair: +244/-138 across 21 gateway files

Verification:

gh pr checks 75089 --required --watch --fail-fast passed (dependency-guard).
node scripts/run-vitest.mjs src/gateway/chat-abort.test.ts src/gateway/server-maintenance.test.ts src/gateway/server-close.test.ts src/gateway/server-methods/models-auth-status.test.ts src/gateway/server-methods/chat.abort-authorization.test.ts passed: 9 files, 136 tests.
pnpm build and pnpm check passed after the rebase.
Local live-style Gateway WebSocket RPC proof passed with a mock OpenAI Responses SSE provider.

openclaw-barnacle Bot added app: web-ui App: web-ui gateway Gateway runtime agents Agent runtime and tooling triage: refactor-only Candidate: refactor/cleanup-only PR without maintainer context. size: M labels Apr 30, 2026

litang9 changed the title ~~[codex] fix gateway chat buffer cleanup~~ fix(gateway): clear stale chat stream buffers Apr 30, 2026

openclaw-barnacle Bot added size: S and removed agents Agent runtime and tooling size: M labels Apr 30, 2026

litang9 marked this pull request as ready for review April 30, 2026 13:35

clawsweeper Bot mentioned this pull request May 1, 2026

Gateway memory leak: 389MB → 14.7GB over 4 days with session accumulation #54155

Open

This was referenced May 10, 2026

Gateway retains raw chat buffers and stalls on stale session index growth #80282

Closed

[Bug]: Gateway process memory leak - Working Set grows from ~500MB to 1.1GB then crash (OpenClaw 2026.5.6, Windows) #81733

Closed

openclaw-barnacle Bot added the triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. label May 15, 2026

clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 15, 2026

openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 15, 2026

litang9 force-pushed the codex/fix-gateway-buffer-cleanup branch from b146377 to 0efb768 Compare May 15, 2026 03:14

openclaw-barnacle Bot added size: M proof: supplied External PR includes structured after-fix real behavior proof. and removed size: S triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 15, 2026

clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 15, 2026

osolmaz self-assigned this May 28, 2026

openclaw-barnacle Bot added the stale Marked as stale due to inactivity label May 29, 2026

clawsweeper Bot added rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. labels May 29, 2026

osolmaz marked this pull request as draft May 29, 2026 08:29

osolmaz force-pushed the codex/fix-gateway-buffer-cleanup branch from 5d02ce0 to 05ca0e3 Compare May 29, 2026 08:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(gateway): clear stale chat stream buffers#75089

fix(gateway): clear stale chat stream buffers#75089
osolmaz merged 2 commits into
openclaw:mainfrom
litang9:codex/fix-gateway-buffer-cleanup

litang9 commented Apr 30, 2026 •

edited

Loading

Uh oh!

clawsweeper Bot commented Apr 30, 2026 •

edited

Loading

Uh oh!

litang9 commented Apr 30, 2026

Uh oh!

litang9 commented Apr 30, 2026

Uh oh!

openclaw-barnacle Bot commented May 29, 2026

Uh oh!

osolmaz commented May 29, 2026

Uh oh!

osolmaz commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

litang9 commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root cause

Impact

Related Issues

Review follow-up

Real behavior proof

Validation

Uh oh!

clawsweeper Bot commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

litang9 commented Apr 30, 2026

Uh oh!

litang9 commented Apr 30, 2026

Uh oh!

openclaw-barnacle Bot commented May 29, 2026

Uh oh!

osolmaz commented May 29, 2026

Uh oh!

osolmaz commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

litang9 commented Apr 30, 2026 •

edited

Loading

clawsweeper Bot commented Apr 30, 2026 •

edited

Loading