fix(gateway): back off session tool mirrors under pressure#84846

Merged
galiniliev merged 1 commit into
openclaw:mainfrom
galiniliev:bug-033-queue-pressure-backoff
May 25, 2026

Conversation

@galiniliev

@galiniliev galiniliev commented May 21, 2026

Contributor

Summary

  • Problem: gateway diagnostic heartbeats can keep reporting queued work during CPU/event-loop pressure while low-priority session tool mirrors continue at normal priority.
  • Solution: record a bounded diagnostic queue-pressure backoff window and use it to suppress non-terminal session-scoped tool mirrors.
  • What changed: diagnostic heartbeat state now exposes an active queue-pressure backoff, and gateway tool event fanout keeps run-scoped and terminal tool updates flowing while backing off lower-priority session mirrors.
  • What did NOT change (scope boundary): this does not change core tool execution, provider behavior, run-scoped tool subscribers, or terminal session tool events.
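A minimal sketch of what a bounded queue-pressure backoff window could look like. Every name here (`QueuePressureBackoff`, `HeartbeatSample`, `noteHeartbeat`, the `pressureReasons` values) is hypothetical and not taken from the patch; the 60s window length is the figure the automated review reports for this PR.

```typescript
// Hypothetical sketch; the identifiers in src/logging/diagnostic.ts may differ.
const QUEUE_PRESSURE_BACKOFF_MS = 60_000; // window length reported in the review

interface HeartbeatSample {
  pressureReasons: string[]; // illustrative values, e.g. ["cpu"]
  queued: number; // queued work reported by the heartbeat
}

class QueuePressureBackoff {
  private untilMs: number | null = null;

  // Open (or extend) the window only when pressure and queued work coincide.
  noteHeartbeat(sample: HeartbeatSample, nowMs: number): void {
    if (sample.pressureReasons.length > 0 && sample.queued > 0) {
      this.untilMs = nowMs + QUEUE_PRESSURE_BACKOFF_MS;
    }
  }

  // The window is bounded: it expires on its own once nowMs reaches untilMs.
  isActive(nowMs: number): boolean {
    return this.untilMs !== null && nowMs < this.untilMs;
  }

  // Called when diagnostics are disabled or stopped.
  clear(): void {
    this.untilMs = null;
  }
}
```

Passing `nowMs` explicitly instead of reading the clock inside the class keeps the sketch deterministic under test, which suits the seam-level coverage this PR relies on.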

Motivation

  • The linked bug has a 60s CPU sample where gateway CPU averaged 83.66% with 42/60 samples at or above 100%, while heartbeats still reported active=1 queued=3 and a fetch timeout drifted by 7175ms. Backing off lower-priority mirror traffic during that condition reduces avoidable websocket work while preserving completion signals.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

Real behavior proof (required for external PRs)

  • Behavior or issue addressed: Gateway scheduler pressure from [Bug]: Gateway scheduler keeps work queued while CPU is saturated #84844. During sustained gateway CPU pressure with queued work, lower-priority session tool mirror events should back off while terminal and run-scoped tool events still flow.
  • Real environment tested: Local Linux OpenClaw worktree on Node 24, using the repo Vitest wrapper plus the redacted runtime/pidstat log evidence from the affected gateway run.
  • Exact steps or command run after this patch: node scripts/run-vitest.mjs src/logging/diagnostic.test.ts src/gateway/server-chat.agent-events.test.ts and git diff --check.
  • Evidence after fix (screenshot, recording, terminal capture, console output, redacted runtime log, linked artifact, or copied live output): Copied terminal capture from the patched worktree:
$ node scripts/run-vitest.mjs src/logging/diagnostic.test.ts src/gateway/server-chat.agent-events.test.ts

Test Files  2 passed (2)
     Tests  118 passed (118)

$ git diff --check
(no output; exit 0)
  • Observed result after fix: The regression coverage now proves diagnostic queue-pressure backoff is recorded and cleared, non-terminal session-scoped tool mirrors are suppressed during the active backoff window, run-scoped tool recipients still receive tool events, and terminal session tool events (end, error, result) are preserved so UI tool cards can complete.
  • What was not tested: I did not rerun the private 60s live CPU-saturation gateway session because the original setup/session details are local and redacted; the fix was verified at the diagnostic and gateway fanout seams that own this behavior.
  • Before evidence (optional but encouraged): Redacted source evidence from the affected run:
pidstat summary:
rows=60
avg_cpu=83.66
avg_usr=79.42
avg_sys=4.25
cpu_ge_100_count=42
max_cpu=190.0 at 05:07:47
avg_rd=0.00

Gateway log correlation:
2026-05-21T05:07:27.790+00:00 diagnostic heartbeat: webhooks=0/0/0 active=1 waiting=0 queued=3
2026-05-21T05:08:04.306+00:00 fetch-timeout timeoutMs=10000 elapsedMs=17175 timerDelayMs=7175 eventLoopDelayHint="timer delayed 7175ms, likely event-loop starvation"
2026-05-21T05:08:04.316+00:00 diagnostic heartbeat: webhooks=0/0/0 active=1 waiting=0 queued=3
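The numbers in the log correlation above can be cross-checked with a small sketch. The helper names `parseHeartbeat` and `timerDelayMs` are illustrative and not part of the gateway; the regular expression only assumes the `active=… waiting=… queued=…` layout visible in the heartbeat lines.

```typescript
// Hypothetical helpers for reading the log excerpt; not gateway code.
interface HeartbeatCounters {
  active: number;
  waiting: number;
  queued: number;
}

// Extract the work counters from a "diagnostic heartbeat" log line.
function parseHeartbeat(line: string): HeartbeatCounters | null {
  const match = /active=(\d+) waiting=(\d+) queued=(\d+)/.exec(line);
  if (match === null) return null;
  return {
    active: Number(match[1]),
    waiting: Number(match[2]),
    queued: Number(match[3]),
  };
}

// How late a timer fired relative to the timeout it was armed with.
function timerDelayMs(timeoutMs: number, elapsedMs: number): number {
  return Math.max(0, elapsedMs - timeoutMs);
}

const counters = parseHeartbeat(
  "2026-05-21T05:07:27.790+00:00 diagnostic heartbeat: webhooks=0/0/0 active=1 waiting=0 queued=3",
);
console.log(counters); // { active: 1, waiting: 0, queued: 3 }
console.log(timerDelayMs(10000, 17175)); // 7175
```

The second call reproduces the logged `timerDelayMs=7175` from `timeoutMs=10000` and `elapsedMs=17175`.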

Root Cause (if applicable)

  • Root cause: diagnostic heartbeat liveness already identified pressure reasons and queued work, but gateway session tool mirror fanout did not use that pressure signal to reduce lower-priority websocket work.
  • Missing detection / guardrail: there was no test covering queue-pressure backoff state or session tool mirror suppression during that state.
  • Contributing context (if known): terminal session tool events still need to be delivered even under pressure so clients do not leave tool cards stale.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/logging/diagnostic.test.ts, src/gateway/server-chat.agent-events.test.ts
  • Scenario the test should lock in: heartbeat queue pressure activates and clears the backoff state; active backoff suppresses non-terminal session tool mirrors but preserves run-scoped and terminal events.
  • Why this is the smallest reliable guardrail: the behavior is owned by the diagnostic heartbeat state and gateway websocket fanout seam, without needing a provider or live channel.
  • Existing test that already covers this (if any): none before this patch.
  • If no new test is added, why not: N/A

User-visible / Behavior Changes

During diagnostic queue pressure, lower-priority session tool mirror updates may be dropped temporarily; terminal session tool events and run-scoped tool events are still delivered.

Diagram (if applicable)

Before:
[diagnostic pressure + queued work] -> [all session tool mirrors continue]

After:
[diagnostic pressure + queued work] -> [non-terminal session mirrors back off] -> [terminal/run-scoped tool events still flow]
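The "After" path can be sketched as a single routing decision. This is an assumption-laden illustration, not the code in src/gateway/server-chat.ts: the function name `shouldDeliverToolEvent`, the `Scope` type, and the non-terminal phase name `"update"` are hypothetical, while the terminal phases (`end`, `error`, `result`) are the ones this PR description lists.

```typescript
// Hypothetical sketch of the fanout decision; real identifiers may differ.
type Scope = "run" | "session";

// Terminal phases named in this PR description.
const TERMINAL_PHASES: ReadonlySet<string> = new Set(["end", "error", "result"]);

function shouldDeliverToolEvent(
  scope: Scope,
  phase: string,
  backoffActive: boolean,
): boolean {
  // Run-scoped subscribers are never throttled by the backoff.
  if (scope === "run") return true;
  // Terminal session events always flow so tool cards can complete.
  if (TERMINAL_PHASES.has(phase)) return true;
  // Non-terminal session mirrors are skipped only while backoff is active.
  return !backoffActive;
}

console.log(shouldDeliverToolEvent("session", "update", true)); // false
console.log(shouldDeliverToolEvent("session", "end", true)); // true
console.log(shouldDeliverToolEvent("run", "update", true)); // true
```

Checking scope and terminal phase before the backoff flag means the pressure signal can only ever remove low-priority traffic, which is the scope boundary the Summary states.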

Security Impact (required)

  • New permissions/capabilities? (Yes/No): No
  • Secrets/tokens handling changed? (Yes/No): No
  • New/changed network calls? (Yes/No): No
  • Command/tool execution surface changed? (Yes/No): No
  • Data access scope changed? (Yes/No): No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: Linux
  • Runtime/container: Node 24 local worktree
  • Model/provider: NOT_ENOUGH_INFO for the original runtime sample; tests do not require provider credentials.
  • Integration/channel (if any): Gateway websocket/tool event fanout
  • Relevant config (redacted): Original private session details were redacted.

Steps

  1. Run node scripts/run-vitest.mjs src/logging/diagnostic.test.ts src/gateway/server-chat.agent-events.test.ts.
  2. Run git diff --check.
  3. Compare coverage to the redacted before evidence from [Bug]: Gateway scheduler keeps work queued while CPU is saturated #84844.

Expected

  • Queue pressure activates a bounded diagnostic backoff when the heartbeat reports pressure reasons and queued work.
  • Non-terminal session tool mirrors are skipped during that backoff.
  • Run-scoped tool listeners and terminal session events still receive updates.

Actual

  • Both targeted test files passed, 118 tests total.
  • git diff --check passed with no output.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: diagnostic queue-pressure backoff state activation/clearing; non-terminal session mirror suppression; terminal and run-scoped event preservation.
  • Edge cases checked: diagnostics disabled/stopped clears backoff; terminal tool phases still reach session clients.
  • What you did not verify: private live 60s CPU-saturation rerun from the original affected setup.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes/No): Yes
  • Config/env changes? (Yes/No): No
  • Migration needed? (Yes/No): No
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

  • Risk: Suppressing session mirrors too broadly could leave clients with stale tool cards.
    • Mitigation: terminal session tool events are explicitly preserved and covered by regression tests.

openclaw-barnacle Bot added labels on May 21, 2026: gateway (Gateway runtime); size: S; maintainer (Maintainer-authored PR)
@clawsweeper

clawsweeper Bot commented May 21, 2026

Contributor

Codex review: needs real behavior proof before merge. Reviewed May 24, 2026, 9:26 PM ET / 01:26 UTC.

Summary
The PR adds diagnostic queue-pressure backoff state and uses it in gateway tool-event fanout to suppress non-terminal session-scoped tool mirrors while preserving run-scoped and terminal tool updates, with focused tests.

PR surface: Source +48, Tests +124. Total +172 across 4 files.

Reproducibility: no. No high-confidence live reproduction was run in this read-only review. The linked issue provides redacted CPU and gateway logs, and current main source shows session tool mirrors are always sent when visible, so the source-level path is clear.

Merge readiness
Overall: 🦪 silver shellfish
Proof: 🦪 silver shellfish
Patch quality: 🐚 platinum hermit
Result: blocked until real behavior proof from a real setup is added.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.
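The "weaker of the two" rule can be illustrated as a minimum over an ordered rank list. This is only a sketch under the assumption that the ranks form a strict order from unranked krab up to challenger crab, as the rank legend suggests; `RANKS` and `overallRank` are hypothetical names, and ClawSweeper's actual implementation is not shown in this PR.

```typescript
// Hypothetical sketch; assumes the legend's ranks are strictly ordered.
const RANKS = [
  "unranked krab",
  "silver shellfish",
  "gold shrimp",
  "platinum hermit",
  "diamond lobster",
  "challenger crab",
] as const;

type Rank = (typeof RANKS)[number];

// Overall readiness is whichever of the two ratings sits lower in the order.
function overallRank(proof: Rank, patchQuality: Rank): Rank {
  return RANKS.indexOf(proof) <= RANKS.indexOf(patchQuality) ? proof : patchQuality;
}

console.log(overallRank("silver shellfish", "platinum hermit")); // silver shellfish
```

With proof at silver shellfish and patch quality at platinum hermit, the sketch yields silver shellfish, matching the Overall rating in this review.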

Rank-up moves:

  • Add redacted after-fix runtime proof from a real gateway pressure run showing queue pressure, session mirror backoff, and preserved terminal/run-scoped tool delivery.
  • Wait for the two pending check runs to finish before merge.
  • Get explicit maintainer acceptance of the session-mirror delivery policy under pressure.

Proof guidance:
Needs real behavior proof before merge: The after-fix evidence is copied terminal output from targeted tests, not a real gateway pressure run; contributor action is needed with redacted logs, terminal output, or a recording before merge. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Risk before merge

  • Merging this intentionally drops non-terminal session-scoped tool mirror messages during diagnostic pressure, so operator UIs may lose intermediate live tool updates under load.
  • The current proof is targeted tests plus before-run logs; it does not show an after-fix live gateway pressure run with mirror suppression and terminal/run-scoped delivery still working.
  • Two remote check runs were still pending at review time, so merge readiness still depends on the final check rollup.

Maintainer options:

  1. Require pressure-path proof (recommended)
    Ask for or run a redacted live gateway pressure proof showing non-terminal session mirrors backing off while run-scoped and terminal tool events still arrive.
  2. Accept seam coverage intentionally
    Maintainers can explicitly accept the delivery-policy risk based on the added diagnostic and gateway fanout tests if the live pressure setup cannot be rerun.
  3. Pause for UI policy review
    Pause the PR if losing intermediate session tool mirrors under pressure needs broader Control UI or operator-experience approval.

Next step before merge
Maintainer review is needed because the protected label, missing external runtime proof, and intentional message-delivery tradeoff are not a safe automated code-repair target.

Security
Cleared: The diff only changes in-process diagnostic state, gateway fanout branching, and tests; it adds no dependency, permission, credential, network, or code execution surface.

Review details

Best possible solution:

Land the gateway reliability change only after maintainers accept the session-mirror delivery tradeoff and review redacted after-fix runtime proof for the pressure path.

Do we have a high-confidence way to reproduce the issue?

No high-confidence live reproduction was run in this read-only review. The linked issue provides redacted CPU and gateway logs, and current main source shows session tool mirrors are always sent when visible, so the source-level path is clear.

Is this the best way to solve the issue?

Mostly yes, pending maintainer approval. The patch changes the owning diagnostic heartbeat and gateway fanout seams and preserves terminal/run-scoped tool events, but the intentional delivery tradeoff should be accepted with after-fix runtime proof.

Codex review notes: model gpt-5.5, reasoning high; reviewed against f37fbc9ef49e.

Label changes

Label justifications:

  • P2: This is a focused gateway reliability fix for CPU-pressure behavior with limited blast radius but real operator-facing impact.
  • merge-risk: 🚨 message-delivery: The PR intentionally suppresses a class of session-scoped tool mirror messages during diagnostic pressure.
  • rating: 🦪 silver shellfish: Overall readiness is 🦪 silver shellfish; proof is 🦪 silver shellfish and patch quality is 🐚 platinum hermit.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The after-fix evidence is copied terminal output from targeted tests, not a real gateway pressure run; contributor action is needed with redacted logs, terminal output, or a recording before merge. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.
Evidence reviewed

PR surface:

Source +48, Tests +124. Total +172 across 4 files.

View PR surface stats
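| Area | Files | Added | Removed | Net |
| --- | --- | --- | --- | --- |
| Source | 2 | 49 | 1 | +48 |
| Tests | 2 | 124 | 0 | +124 |
| Docs | 0 | 0 | 0 | 0 |
| Config | 0 | 0 | 0 | 0 |
| Generated | 0 | 0 | 0 | 0 |
| Other | 0 | 0 | 0 | 0 |
| Total | 4 | 173 | 1 | +172 |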

What I checked:

  • Protected label: Live PR metadata lists the protected maintainer label, so cleanup should not close this PR and the delivery-policy change needs explicit human handling. (90575599fe45)
  • Current main baseline: Current main mirrors visible non-heartbeat tool events to all session subscribers as session.tool with dropIfSlow, so the PR changes message delivery behavior rather than removing dead code. (src/gateway/server-chat.ts:913, f37fbc9ef49e)
  • PR fanout behavior: The PR computes shouldMirrorSessionToolEvent from diagnostic backoff state and terminal tool phases, preserving terminal session mirrors while suppressing low-priority non-terminal mirrors. (src/gateway/server-chat.ts:922, 90575599fe45)
  • PR diagnostic behavior: The PR records a 60s queue-pressure backoff window when liveness sampling reports pressure reasons while queued work exists, and clears the state when diagnostics stop or test state resets. (src/logging/diagnostic.ts:1143, 90575599fe45)
  • Regression coverage: The PR adds gateway fanout tests for suppressing non-terminal session mirrors and preserving terminal mirrors, plus a diagnostic test for marking and clearing queue-pressure backoff. (src/gateway/server-chat.agent-events.test.ts:1525, 90575599fe45)
  • Linked bug evidence: The linked issue is still open and reports a Linux gateway CPU-saturation sample from OpenClaw 2026.5.20 with queued work and delayed timers; this PR uses closing syntax for that report, so the issue should remain open until the PR lands.

Likely related people:

  • obviyus: Local blame in the current shallow checkout attributes the inspected gateway fanout and diagnostic heartbeat baseline to commit cc5eb97. (role: recent area contributor; confidence: medium; commits: cc5eb972e69a; files: src/gateway/server-chat.ts, src/logging/diagnostic.ts)
  • samzong: GitHub path history shows recent gateway lifecycle and fanout work in src/gateway/server-chat.ts and its adjacent event-handler tests. (role: recent gateway contributor; confidence: medium; commits: bc2d501b1dcb, 9d56f4aa14a8, bb8aa0cfe2b0; files: src/gateway/server-chat.ts, src/gateway/server-chat.agent-events.test.ts)
  • steipete: GitHub history shows recent diagnostic liveness work and commits/merges on the gateway and diagnostics files touched by this PR. (role: recent diagnostics and gateway committer; confidence: medium; commits: 669786595d64, 66c64a29ee60, a6497b175905; files: src/logging/diagnostic.ts, src/logging/diagnostic.test.ts, src/gateway/server-chat.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

clawsweeper Bot added labels on May 21, 2026:

  • rating: 🦪 silver shellfish (Thin PR readiness signal; proof, validation, or implementation needs work.)
  • status: 📣 needs proof (The PR needs real behavior proof before ClawSweeper can clear the contributor ask.)
  • P2 (Normal backlog priority with limited blast radius.)
  • merge-risk: 🚨 message-delivery (🚨 May drop, duplicate, misroute, suppress, or wrongly target messages.)
@clawsweeper

clawsweeper Bot commented May 21, 2026

Contributor

ClawSweeper PR egg

🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.

Where did the egg go?
  • The egg game starts only after the PR passes the real-behavior proof check.
  • Before that, no creature or rarity is rolled. The treat waits for real proof.
  • This is still just collectible flavor: proof affects review readiness, not creature quality.

@galiniliev galiniliev force-pushed the bug-033-queue-pressure-backoff branch from 80286e6 to 9057559 Compare May 25, 2026 01:21
@galiniliev

Contributor Author

Verification for 9057559:

Local:

  • node scripts/run-vitest.mjs src/logging/diagnostic.test.ts src/gateway/server-chat.agent-events.test.ts - passed, 2 files / 121 tests
  • git diff --check upstream/main...HEAD - passed

CI:

  • Real behavior proof - passed, run 26378360356 / job 77642747291
  • Critical Quality (session-diagnostics-boundary) - passed, run 26378360896 / job 77642759353
  • Critical Quality (network-runtime-boundary) - passed, run 26378360896 / job 77642759330
  • checks-node-agentic-gateway-core - passed, run 26378360871 / job 77643274162
  • checks-node-agentic-agents - passed on rerun, run 26378360871 / job 77643274261
  • Broad PR CI otherwise green/skipped.

Known proof gap:

  • No fresh live CPU-saturation gateway run was rerun; landing accepts the focused diagnostic/gateway seam coverage and the intentional policy that non-terminal session tool mirrors may be dropped during diagnostic queue pressure while terminal and run-scoped tool events still flow.

@galiniliev galiniliev merged commit 42bdc94 into openclaw:main May 25, 2026
158 of 159 checks passed
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 25, 2026
…84846)

Co-authored-by: Galin Iliev <Galin.Iliev@microsoft.com>
steipete pushed a commit that referenced this pull request May 25, 2026
Co-authored-by: Galin Iliev <Galin.Iliev@microsoft.com>
SebTardif pushed a commit to SebTardif/openclaw that referenced this pull request May 26, 2026
…84846)

Co-authored-by: Galin Iliev <Galin.Iliev@microsoft.com>
jameslcowan pushed a commit to jameslcowan/openclaw that referenced this pull request Jun 2, 2026
…84846)

Co-authored-by: Galin Iliev <Galin.Iliev@microsoft.com>
SYU8384 pushed a commit to SYU8384/openclaw that referenced this pull request Jun 3, 2026
…84846)

Co-authored-by: Galin Iliev <Galin.Iliev@microsoft.com>
sablehead pushed a commit to sablehead/openclaw that referenced this pull request Jun 10, 2026
…84846)

Co-authored-by: Galin Iliev <Galin.Iliev@microsoft.com>

Labels

  • gateway: Gateway runtime
  • maintainer: Maintainer-authored PR
  • merge-risk: 🚨 message-delivery: 🚨 May drop, duplicate, misroute, suppress, or wrongly target messages.
  • P2: Normal backlog priority with limited blast radius.
  • rating: 🦪 silver shellfish: Thin PR readiness signal; proof, validation, or implementation needs work.
  • size: S
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Gateway scheduler keeps work queued while CPU is saturated

1 participant