Event-loop starvation during context compaction causes fetch timeouts (16.9s timer delay) #86358

@Mithril1991

Description

Summary

During context-overflow auto-compaction, the Node.js event loop stalls for ~17 seconds, causing pending fetch operations (e.g. Telegram API calls) to time out, with a 10s timeout only firing after ~27s. This is consistent with synchronous, CPU-bound work blocking the event loop during compaction.

Environment

  • openclaw version: latest npm (npm info openclaw: git+https://github.com/openclaw/openclaw.git)
  • Node.js: 22.22.0
  • Provider: openai-codex/gpt-5.5
  • Platform: Ubuntu 22.04

Observed sequence

06:31:07 WARN [context-overflow-diag]
  sessionKey=agent:main:telegram:group:...:topic:539
  provider=openai-codex/gpt-5.5
  messages=246
  compactionAttempts=0
  error=Context overflow: estimated context size exceeds safe threshold during tool loop

06:31:07 WARN context overflow detected (attempt 1/3); attempting auto-compaction for openai-codex/gpt-5.5

06:32:13 WARN [fetch-timeout]
  timeoutMs=10000
  elapsedMs=26963
  timerDelayMs=16963
  eventLoopDelayHint="timer delayed 16963ms, likely event-loop starvation"
  operation=fetchWithTimeout
  url=https://api.telegram.org/bot.../getMe

06:33:33 INFO auto-compaction succeeded for openai-codex/gpt-5.5; retrying prompt
06:33:33 INFO post-compaction guard armed for 3 attempts

The timerDelayMs=16963 in your own [fetch-timeout] log indicates the timeout callback ran about 17s late: the 10s fetch timer could not fire until 26.96s had elapsed (10000 + 16963 = 26963 ms). That points to the event loop being blocked during compaction.
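The mechanism can be reproduced in isolation. This is a minimal sketch, not OpenClaw code: `measureTimerDelay` is a hypothetical helper, the busy-wait stands in for whatever synchronous work compaction does, and it assumes `timerDelayMs` is simply `elapsedMs - timeoutMs` (which matches the numbers in the log above, but the actual computation is not confirmed here).

```typescript
// Hypothetical repro: arm a timer, then block the event loop synchronously.
// The timer callback cannot run until the synchronous work returns.
export function measureTimerDelay(
  timeoutMs: number,
  blockMs: number,
): Promise<{ elapsedMs: number; timerDelayMs: number }> {
  return new Promise((resolve) => {
    const start = Date.now();

    setTimeout(() => {
      const elapsedMs = Date.now() - start;
      // Assumed relationship between the logged fields.
      resolve({ elapsedMs, timerDelayMs: elapsedMs - timeoutMs });
    }, timeoutMs);

    // Busy-wait standing in for CPU-heavy, synchronous compaction work.
    while (Date.now() - start < blockMs) {
      /* spin */
    }
  });
}
```

Calling `measureTimerDelay(50, 200)` should report an `elapsedMs` of at least 200 even though the timer was set for 50 ms, which is the same pattern as `timeoutMs=10000` with `elapsedMs=26963`.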

Cascading effect

After compaction the agent resumed but then ran two web search tool calls that both hit MCP -32001 timeout:

ERROR [tools] kindly-search__web_search failed: MCP error -32001: Request timed out
ERROR [tools] kindly-search__web_search failed: MCP error -32001: Request timed out

These may also be caused by the event loop being saturated post-compaction, or by MCP server state after the stall.

Expected behaviour

Compaction should not block the event loop. If it involves heavy JSON serialisation / summarisation API calls, those should be done in a worker thread or with setImmediate yields so pending timers can fire normally.
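As an illustration of the setImmediate approach, here is a rough sketch that processes items in slices and yields between slices. `mapInChunks` and its `chunkSize` parameter are hypothetical names; which compaction step would be chunked this way, and what a sensible slice size is, are assumptions that need profiling to confirm.

```typescript
import { setImmediate as yieldToEventLoop } from "node:timers/promises";

// Hypothetical helper: apply `fn` to every item, handing control back to the
// event loop after each slice so expired timers and I/O callbacks can run.
export async function mapInChunks<T, R>(
  items: readonly T[],
  fn: (item: T) => R,
  chunkSize = 100,
): Promise<R[]> {
  const out: R[] = [];
  for (let i = 0; i < items.length; i += chunkSize) {
    const end = Math.min(i + chunkSize, items.length);
    for (let j = i; j < end; j++) {
      out.push(fn(items[j]));
    }
    // Resolves in the event loop's check phase; pending timers get a turn
    // before the next slice starts.
    await yieldToEventLoop();
  }
  return out;
}
```

Usage might look like `await mapInChunks(messages, estimateTokens, 25)`, where `estimateTokens` is a placeholder for a per-message cost function. Note that this only helps if each individual call to `fn` is short; a single large `JSON.stringify` still blocks for its full duration.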

Suggested fix direction

  • Move compaction's CPU-heavy phase (token counting, session serialisation, summarisation request) to a worker thread, or split it into chunks that yield to the event loop with setImmediate
  • Alternatively, drain and re-arm pending fetch timeouts after compaction completes
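For the worker-thread option in the first bullet, a minimal sketch using Node's `worker_threads` module follows. `serialiseOffThread` is a hypothetical name, the worker body is an inline string only to keep the example in one file, and the serialisation step is a stand-in since the actual CPU-heavy phase has not been identified.

```typescript
import { Worker } from "node:worker_threads";

// Inline worker body (evaluated as a CommonJS script). A real project would
// more likely point the Worker at a separate module file.
const workerSource = `
  const { parentPort, workerData } = require("node:worker_threads");
  // Stand-in for the CPU-heavy phase: serialise the messages, report the size.
  const serialised = JSON.stringify(workerData.messages);
  parentPort.postMessage({ bytes: Buffer.byteLength(serialised) });
`;

// Hypothetical helper: run the serialisation on a worker thread so the main
// event loop stays free to fire timers.
export function serialiseOffThread(
  messages: unknown[],
): Promise<{ bytes: number }> {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, {
      eval: true,
      workerData: { messages },
    });
    worker.once("message", resolve);
    worker.once("error", reject);
  });
}
```

One caveat with this design: `workerData` is copied to the worker with the structured clone algorithm, and that copy runs synchronously on the main thread. For very large sessions it may be better to have the worker read the session from disk itself.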

Impact

In our setup (agent-chat-telegram orchestrator driving OpenClaw as a subprocess), the stall causes the orchestrator's own 300s timeout to eventually fire and terminate the OpenClaw call, surfacing as a generic failure to the end user.

Metadata

Labels

  • P1: High-priority user-facing bug, regression, or broken workflow.
  • clawsweeper:needs-live-repro: ClawSweeper needs live local, crabbox, or manual validation to confirm this issue.
  • clawsweeper:needs-maintainer-review: ClawSweeper marked this issue as needing maintainer review before automation.
  • clawsweeper:needs-product-decision: ClawSweeper marked this issue as needing a product or behavior decision.
  • clawsweeper:no-new-fix-pr: ClawSweeper does not recommend queueing a new automated fix PR for this issue.
  • impact:crash-loop: Crash, hang, restart loop, or process-level availability failure.
  • impact:message-loss: Channel message delivery can be lost, duplicated, or misrouted.
  • impact:session-state: Session, memory, transcript, context, or agent state can drift or corrupt.
  • issue-rating: 🐚 platinum hermit: Good issue quality with a plausible reproduction path needing some confirmation.
