Session lane starvation: followup drain monopolizes session lane, blocks inbound dispatch for 20-30min #54488

@Flakedict

Bug: Followup drain monopolizes session lane — causes indefinite inbound dispatch stall

Version: 2026.3.23-2 (not present in 2026.3.13)

Symptoms

  • After every agent turn, new inbound Discord DMs and WhatsApp messages are silently queued for 20-30 minutes before processing
  • sessions_send does not fix it (queues behind same backlog)
  • Only SIGUSR1 (gateway restart / resetAllLanes()) resolves it immediately
  • 100% reproducible on any session with active followup queue + compaction-heavy context

Root Cause

scheduleFollowupDrain (in pi-embedded-CbCYZxIb.js:94509) starts an unbounded async loop after every turn. Each queued item (system events, subagent announces, WhatsApp reconnects) calls runEmbeddedPiAgent → enqueueSession(() => enqueueGlobal(...)), which holds the session lane (maxConcurrent: 1) for the full turn duration, including compaction and context engine maintenance. New user messages queue behind all followup turns, with no preemption.
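
To make the starvation concrete, here is a minimal model of a FIFO lane with maxConcurrent: 1. It is a sketch under stated assumptions, not the OpenClaw implementation: simulateLane, the task shape, and the 4-minute followup duration are all hypothetical and only illustrate how a user message inherits the full backlog ahead of it.

```javascript
// Hypothetical model of a FIFO lane with maxConcurrent: 1.
// simulateLane and the task fields are illustrative names, not OpenClaw APIs.
function simulateLane(tasks) {
  let clockMs = 0; // virtual time; the lane runs one task at a time
  return tasks.map((task) => {
    // A task cannot start before it is enqueued or before the lane is free.
    const startMs = Math.max(clockMs, task.enqueuedAtMs);
    clockMs = startMs + task.durationMs;
    return { kind: task.kind, waitedMs: startMs - task.enqueuedAtMs };
  });
}

const MIN = 60_000;
// Assumed workload: five followup turns of 4 minutes each (turn + compaction
// + maintenance), all queued at t=0, then a user message one minute later.
const queue = [
  ...Array.from({ length: 5 }, () => ({
    kind: "followup",
    enqueuedAtMs: 0,
    durationMs: 4 * MIN,
  })),
  { kind: "user", enqueuedAtMs: 1 * MIN, durationMs: 1 * MIN },
];

const results = simulateLane(queue);
const userWait = results.find((r) => r.kind === "user").waitedMs;
console.log(`user message waited ${userWait / MIN} min`);
// user message waited 19 min
```

With no preemption, the user's wait is the sum of every followup turn ahead of it, so longer turns (compaction, memory flush) or more queued events push it toward the 20-30 minute range reported here.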

Observed lane wait times

From diagnostic logs on session:agent:main:main:

| Time (Mar 25) | Wait (ms) | Wait (min) |
|---------------|-----------|------------|
| 10:28         | 1,361,250 | ~22 min    |
| 13:42         | 1,814,033 | ~30 min    |
| 13:49         | 1,237,693 | ~20 min    |

Log pattern: lane wait exceeded: waitedMs=1814033 queueAhead=1
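
For anyone triaging their own logs, a small helper along these lines could pull the wait time out of lines matching the pattern above. parseLaneWait is a hypothetical name, and the regex assumes only the two fields shown in the quoted line; real log lines may carry extra prefixes or fields.

```javascript
// Parse the "lane wait exceeded" line format quoted above.
// parseLaneWait is a hypothetical helper, not part of OpenClaw.
function parseLaneWait(line) {
  const match = /lane wait exceeded: waitedMs=(\d+) queueAhead=(\d+)/.exec(line);
  if (!match) return null; // line does not match the pattern
  const waitedMs = Number(match[1]);
  return {
    waitedMs,
    queueAhead: Number(match[2]),
    // Minutes, rounded to one decimal place.
    waitedMin: Math.round((waitedMs / 60_000) * 10) / 10,
  };
}

const parsed = parseLaneWait("lane wait exceeded: waitedMs=1814033 queueAhead=1");
console.log(parsed);
// { waitedMs: 1814033, queueAhead: 1, waitedMin: 30.2 }
```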

Contributing factors

  • Compaction safeguard runs summarization API calls within the lane task (context chronically at 142% of budget, compactionCount: 14)
  • Memory flush adds a full LLM call per compaction cycle (compaction.memoryFlush.enabled: true)
  • Ollama heartbeat/cron timeouts (5 min each) consume global lane slots and trigger retry chains with 404 fallback failures
  • WhatsApp reconnects generate bursts of system events (observed: 152 disconnect/reconnect events in one day) that each get processed as individual followup turns

Setup context

  • session.dmScope: "main" (Discord DM + WhatsApp share main session)
  • 10 agents configured, multiple WhatsApp groups, Discord guild with ~15 channels
  • Heartbeat: every 55m (was on ollama/qwen2.5-14b-agent, timeouts blocked global lanes)
  • contextPruning.mode: "cache-ttl" (custom events at end of turn correlated with stall, but turning it off did not fix it — lane starvation is the real cause)

Suggested fixes

  1. Cap consecutive followup drain turns (e.g., max 3) before yielding to inbound queue
  2. Prioritize user messages over system events in the session lane
  3. Run context engine maintenance (afterTurn/maintain) OUTSIDE the session lane task — the lane should be released after clearActiveEmbeddedRun, not after post-turn cleanup
  4. Add a configurable aggregate timeout for the full session lane task (not just compaction retry)
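
Fixes 1 and 2 can be sketched as a turn-ordering policy. This is an illustration only: planTurns and its arguments are hypothetical names, and the model is static (all inbound messages are already waiting), whereas real messages arrive over time. It shows the intended ordering, not how scheduleFollowupDrain would actually be patched.

```javascript
// Hypothetical turn-ordering policy; planTurns and its arguments are
// illustrative names, not OpenClaw APIs.
function planTurns(followups, inbound, maxConsecutiveFollowups = 3) {
  const pendingFollowups = [...followups];
  const pendingInbound = [...inbound];
  const order = [];
  let streak = 0; // followup turns run since the last inbound turn

  while (pendingFollowups.length > 0 || pendingInbound.length > 0) {
    const capReached = streak >= maxConsecutiveFollowups;
    if (pendingInbound.length > 0 && (pendingFollowups.length === 0 || capReached)) {
      order.push(pendingInbound.shift()); // yield the lane to a user message
      streak = 0;
    } else {
      order.push(pendingFollowups.shift());
      streak += 1;
    }
  }
  return order;
}

const followups = ["f1", "f2", "f3", "f4", "f5", "f6", "f7"];
const inbound = ["u1", "u2"];

console.log(planTurns(followups, inbound, 3).join(" "));
// f1 f2 f3 u1 f4 f5 f6 u2 f7
console.log(planTurns(followups, inbound, 0).join(" "));
// u1 u2 f1 f2 f3 f4 f5 f6 f7
```

A cap of 3 bounds a user message's wait to three followup turns (fix 1). A cap of 0 degenerates to strict user-first priority (fix 2), at the cost of delaying system events while inbound traffic is sustained.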

Workarounds (config-level mitigations)

These reduce lane occupation time but do not fix the root cause:

{
  agents: {
    defaults: {
      compaction: { reserveTokensFloor: 20000, memoryFlush: { enabled: false } },
      heartbeat: { model: "anthropic/claude-haiku-4-5" },  // was ollama (5min timeouts)
    },
  },
  messages: { queue: { debounceMs: 5000 } },  // batch system events
}
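
The debounceMs setting is meant to batch bursts of system events, such as the WhatsApp reconnect storms. The sketch below shows why that helps, assuming trailing-debounce semantics (an event arriving less than debounceMs after the previous one joins the same batch). That semantic, the batchByDebounce name, and the sample timestamps are assumptions for illustration, not confirmed OpenClaw behavior.

```javascript
// Hypothetical illustration of debounce-style batching: events separated by
// less than `debounceMs` are merged into one batch (one followup turn).
function batchByDebounce(timestampsMs, debounceMs) {
  const batches = [];
  for (const t of [...timestampsMs].sort((a, b) => a - b)) {
    const current = batches[batches.length - 1];
    if (current && t - current[current.length - 1] < debounceMs) {
      current.push(t); // arrived before the debounce timer fired
    } else {
      batches.push([t]); // timer had already fired; start a new batch
    }
  }
  return batches;
}

// Made-up burst of six reconnect events (milliseconds).
const reconnectBurst = [0, 800, 1_900, 4_000, 12_000, 12_500];
const batches = batchByDebounce(reconnectBurst, 5000);
console.log(`${reconnectBurst.length} events -> ${batches.length} followup turns`);
// 6 events -> 2 followup turns
```

Fewer followup turns means less time holding the session lane, but each remaining turn still blocks inbound messages, which is why this only mitigates the problem.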

Reproduction

  1. Configure a session with dmScope: "main" and multiple active channels (WhatsApp + Discord)
  2. Enable compaction safeguard with memory flush
  3. Send a message → agent responds
  4. Wait 3-5 minutes, send again
  5. Message sits in lane queue for 20-30 minutes (or indefinitely with ollama timeouts)

Environment

  • macOS 12.7 (x64), Node v24.13.1
  • OpenClaw 2026.3.23-2 (npm global install)
  • Anthropic Claude Sonnet 4.6 (primary), Ollama Qwen 2.5 14B (heartbeat/crons)
  • Discord + WhatsApp + Telegram channels active

Labels

  • P1: High-priority user-facing bug, regression, or broken workflow.
  • clawsweeper:fix-shape-clear: ClawSweeper found a clear likely implementation shape for this issue.
  • clawsweeper:needs-maintainer-review: ClawSweeper marked this issue as needing maintainer review before automation.
  • clawsweeper:needs-product-decision: ClawSweeper marked this issue as needing a product or behavior decision.
  • clawsweeper:no-new-fix-pr: ClawSweeper does not recommend queueing a new automated fix PR for this issue.
  • clawsweeper:source-repro: ClawSweeper found a high-confidence source-level issue reproduction.
  • impact:message-loss: Channel message delivery can be lost, duplicated, or misrouted.
  • impact:session-state: Session, memory, transcript, context, or agent state can drift or corrupt.
  • issue-rating: 🦞 diamond lobster: Very strong issue quality with high-confidence source-level or clear reproduction.
  • stale: Marked as stale due to inactivity.
