[Bug]: enqueueSystemEvent not deduplicated by runId/contextKey — agents cascade duplicate exec approval prompts under new IDs, locking ecosystem #69478

@reidperyam

Bug type

Regression (worked before, now fails)

Beta release blocker

No

Summary

Under load, enqueueSystemEvent does not deduplicate queued exec approval requests by runId or contextKey. When a heartbeat run times out and the gateway fails over, the replacement attempt re-queues the same exec call with a fresh approval ID. Each retry surfaces a new Telegram approval prompt for the identical command, cascading until the operator kills the gateway. Left alone, it saturates the approval channel fast enough to risk system-level memory pressure.

Reproduced reliably on a multi-agent install. Filing now so it can be fixed before users with directPolicy: "allow" + high-frequency heartbeats discover it the hard way.

Steps to reproduce

What the exec call is

Routine health-check probe issued from Maelcum's heartbeat:

ps aux | grep -E "contextstored|vllm|openclaw" | grep -v grep | awk '{print $11}' | sort -n | tail -5

The command is not on the current allowlist, so it falls under ask: "on-miss" and an approval prompt is expected on first encounter. The bug is that the prompt fires again, and again, and again, each time under a new approval ID, for the same run intent.

Not a duplicate of

I looked for upstream issues that might cover this and found three that are adjacent but distinct:

None of these address the approval-event retry path or the (runId, contextKey) dedup gap.

Workaround in place

  • All 11 agent heartbeats set to every: "999h" (circuit breaker)
  • No agent work resumes on a normal schedule until this is fixed or a dedup workaround exists at the exec-approvals layer

Related bug (filing separately)

Telegram /approve allow-always writes a source field into the approvals allowlist entry that openclaw approvals set --file then rejects as unexpected on push. Will cross-reference the issue once filed.

Expected behavior

Either:

  1. enqueueSystemEvent deduplicates queued exec approval events by (agentId, contextKey) or (runId, contextKey), coalescing retries into the already-pending prompt; or
  2. When a run fails over, any exec approval events it queued are cancelled before the replacement run is allowed to enqueue new ones.

Today, neither happens.
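
A minimal sketch of option 1, to make the requested behavior concrete. Every name here (ExecApprovalEvent, DedupingApprovalQueue, keyOf, resolve) is hypothetical and not taken from the OpenClaw source; the real enqueueSystemEvent signature is not shown in this report. The sketch keys on (agentId, contextKey) because the log shows retries arriving under fresh runIds, so a key that includes runId would likely not match across retries.

```typescript
// Hypothetical types and names: none of these come from the OpenClaw source.
interface ExecApprovalEvent {
  approvalId: string;
  agentId: string;
  runId: string;
  contextKey: string;
  command: string;
}

class DedupingApprovalQueue {
  // Pending approvals indexed by the compound (agentId, contextKey) key.
  private pending = new Map<string, ExecApprovalEvent>();

  // Serializing the tuple as JSON avoids collisions that a plain
  // delimiter join could produce if an ID contained the delimiter.
  private static keyOf(e: Pick<ExecApprovalEvent, "agentId" | "contextKey">): string {
    return JSON.stringify([e.agentId, e.contextKey]);
  }

  // Returns the event that should be surfaced to the operator. If an
  // approval for the same key is already pending, that one is returned
  // and no new prompt is created.
  enqueue(event: ExecApprovalEvent): { event: ExecApprovalEvent; coalesced: boolean } {
    const key = DedupingApprovalQueue.keyOf(event);
    const existing = this.pending.get(key);
    if (existing) {
      return { event: existing, coalesced: true };
    }
    this.pending.set(key, event);
    return { event, coalesced: false };
  }

  // Called once the operator approves or denies, so that a later
  // request for the same key prompts again.
  resolve(agentId: string, contextKey: string): boolean {
    return this.pending.delete(DedupingApprovalQueue.keyOf({ agentId, contextKey }));
  }

  get size(): number {
    return this.pending.size;
  }
}

// Usage: a retry under a fresh runId and approval ID collapses into the
// prompt that is already pending.
const queue = new DedupingApprovalQueue();
const first = queue.enqueue({
  approvalId: "a-1", agentId: "maelcum", runId: "run-1",
  contextKey: "exec:health-probe", command: "ps aux",
});
const retry = queue.enqueue({
  approvalId: "a-2", agentId: "maelcum", runId: "run-2",
  contextKey: "exec:health-probe", command: "ps aux",
});
console.log(first.coalesced, retry.coalesced, retry.event.approvalId, queue.size);
// prints: false true a-1 1
```

The operator sees one prompt (a-1) regardless of how many failover attempts re-enter the queue. Whether contextKey is stable across retries in the real gateway is an assumption this sketch depends on.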

bug-30-log-excerpt-clean.txt

Image

Actual behavior

Observed behavior

A continuous stream of approval prompts delivered to Telegram seconds apart, each for the identical command under a different approval ID:

  • befadc79-10bd-4e78-b1a4-9e2f546fd3c5
  • 871d7305-c1cc-412c-9393-d538e99e4ae1
  • etc.

Screenshot attached below.

Gateway log (/tmp/openclaw/openclaw-2026-04-18.log) shows the cascade signature (excerpt attached):

  • stuck session: sessionId=maelcum sessionId=<uuid> sessionKey=agent:maelcum:telegram:direct:<user_id> — age ticking up by ~30s per line, crossing 462s before intervention
  • embedded_run_failover_decision failoverReason=timeout, cycling through the provider chain: vllm-fast → vllm-brain → openrouter/z-ai/glm-5
  • Heartbeat re-firing and regenerating the run under fresh runIds while the prior attempt is still pending approval

Each failover attempt re-enters enqueueSystemEvent carrying the same exec call, but the event queue has no compound key covering the (runId, contextKey) pair — so the prior queued approval does not cancel or collapse, and a new one is enqueued instead.
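
The alternative fix (cancel on failover) can be sketched as follows. This is an illustration under stated assumptions, not the gateway's actual code: ApprovalRegistry, cancelRun, and failover are hypothetical names, and the sketch assumes the gateway can tell which runId queued each pending approval.

```typescript
// Hypothetical sketch: names are illustrative, not OpenClaw APIs.
interface PendingApproval {
  approvalId: string;
  runId: string;
  command: string;
}

class ApprovalRegistry {
  // Pending approvals grouped by the run that queued them.
  private byRun = new Map<string, PendingApproval[]>();

  enqueue(approval: PendingApproval): void {
    const list = this.byRun.get(approval.runId) ?? [];
    list.push(approval);
    this.byRun.set(approval.runId, list);
  }

  // Drops everything the given run queued and returns the cancelled
  // approval IDs so the caller can retract the matching prompts.
  cancelRun(runId: string): string[] {
    const ids = (this.byRun.get(runId) ?? []).map((a) => a.approvalId);
    this.byRun.delete(runId);
    return ids;
  }

  pendingCount(): number {
    let n = 0;
    this.byRun.forEach((list) => {
      n += list.length;
    });
    return n;
  }
}

// Failover hook: cancel the failed run's approvals first, and only then
// let the replacement run enqueue its own.
function failover(
  registry: ApprovalRegistry,
  failedRunId: string,
  replacement: PendingApproval,
): string[] {
  const cancelledIds = registry.cancelRun(failedRunId);
  registry.enqueue(replacement);
  return cancelledIds;
}

// Usage: run-1 times out while its approval is pending; run-2 replaces it.
const registry = new ApprovalRegistry();
registry.enqueue({ approvalId: "a-1", runId: "run-1", command: "ps aux" });
const retracted = failover(registry, "run-1", {
  approvalId: "a-2", runId: "run-2", command: "ps aux",
});
console.log(retracted, registry.pendingCount());
```

With this ordering the number of pending approvals stays at one per live run instead of growing with each failover. The operator would still see a new prompt per retry unless the stale one is also retracted from the channel, which is why coalescing by key may be the less noisy option.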

OpenClaw version

2026.4.14 (323493f)

Operating system

macOS 26.4.1

Install method

npm global, latest stable as of filing

Model

mlx-community/Qwen3.5-9B-OptiQ-4bit (local, via rapid-mlx 0.3.12)

Provider / routing chain

openclaw -> vllm-fast (localhost:8001, rapid-mlx 0.3.12) -> Qwen3.5-9B-OptiQ-4bit

Additional provider/model setup details

Environment

  • Host: macOS, Mac Mini M4 Pro, 48 GB unified memory
  • Gateway: launchd-supervised, loopback bind, port 18789
  • Heartbeat: every: "3h", directPolicy: "allow", target: "telegram", lightContext: true
  • Exec approval policy: defaults.security: "allowlist", ask: "on-miss", askFallback: "deny"; maelcum uses host defaults

Logs, screenshots, and evidence

Attached evidence

  1. Screenshot of two consecutive approval prompts with different IDs for the same command
  2. bug-30-log-excerpt.txt: 60 lines of the cascade from the gateway log

Impact and severity

Impact

  • Saturates the approval channel — every cascade cycle produces a new Telegram prompt
  • Fast enough to outrun manual intervention; forcing a gateway restart (openclaw gateway restart) is the only reliable stop
  • On installs with many agents sharing a channel, one stuck agent can drown all approval prompts for every other agent
  • Forced me to set all 11 heartbeats to every: "999h" as a circuit breaker while the bug is unresolved — effectively disabling the ecosystem's scheduled work layer

Additional information

No response

Metadata

Labels

  • P1 - High-priority user-facing bug, regression, or broken workflow.
  • bug - Something isn't working
  • clawsweeper:fix-shape-clear - ClawSweeper found a clear likely implementation shape for this issue.
  • clawsweeper:needs-maintainer-review - ClawSweeper marked this issue as needing maintainer review before automation.
  • clawsweeper:needs-product-decision - ClawSweeper marked this issue as needing a product or behavior decision.
  • clawsweeper:needs-security-review - ClawSweeper marked this issue as needing security-sensitive review.
  • clawsweeper:no-new-fix-pr - ClawSweeper does not recommend queueing a new automated fix PR for this issue.
  • clawsweeper:source-repro - ClawSweeper found a high-confidence source-level issue reproduction.
  • impact:message-loss - Channel message delivery can be lost, duplicated, or misrouted.
  • impact:security - Security boundary, credential, authz, sandbox, or sensitive-data risk.
  • impact:session-state - Session, memory, transcript, context, or agent state can drift or corrupt.
  • issue-rating: 🦞 diamond lobster - Very strong issue quality with high-confidence source-level or clear reproduction.
  • regression - Behavior that previously worked and now fails
