[Bug]: ReplyRunAlreadyActiveError fires every other gateway-WS chat call, causing 50% reply failure on 2026.5.3 #77485

@bws14email

Description

Bug type

Regression (worked before, now fails)

Beta release blocker

No

Summary

Every other WebSocket chat request through the gateway fails with ReplyRunAlreadyActiveError, causing 50% reply failure; reproduces deterministically on 2026.5.3 and works fine on 2026.4.26.

Steps to reproduce

  1. Install OpenClaw 2026.5.3 side-by-side with a working 2026.4.26 install.

  2. Switch the active version to 2026.5.3 and start the gateway in embedded mode.

  3. Send 10 sequential chat requests through the gateway WebSocket path, each waiting for the prior to return:

    for i in 1 2 3 4 5 6 7 8 9 10; do
      curl -s -X POST http://127.0.0.1:18789/chat.send \
        -H "Content-Type: application/json" \
        -d "{\"sessionKey\":\"agent:test:main\",\"message\":\"Reply containing the literal text: ok-$i-$(date +%s)\"}"
      sleep 1
    done

  4. Inspect the gateway error log; ReplyRunAlreadyActiveError fires 20+ times across the probe.

Expected behavior

All 10 sequential chat requests reach the LLM and return real replies, matching the 2026.4.26 behavior where the same workload produces ~1.2-1.7s warm replies on every call with zero ReplyRunAlreadyActiveError events in the gateway log.

Actual behavior

Alternating fast-fail / pass pattern. 10-call probe wall-clock timings:

call 1: 346ms FAIL — empty/canned reply, no LLM round-trip
call 2: 1321ms pass — real reply
call 3: 330ms FAIL
call 4: 1502ms pass
call 5: 339ms FAIL
call 6: 1238ms pass
call 7: 330ms FAIL
call 8: 1447ms pass
call 9: 330ms FAIL
call 10: 1552ms pass

Gateway error log shows 26 occurrences of:
followup queue drain failed for agent:test:main: ReplyRunAlreadyActiveError: Reply run already active for agent:test:main

Failed calls return in 330-346ms, well below the 1.2-1.6s of a real provider round-trip, which indicates the gateway throws before any LLM call is dispatched.
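
The alternating pattern can be checked mechanically from the timings above. The following is a minimal sketch, not part of OpenClaw: `classifyProbe`, `isStrictlyAlternating`, and the 600 ms cutoff are assumptions chosen for illustration, with the cutoff sitting between the ~330 ms failures and the ~1.2 s passes seen in this probe.

```typescript
// Hypothetical helpers, not part of OpenClaw: classify probe timings.
// Assumption: a call faster than `cutoffMs` never reached the LLM.
type Verdict = "fail" | "pass";

function classifyProbe(timingsMs: number[], cutoffMs: number): Verdict[] {
  return timingsMs.map((ms) => (ms < cutoffMs ? "fail" : "pass"));
}

// True when every verdict differs from the one before it.
function isStrictlyAlternating(verdicts: Verdict[]): boolean {
  return verdicts.every((v, i) => i === 0 || v !== verdicts[i - 1]);
}

// Wall-clock timings (ms) from the 10-call probe in this report.
const timings = [346, 1321, 330, 1502, 339, 1238, 330, 1447, 330, 1552];
const verdicts = classifyProbe(timings, 600);
const failureRate =
  verdicts.filter((v) => v === "fail").length / verdicts.length;

console.log(verdicts.join(" "));
console.log(
  `alternating=${isStrictlyAlternating(verdicts)} failureRate=${failureRate}`
);
```

The same two functions could be pointed at timings from a fix candidate; a healthy run should classify every call as a pass.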

OpenClaw version

2026.5.3 (commit 06d46f7); regression vs 2026.4.26 (commit be8c246) which works correctly

Operating system

Ubuntu 24.04 noble (Linux x64), kernel 6.8

Install method

Custom side-by-side installer running: npm install --production --legacy-peer-deps --ignore-scripts; node_modules/openclaw -> ../ self-symlink applied post-install

Model

gemini/gemini-2.5-flash

Provider / routing chain

custom relay -> gateway WebSocket -> openclaw -> openai-completions -> local Gemini proxy (127.0.0.1:19990) -> Google Gemini API

Additional provider/model setup details

  • Gemini provider config points at http://127.0.0.1:19990/v1beta/openai (a local proxy that forwards to Google Gemini), with allowPrivateNetwork: true.
  • Fallback: ollama gemma4-chat (not exercised in this probe).
  • Single local plugin loaded via plugins.load.paths from a workspace path (50 tools, registered via api.registerTool).
  • For 5.3, the plugin manifest was patched to declare contracts.tools per the new schema (verified from bundled file-transfer and memory-wiki manifests). After patching, the "[plugins] plugin must declare contracts.tools" error is gone and tools.allow allowlist consistency passes, but the ReplyRunAlreadyActiveError pattern is unchanged — confirming the two bugs are independent.
  • Three per-tenant agents in agents.list, each with its own tools.allow allowlist of all 50 plugin tool names.
  • Gateway runs in embedded mode under PM2 (not systemd), single process bound to 127.0.0.1:18789.
  • Config relevant to the reply-run path: agent uses thinkingDefault "off"; no streaming.* keys; no tools.profile override.

Logs, screenshots, and evidence

Gateway error log excerpt (26 identical occurrences across one 10-call probe):

  followup queue drain failed for agent:test:main: ReplyRunAlreadyActiveError: Reply run already active for agent:test:main
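
To count these events per session key without reading the log by eye, something like the following may help. It is a sketch: `countAlreadyActive` is a hypothetical helper, the regular expression assumes the line format shown in the excerpt above, and `sampleLog` is constructed for illustration rather than copied from the real gateway log.

```typescript
// Hypothetical helper: count ReplyRunAlreadyActiveError lines per session key.
// Assumption: each event contains "Reply run already active for <sessionKey>".
function countAlreadyActive(logText: string): Map<string, number> {
  const pattern =
    /ReplyRunAlreadyActiveError: Reply run already active for (\S+)/g;
  const counts = new Map<string, number>();
  let match: RegExpExecArray | null;
  // exec() with the g flag advances through the text one match at a time.
  while ((match = pattern.exec(logText)) !== null) {
    const sessionKey = match[1];
    counts.set(sessionKey, (counts.get(sessionKey) || 0) + 1);
  }
  return counts;
}

// Illustrative input built from the excerpt's line format.
const sampleLine =
  "followup queue drain failed for agent:test:main: " +
  "ReplyRunAlreadyActiveError: Reply run already active for agent:test:main";
const sampleLog = [sampleLine, sampleLine, "unrelated line"].join("\n");

console.log(countAlreadyActive(sampleLog));
```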

Suspected source — dist/run-state-B5YH0TzQ.js:
  if (replyRunState.activeRunsByKey.has(sessionKey))
    throw new ReplyRunAlreadyActiveError(sessionKey);

Plugin SDK type definition — dist/plugin-sdk/src/auto-reply/reply/reply-run-registry.d.ts:
  export type ReplyOperationPhase =
    "queued" | "preflight_compacting" | "memory_flushing" |
    "running" | "completed" | "failed" | "aborted";

forceClearReplyRunBySessionId is exported from dist/run-state-B5YH0TzQ.js — usable as an external workaround, but its existence suggests the intended cleanup path inside begin()/run-completion isn't always reaching terminal phases on the embedded WebSocket path.
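
To make the suspected failure mode concrete, here is a minimal model of a per-session run guard. It is not OpenClaw's implementation: `ToyRunRegistry`, `runReply`, and the choice of which phases count as terminal are assumptions for illustration. Only the `has(sessionKey)` guard and the phase names come from the snippets above. The sketch shows that a run which exits without releasing its key makes the next `begin()` for that session throw, and that releasing in a `finally` block avoids this.

```typescript
type ReplyOperationPhase =
  | "queued" | "preflight_compacting" | "memory_flushing"
  | "running" | "completed" | "failed" | "aborted";

// Assumption: these three phases are the terminal ones.
const TERMINAL_PHASES: ReplyOperationPhase[] = ["completed", "failed", "aborted"];

function isTerminal(phase: ReplyOperationPhase): boolean {
  return TERMINAL_PHASES.indexOf(phase) !== -1;
}

class ReplyRunAlreadyActiveError extends Error {
  constructor(sessionKey: string) {
    super(`Reply run already active for ${sessionKey}`);
    this.name = "ReplyRunAlreadyActiveError";
  }
}

// Toy stand-in for the real registry; it models only the guard.
class ToyRunRegistry {
  private activeRunsByKey = new Map<string, ReplyOperationPhase>();

  begin(sessionKey: string): void {
    if (this.activeRunsByKey.has(sessionKey)) {
      throw new ReplyRunAlreadyActiveError(sessionKey);
    }
    this.activeRunsByKey.set(sessionKey, "running");
  }

  // The key is released only when the run reaches a terminal phase.
  finish(sessionKey: string, phase: ReplyOperationPhase): void {
    if (isTerminal(phase)) this.activeRunsByKey.delete(sessionKey);
  }
}

// Release in `finally` so a thrown error still clears the key.
function runReply(
  registry: ToyRunRegistry,
  sessionKey: string,
  work: () => string
): string {
  registry.begin(sessionKey);
  let phase: ReplyOperationPhase = "failed";
  try {
    const reply = work();
    phase = "completed";
    return reply;
  } finally {
    registry.finish(sessionKey, phase);
  }
}

const registry = new ToyRunRegistry();
const key = "agent:test:main";

// Leaked run: begin() without a matching finish().
registry.begin(key);
let leakedError = "";
try {
  registry.begin(key);
} catch (e) {
  leakedError = (e as Error).name;
}

// Manual cleanup, then two sequential runs through the guarded path.
registry.finish(key, "aborted");
const first = runReply(registry, key, () => "ok-1");
const second = runReply(registry, key, () => "ok-2");
console.log(leakedError, first, second);
```

This model does not explain why the failures alternate rather than persist; it only illustrates why a missed terminal transition on one path would surface as this error on a later request for the same session.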

Comparison run on 2026.4.26 with the identical config: same 10-call probe completes with all replies present, average warm latency 1.2-1.7s, zero ReplyRunAlreadyActiveError events in gateway log.

I can attach: full gateway log around the affected window, sanitized openclaw.json, and the patched plugin manifest. Let me know format preference.

Impact and severity

Affected: any operator using the gateway WebSocket chat.send path with sequential per-session requests on 2026.5.3.
Severity: High — blocks chat workflow (50% reply failure).
Frequency: Always — deterministic alternating pattern reproduced 4/4 times across separate gateway restarts.
Consequence: Production WebSocket-based agents are unusable on 2026.5.3 — every other user message fails before reaching the LLM. Forces rollback for any production deployment.

Additional information

Last known good version: 2026.4.26 (commit be8c246). First known bad version: 2026.5.3 (commit 06d46f7). Did not test 2026.4.27 / 2026.4.29 — installed side-by-side but never switched.

Pre-switch sandbox checks (all passed against 2026.5.3):

  • Standalone catalog smoke (HOME=... node openclaw.mjs models): clean, no import errors, gemini + ollama providers configured.
  • Sandbox doctor (read-only against copied config): only flagged the legacy plugins.entries.codex.enabled=false entry as stale (informational warning, codex extension removed in 5.3).
  • Sandbox doctor --fix + diff vs original: removed only the stale codex entry; cosmetic JSON reformatting; metadata updates (lastRunAt, lastRunVersion). No destructive mutations.

The plugin-side blocker ("[plugins] plugin must declare contracts.tools before registering agent tools") was resolved by patching our plugin manifest with a contracts.tools array of all 50 tool names. After that fix the plugin registers cleanly and the 50-tool agent allowlist passes consistency. ReplyRunAlreadyActiveError persists with or without that fix — confirming the two issues are independent.
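
For anyone patching a manifest the same way, the consistency between the two lists can be sanity-checked before starting the gateway. The sketch below is an illustration under assumptions, not OpenClaw's validator: `findUndeclaredTools` and the tool names are hypothetical, and the only idea taken from this report is that every name in `tools.allow` should also appear in `contracts.tools`.

```typescript
// Hypothetical check: which allowlisted tools are missing from the manifest?
function findUndeclaredTools(
  contractTools: string[],
  allowList: string[]
): string[] {
  const declared = new Set(contractTools);
  return allowList.filter((name) => !declared.has(name));
}

// Placeholder names, not the real plugin's tools.
const contractTools = ["tool_a", "tool_b", "tool_c"];
const allowList = ["tool_a", "tool_c", "tool_d"];

console.log(findUndeclaredTools(contractTools, allowList));
```

An empty result means the allowlist only names declared tools.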

Workaround in production: rolled back to 2026.4.26.

The repro environment is available on demand. Happy to test a fix candidate from a branch; a deterministic 30-second test loop is available.

Metadata

Assignees: No one assigned

Labels: bug (Something isn't working), regression (Behavior that previously worked and now fails)