Bug type
Behavior bug (incorrect output/state without crash)
Beta release blocker
No
Summary
When params.abortSignal is already aborted before activeSession.prompt() is called (e.g. rapid consecutive messages with messages.queue.mode: "interrupt"), abortable() immediately rejects but the prompt() async chain has already started. The floating Promise creates a new Agent._runLoop() with a fresh abortController that nobody ever aborts, causing the Agent to loop indefinitely calling the LLM after the attempt has exited. Observed: 2617 LLM calls over 103 minutes from a single zombie run.
Steps to reproduce
- Configure messages.queue.mode: "interrupt" in openclaw.json
- Send a message to the agent
- Within < 1 second, send a second message (interrupt mode aborts the first run)
- Observe that the first run's Agent continues calling the LLM in the background after the attempt has exited
Alternatively, run the reproduction test below which uses a pre-aborted AbortSignal to simulate the same condition deterministically.
agent-zombie-loop.test.ts:
import { Agent, type AgentMessage } from "@mariozechner/pi-agent-core";
import type { Api, Message, Model } from "@mariozechner/pi-ai";
import { beforeEach, describe, expect, it } from "vitest";
import {
  createDefaultEmbeddedSession,
  getHoisted,
  resetEmbeddedAttemptHarness,
  testModel,
} from "./attempt.spawn-workspace.test-support.js";

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));
const mockModel = testModel as unknown as Model<Api>;

const mockTool = {
  name: "mock_tool",
  label: "Mock Tool",
  description: "mock",
  parameters: { type: "object" as const, properties: {} },
  execute: async () => ({ content: [{ type: "text" as const, text: "Aborted" }], details: {} }),
};

// Stream function that always answers with a tool call, mimicking the zombie's
// stopReason=toolUse pattern, and counts every LLM call via `tracker`.
function createToolUseStreamFn(tracker: { count: number }) {
  return async (_model: unknown, _context: unknown, options?: { signal?: AbortSignal }) => {
    tracker.count += 1;
    await sleep(5);
    if (options?.signal?.aborted) {
      const err = new Error("Request was aborted.");
      err.name = "AbortError";
      throw err;
    }
    const message = {
      role: "assistant" as const,
      content: [
        { type: "toolCall" as const, id: `call_${tracker.count}`, name: "mock_tool", arguments: {} },
      ],
      usage: {
        input: 70,
        output: 51,
        cacheRead: 0,
        cacheWrite: 0,
        totalTokens: 121,
        cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0, total: 0 },
      },
      stopReason: "toolUse" as const,
      timestamp: Date.now(),
    };
    return {
      [Symbol.asyncIterator]() {
        let done = false;
        return {
          async next() {
            if (!done) {
              done = true;
              return { done: false, value: { type: "done", message } };
            }
            return { done: true, value: undefined };
          },
        };
      },
      async result() {
        return message;
      },
    } as never;
  };
}

const hoisted = getHoisted();

describe("Agent zombie loop (upstream bug)", () => {
  beforeEach(() => {
    resetEmbeddedAttemptHarness();
  });

  it("bug: abort-before-prompt produces floating Promise, Agent loops after attempt exits", { timeout: 10_000 }, async () => {
    const tracker = { count: 0 };
    const agent = new Agent({
      initialState: { systemPrompt: "test", model: mockModel, tools: [mockTool] },
      streamFn: createToolUseStreamFn(tracker),
      convertToLlm: (msgs: AgentMessage[]): Message[] =>
        msgs.filter((m) => ["user", "assistant", "toolResult"].includes(m.role)) as Message[],
    });
    hoisted.createAgentSessionMock.mockResolvedValue({
      session: createDefaultEmbeddedSession({
        prompt: async (_session, prompt) => {
          agent.prompt(prompt).catch(() => {});
          await sleep(50);
        },
      }),
    });

    // A pre-aborted signal simulates the second message arriving before prompt() runs.
    const abortSignal = AbortSignal.abort(new Error("second message arrived"));
    const { runEmbeddedAttempt } = await import("./attempt.js");
    await runEmbeddedAttempt({
      sessionId: "zombie-test",
      sessionKey: "agent:main:main",
      sessionFile: "/tmp/zombie-test.jsonl",
      workspaceDir: "/tmp",
      agentDir: "/tmp",
      config: {},
      prompt: "first message",
      timeoutMs: 5_000,
      runId: "zombie-run",
      provider: "openai",
      modelId: "gpt-test",
      model: mockModel,
      authStorage: { getApiKey: async () => undefined } as never,
      modelRegistry: {} as never,
      thinkLevel: "off",
      senderIsOwner: true,
      disableMessageTool: true,
      abortSignal,
    });

    // If the bug is present, the zombie Agent keeps calling the LLM after the attempt returns.
    const countAtExit = tracker.count;
    await sleep(500);
    const countAfterWait = tracker.count;
    console.log(`LLM calls at exit=${countAtExit}, after 500ms=${countAfterWait}, delta=${countAfterWait - countAtExit}`);
    expect(countAfterWait).toBeGreaterThan(countAtExit);

    // Cleanup so the zombie does not outlive the test.
    agent.abort();
    agent.clearAllQueues?.();
    await agent.waitForIdle();
  });
});
Expected behavior
When a run is aborted (via interrupt mode, timeout, or RPC), the Agent should stop all LLM calls promptly. No floating Promises should outlive the attempt lifecycle.
Actual behavior
The Agent continues calling the LLM indefinitely after the attempt has returned. Each iteration: ~90k input tokens + ~35 output tokens, stopReason always toolUse, tools always throw AbortError (caught as error result), model retries the same tool call. Loop never terminates unless the process restarts.
OpenClaw version
All releases since v2026.1.20 (bug introduced in commit 016693a1f on 2026-01-18)
Operating system
Linux (also reproducible on macOS)
Install method
pnpm dev / npm global
Model
Any model (bug is model-agnostic; the loop is in the Agent runtime, not the LLM)
Provider / routing chain
Any provider (bug is provider-agnostic)
Additional provider/model setup details
NOT_ENOUGH_INFO
Logs, screenshots, and evidence
Production observations across 3 independent cases:
| Case | Trigger | Duration | LLM calls |
| --- | --- | --- | --- |
| 1 | timeout-compaction retry | 76 min | ~2130 |
| 2 | timeout-compaction retry | 2+ hours | ~952 (log truncated) |
| 3 | user rapid messages (652ms apart) | 103 min | 2617 |
Log signature of a zombie run:
- embedded run prompt end with durationMs=<very small, e.g. 22-26ms> (abortable() rejected immediately)
- Continued model.usage stopReason=toolUse lines after run cleanup for the same runId
- All tool results are "Aborted" (error result)
- embedded run done never appears
Impact and severity
Affected: Any user with messages.queue.mode: "interrupt" who sends rapid consecutive messages
Severity: High — silent resource drain, potential large API cost
Frequency: Near-deterministic with interrupt mode + rapid messages; lower probability via timeout-compaction
Consequence: Unbounded LLM API cost, server resource exhaustion, no user-visible indication of the problem
Additional information
Root cause: await abortable(activeSession.prompt(effectivePrompt)) in attempt.ts (introduced in 016693a1f). JavaScript evaluates activeSession.prompt() first (starting the async chain), then abortable() races it. When the signal is pre-aborted, abortable() rejects immediately but the floating Promise from prompt() creates a new Agent._runLoop() with a fresh abortController that nobody ever aborts.
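The evaluation-order hazard can be reproduced in isolation. The sketch below assumes a typical promise-racing abortable() helper and uses an illustrative runLoop stand-in; neither is the real attempt.ts or pi-agent-core code:

```typescript
// Minimal stand-in for the race. `abortable` here is an assumed, typical
// implementation, NOT the real attempt.ts helper.
function abortable<T>(promise: Promise<T>, signal: AbortSignal): Promise<T> {
  if (signal.aborted) return Promise.reject(signal.reason ?? new Error("aborted"));
  return new Promise<T>((resolve, reject) => {
    signal.addEventListener("abort", () => reject(signal.reason), { once: true });
    promise.then(resolve, reject);
  });
}

async function demonstrate(): Promise<{ atExit: number; afterWait: number }> {
  let iterations = 0;
  // Stand-in for Agent._runLoop(): runs on its own, never sees the outer signal.
  const runLoop = async () => {
    for (let i = 0; i < 3; i++) {
      await new Promise((r) => setTimeout(r, 10));
      iterations++;
    }
  };
  const signal = AbortSignal.abort(new Error("second message arrived"));
  let atExit = -1;
  try {
    // runLoop() is evaluated FIRST, so the loop is already running by the
    // time abortable() observes the pre-aborted signal and rejects.
    await abortable(runLoop(), signal);
  } catch {
    atExit = iterations; // the attempt exits here; the floating Promise lives on
  }
  await new Promise((r) => setTimeout(r, 100));
  return { atExit, afterWait: iterations };
}

demonstrate().then(({ atExit, afterWait }) => {
  // Prints: iterations at attempt exit: 0, after wait: 3
  console.log(`iterations at attempt exit: ${atExit}, after wait: ${afterWait}`);
});
```

The rejection and the floating Promise are independent: awaiting the abortable() wrapper only detaches the caller, it does not cancel the argument.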
Why the inner loop never stops: Agent._runLoop() in pi-agent-core only exits on stopReason === "error" | "aborted". The zombie's signal is never aborted. Tools throw AbortError (from the outer runAbortController.signal), but this is caught as an error tool result — the model retries indefinitely.
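A hedged sketch of that exit condition, with the shape and names assumed from the report rather than from the pi-agent-core source:

```typescript
// Assumed shape of the run loop exit condition; not the real pi-agent-core code.
type StopReason = "toolUse" | "stop" | "error" | "aborted";

async function runLoop(
  callLlm: () => Promise<StopReason>,
  signal: AbortSignal,
): Promise<StopReason> {
  for (;;) {
    // Never true for the zombie: its fresh abortController is never aborted.
    if (signal.aborted) return "aborted";
    const stopReason = await callLlm();
    // A tool throwing AbortError upstream is converted into an error tool
    // *result*, so stopReason remains "toolUse" and the loop iterates again.
    if (stopReason !== "toolUse") return stopReason;
  }
}

// Bounded demo: the real zombie's callLlm yields "toolUse" forever.
let llmCalls = 0;
runLoop(async () => (++llmCalls < 4 ? "toolUse" : "stop"), new AbortController().signal)
  .then((reason) => console.log(`exited with "${reason}" after ${llmCalls} LLM calls`));
// Prints: exited with "stop" after 4 LLM calls
```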
Why the circuit breaker doesn't fire: Tool wrapper order is abort-check (outer) → loop-detection (inner). The abort throw short-circuits before the loop detector ever runs.
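The ordering problem can be sketched as two composed wrappers (names are illustrative, not the real source):

```typescript
// Assumed wrapper composition, reconstructed from the report.
type Tool = () => Promise<string>;

let loopDetectorRuns = 0;

// Inner wrapper: would trip a circuit breaker after N identical calls.
const withLoopDetection = (tool: Tool): Tool => async () => {
  loopDetectorRuns++;
  return tool();
};

// Outer wrapper: throws before the inner wrapper can ever run.
const withAbortCheck = (tool: Tool, signal: AbortSignal): Tool => async () => {
  if (signal.aborted) {
    throw Object.assign(new Error("Request was aborted."), { name: "AbortError" });
  }
  return tool();
};

const wrapped = withAbortCheck(withLoopDetection(async () => "ok"), AbortSignal.abort());

wrapped().catch((err: Error) => {
  // Prints: AbortError thrown; loop detector ran 0 times
  console.log(`${err.name} thrown; loop detector ran ${loopDetectorRuns} times`);
});
```

Swapping the wrapper order (loop detection outermost) would let the breaker count the repeated aborted calls and trip.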
Proposed 3-layer fix:
- Pre-prompt guard: check aborted state before calling activeSession.prompt() — eliminates the floating Promise at source
- finally block: call agent.abort() + agent.clearAllQueues() during attempt cleanup — terminates any escaped Agent
- Per-run LLM call hard cap: shared counter across attempts, configurable via agents.defaults.maxLlmCallsPerRun — ultimate safety net independent of abort signal propagation
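Layers 1 and 2 could look roughly like this; identifier names come from the report where available, and the Agent interface and function name are assumptions:

```typescript
// Hedged sketch of fix layers 1 and 2; the AgentLike interface is assumed.
interface AgentLike {
  prompt: (p: string) => Promise<void>;
  abort: () => void;
  clearAllQueues?: () => void;
}

async function runPromptGuarded(
  agent: AgentLike,
  effectivePrompt: string,
  abortSignal: AbortSignal,
): Promise<void> {
  try {
    // Layer 1: pre-prompt guard. Checking BEFORE calling prompt() means the
    // floating Promise is never created when the signal is already aborted.
    if (abortSignal.aborted) {
      throw abortSignal.reason ?? new Error("aborted before prompt");
    }
    await agent.prompt(effectivePrompt);
  } finally {
    // Layer 2: cleanup in finally. Even if a Promise escaped, stop the Agent
    // and drop any queued work during attempt cleanup.
    agent.abort();
    agent.clearAllQueues?.();
  }
}

// Demo: with a pre-aborted signal, prompt() is never called but abort() still runs.
const calls: string[] = [];
runPromptGuarded(
  { prompt: async () => { calls.push("prompt"); }, abort: () => { calls.push("abort"); } },
  "first message",
  AbortSignal.abort(new Error("second message arrived")),
).catch(() => console.log(`calls made: ${calls.join(", ")}`)); // calls made: abort
```

Layer 3 would be a plain counter shared by every attempt of a run, checked before each LLM call against agents.defaults.maxLlmCallsPerRun; it needs no abort-signal propagation at all.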