
[Bug]: abortable(activeSession.prompt()) creates zombie Agent loop when signal is pre-aborted #74859

@zhumengzhu

Description

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Summary

When params.abortSignal is already aborted before activeSession.prompt() is called (e.g. rapid consecutive messages with messages.queue.mode: "interrupt"), abortable() immediately rejects but the prompt() async chain has already started. The floating Promise creates a new Agent._runLoop() with a fresh abortController that nobody ever aborts, causing the Agent to loop indefinitely calling the LLM after the attempt has exited. Observed: 2617 LLM calls over 103 minutes from a single zombie run.

Steps to reproduce

  1. Configure messages.queue.mode: "interrupt" in openclaw.json
  2. Send a message to the agent
  3. Within < 1 second, send a second message (interrupt mode aborts the first run)
  4. Observe that the first run's Agent continues calling the LLM in the background after the attempt has exited

Alternatively, run the reproduction test below which uses a pre-aborted AbortSignal to simulate the same condition deterministically.
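For step 1, the openclaw.json fragment would look roughly like this (a sketch; only the `messages.queue.mode` key is taken from this report, the surrounding structure is assumed):

```json
{
  "messages": {
    "queue": {
      "mode": "interrupt"
    }
  }
}
```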

agent-zombie-loop.test.ts
import { Agent, type AgentMessage } from "@mariozechner/pi-agent-core";
import type { Api, Message, Model } from "@mariozechner/pi-ai";
import { afterEach, beforeEach, describe, expect, it } from "vitest";
import {
  createDefaultEmbeddedSession,
  getHoisted,
  resetEmbeddedAttemptHarness,
  testModel,
} from "./attempt.spawn-workspace.test-support.js";

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));
const mockModel = testModel as unknown as Model<Api>;

const mockTool = {
  name: "mock_tool",
  label: "Mock Tool",
  description: "mock",
  parameters: { type: "object" as const, properties: {} },
  execute: async () => ({ content: [{ type: "text" as const, text: "Aborted" }], details: {} }),
};

function createToolUseStreamFn(tracker: { count: number }) {
  return async (_model: unknown, _context: unknown, options?: { signal?: AbortSignal }) => {
    tracker.count += 1;
    await sleep(5);
    if (options?.signal?.aborted) {
      const err = new Error("Request was aborted.");
      err.name = "AbortError";
      throw err;
    }
    const message = {
      role: "assistant" as const,
      content: [
        { type: "toolCall" as const, id: `call_${tracker.count}`, name: "mock_tool", arguments: {} },
      ],
      usage: { input: 70, output: 51, cacheRead: 0, cacheWrite: 0, totalTokens: 121, cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0, total: 0 } },
      stopReason: "toolUse" as const,
      timestamp: Date.now(),
    };
    return {
      [Symbol.asyncIterator]() {
        let done = false;
        return {
          async next() {
            if (!done) {
              done = true;
              return { done: false, value: { type: "done", message } };
            }
            return { done: true, value: undefined };
          },
        };
      },
      async result() { return message; },
    } as never;
  };
}

const hoisted = getHoisted();

describe("Agent zombie loop (upstream bug)", () => {
  beforeEach(() => { resetEmbeddedAttemptHarness(); });

  it("bug: abort-before-prompt produces floating Promise, Agent loops after attempt exits", { timeout: 10_000 }, async () => {
    const tracker = { count: 0 };
    const agent = new Agent({
      initialState: { systemPrompt: "test", model: mockModel, tools: [mockTool] },
      streamFn: createToolUseStreamFn(tracker),
      convertToLlm: (msgs: AgentMessage[]): Message[] =>
        msgs.filter((m) => ["user", "assistant", "toolResult"].includes(m.role)) as Message[],
    });

    hoisted.createAgentSessionMock.mockResolvedValue({
      session: createDefaultEmbeddedSession({
        prompt: async (_session, prompt) => {
          agent.prompt(prompt).catch(() => {});
          await sleep(50);
        },
      }),
    });

    const abortSignal = AbortSignal.abort(new Error("second message arrived"));
    const { runEmbeddedAttempt } = await import("./attempt.js");

    await runEmbeddedAttempt({
      sessionId: "zombie-test", sessionKey: "agent:main:main",
      sessionFile: "/tmp/zombie-test.jsonl", workspaceDir: "/tmp", agentDir: "/tmp",
      config: {}, prompt: "first message", timeoutMs: 5_000, runId: "zombie-run",
      provider: "openai", modelId: "gpt-test", model: mockModel,
      authStorage: { getApiKey: async () => undefined } as never,
      modelRegistry: {} as never, thinkLevel: "off",
      senderIsOwner: true, disableMessageTool: true, abortSignal,
    });

    const countAtExit = tracker.count;
    await sleep(500);
    const countAfterWait = tracker.count;

    console.log(`LLM calls at exit=${countAtExit}, after 500ms=${countAfterWait}, delta=${countAfterWait - countAtExit}`);
    expect(countAfterWait).toBeGreaterThan(countAtExit);

    agent.abort();
    agent.clearAllQueues?.();
    await agent.waitForIdle();
  });
});

Expected behavior

When a run is aborted (via interrupt mode, timeout, or RPC), the Agent should stop all LLM calls promptly. No floating Promises should outlive the attempt lifecycle.

Actual behavior

The Agent continues calling the LLM indefinitely after the attempt has returned. Each iteration consumes ~90k input tokens and ~35 output tokens; stopReason is always toolUse, every tool throws AbortError (caught as an error result), and the model retries the same tool call. The loop never terminates unless the process restarts.

OpenClaw version

All releases since v2026.1.20 (bug introduced in commit 016693a1f on 2026-01-18)

Operating system

Linux (also reproducible on macOS)

Install method

pnpm dev / npm global

Model

Any model (bug is model-agnostic; the loop is in the Agent runtime, not the LLM)

Provider / routing chain

Any provider (bug is provider-agnostic)

Additional provider/model setup details

NOT_ENOUGH_INFO

Logs, screenshots, and evidence

Production observations across 3 independent cases:

  Case  Trigger                            Duration  LLM calls
  1     timeout-compaction retry           76 min    ~2130
  2     timeout-compaction retry           2+ hours  ~952 (log truncated)
  3     user rapid messages (652ms apart)  103 min   2617

Log signature of a zombie run:

  • embedded run prompt end durationMs=<very small, e.g. 22-26ms> (abortable() rejected immediately)
  • Continued model.usage stopReason=toolUse lines after run cleanup for the same runId
  • All tool results are "Aborted" (error result)
  • embedded run done never appears

Impact and severity

Affected: Any user with messages.queue.mode: "interrupt" who sends rapid consecutive messages
Severity: High — silent resource drain, potential large API cost
Frequency: Near-deterministic with interrupt mode + rapid messages; lower probability via timeout-compaction
Consequence: Unbounded LLM API cost, server resource exhaustion, no user-visible indication of the problem

Additional information

Root cause: await abortable(activeSession.prompt(effectivePrompt)) in attempt.ts (introduced in 016693a1f). JavaScript evaluates activeSession.prompt() first (starting the async chain), then abortable() races it. When the signal is pre-aborted, abortable() rejects immediately but the floating Promise from prompt() creates a new Agent._runLoop() with a fresh abortController that nobody ever aborts.

Why the inner loop never stops: Agent._runLoop() in pi-agent-core only exits on stopReason === "error" | "aborted". The zombie run's fresh abortController signal is never aborted. Tools do throw AbortError (from the outer runAbortController.signal), but that is caught as an error tool result — so stopReason stays toolUse and the model retries indefinitely.

Why the circuit breaker doesn't fire: Tool wrapper order is abort-check (outer) → loop-detection (inner). The abort throw short-circuits before the loop detector ever runs.

Proposed 3-layer fix:

  1. Pre-prompt guard: check aborted state before calling activeSession.prompt() — eliminates the floating Promise at source
  2. finally block: call agent.abort() + agent.clearAllQueues() during attempt cleanup — terminates any escaped Agent
  3. Per-run LLM call hard cap: shared counter across attempts, configurable via agents.defaults.maxLlmCallsPerRun — ultimate safety net independent of abort signal propagation
