[Bug]: Discord agent session remains routable after timeout, causing partial-success plus generic failure #72810

@vishutdhar

Summary

A Discord-routed agent turn can complete useful side effects, then remain stuck in processing until the CLI timeout fires. OpenClaw then surfaces the generic user-facing failure message even though the work may already have been posted/applied. Later routing can still appear to target the wedged session, which makes verifier/worker state ambiguous and can trigger redundant follow-up dispatches.

This may be related to the claude-cli regression tracked in #72434, but the problematic behavior here is the session-health/routing outcome after a terminal timeout.

Environment

  • OpenClaw: 2026.4.24 from npm stable
  • Channel: Discord
  • Agent model: anthropic/claude-opus-4-7 via the Claude CLI-backed path
  • OS: macOS

Observed behavior

  1. A Discord agent turn starts and performs useful side effects. In the observed case, a review/verdict message and a local state update were successfully recorded.
  2. The same session remains in processing and is reported as stuck for several minutes.
  3. At the 900s CLI timeout, OpenClaw terminates the candidate and posts/surfaces the generic failure text:
     "Something went wrong while processing your request. Please try again, or use /new to start a fresh session."
  4. Follow-up routing is ambiguous: the agent looks like it can still receive work, but the session is effectively dead/wedged. A later verification had to be recovered from a separate route, while a redundant follow-up dispatch was created because the original verifier path looked silent.

Sanitized log shape

[diagnostic] lane task error: lane=session:agent:<agent>:discord:channel:<redacted>:active-memory:<redacted> durationMs=<small> error="Error: Requested agent harness "claude-cli" is not registered and PI fallback is disabled."
[diagnostic] stuck session: sessionId=unknown sessionKey=agent:<agent>:discord:channel:<redacted> state=processing age=<minutes>s queueDepth=1
[model-fallback/decision] model fallback decision: decision=candidate_failed requested=anthropic/claude-opus-4-7 candidate=anthropic/claude-opus-4-7 reason=timeout next=none detail=CLI exceeded timeout (900s) and was terminated.
Embedded agent failed before reply: CLI exceeded timeout (900s) and was terminated.

Expected behavior

After a fatal timeout or pre-reply embedded-agent failure, OpenClaw should make the session health unambiguous. Any of these would be safer than silently continuing to route to the wedged session:

  • mark the session failed/dead and require /new,
  • automatically reset/roll the session before accepting more work,
  • route the next turn to a fresh session,
  • or surface a clear "session timed out; previous side effects may have completed" state instead of only the generic failure message.

If side effects completed before the final timeout, the user-facing state should distinguish partial-success/late-failure from total failure.
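
To make the expected behavior concrete, here is a minimal sketch of what "unambiguous session health" could look like. It is not OpenClaw code: `SessionRecord`, `markTimedOut`, `userMessage`, and `routeNextTurn` are hypothetical names invented for illustration, and it assumes the runtime can tell whether any side effects were recorded before the timeout.

```typescript
// Hypothetical sketch only; none of these names come from OpenClaw.
type SessionHealth = "healthy" | "failed";
type TurnOutcome = "partial_success" | "total_failure";

interface SessionRecord {
  key: string;
  health: SessionHealth;
  // Side effects recorded before the turn ended (assumed to be trackable).
  sideEffects: string[];
}

// Called once when a turn ends in a fatal timeout or pre-reply failure.
// Marks the session as failed and classifies the outcome.
function markTimedOut(session: SessionRecord): TurnOutcome {
  session.health = "failed";
  return session.sideEffects.length > 0 ? "partial_success" : "total_failure";
}

// User-facing text that distinguishes late failure from total failure.
function userMessage(outcome: TurnOutcome): string {
  return outcome === "partial_success"
    ? "Session timed out; previous side effects may have completed. Use /new to start a fresh session."
    : "Session timed out before any work was recorded. Use /new to start a fresh session.";
}

// Routing guard: never hand new work to a failed session.
function routeNextTurn(
  session: SessionRecord,
  freshKey: () => string,
): SessionRecord {
  if (session.health === "healthy") {
    return session;
  }
  return { key: freshKey(), health: "healthy", sideEffects: [] };
}
```

With something like this, the next turn after a timeout would land on a fresh session instead of the wedged one, and the chat message would say whether earlier work may already have been applied.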

Impact

  • Users cannot tell whether the work failed or succeeded.
  • Verifier/worker workflows can create duplicate dispatches because the original route appears silent.
  • A watchdog sees the session in processing for many minutes, but the user-facing chat only gets a generic failure at the end.
  • The recovery path becomes manual: inspect logs/state, identify whether side effects completed, and route a fresh verifier/session by hand.

Redaction note

This report intentionally redacts Discord IDs, session IDs, dispatch IDs, local paths, project names, internal agent nicknames, and exact local timestamps. The included log snippets preserve only the error shape needed to diagnose the runtime behavior.
