[Bug]: async task completion reports can be lost because system event/wake is not reliably session-targeted #52305

@lazica-hub

Bug type

Bug / async completion routing

Summary

Async task completion reporting is unreliable when external task runners (for example Codex via exec) try to notify OpenClaw with:

openclaw system event --text "...done..." --mode now

In practice this can fail in two ways:

  1. the CLI call itself fails with local gateway websocket errors such as:
    • gateway closed (1006 abnormal closure)
    • target ws://127.0.0.1:18789
  2. even when the call itself succeeds, system event / wake is not session-targeted, so the completion report is not reliably routed back to the originating user conversation

This makes long-running background tasks feel "done but never reported" unless the user asks again.

Environment

  • OpenClaw: 2026.3.13 (61d171a)
  • OS: macOS
  • Gateway mode: local
  • Gateway bind: loopback
  • Messaging surface: Telegram direct chat
  • Typical runner: exec background task launching Codex / external orchestrator

What I observed

I reproduced this with a simple background coding task:

  1. Start a long-ish Codex task via background exec
  2. Ask Codex to run this on completion:
       openclaw system event --text "Done: Built a concise responsive login page in the temp project directory" --mode now
  3. The coding task finishes successfully
  4. The completion notification does not arrive in the originating Telegram conversation

In one captured run, Codex did execute the command, but it failed with:

gateway closed (1006 abnormal closure)
Gateway target: ws://127.0.0.1:18789

At the same time the actual task output/files were present, confirming the work completed.

Root cause analysis

After tracing the current implementation, this looks like a product/architecture gap rather than only a single transport glitch:

A. openclaw system event is implemented as a wake

The CLI path does not act like a reliable completion callback. It effectively does:

  • enqueue system event text
  • request heartbeat

B. wake is not session-targeted

The current wake path does not carry sessionKey in the relevant CLI flow.

That means the event is not reliably bound to the originating conversation that launched the async task.

C. heartbeat defaults to the agent's main session when no forced session is provided

So even if the wake/system event path works, it does not guarantee delivery back to the original user thread / DM that triggered the task.
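
Points B and C can be illustrated with a toy model. This is only a sketch of the behavior described above, not OpenClaw's actual code: `ToyGateway`, `MAIN_SESSION`, and the session key strings are hypothetical names invented for the example.

```python
from dataclasses import dataclass, field
from typing import Optional

MAIN_SESSION = "agent:main"  # hypothetical key for the agent's main session


@dataclass
class ToyGateway:
    """Toy model of the wake path; not OpenClaw's actual implementation."""

    pending: dict[str, list[str]] = field(default_factory=dict)

    def wake(self, text: str, session_key: Optional[str] = None) -> str:
        # With no session key, the heartbeat falls back to the main session.
        target = session_key or MAIN_SESSION
        self.pending.setdefault(target, []).append(text)
        return target


gateway = ToyGateway()
origin = "telegram:dm:12345"  # hypothetical key of the chat that launched the task

# The CLI flow as analyzed above: no session key is carried.
untargeted = gateway.wake("Done: built login page")

# What a session-targeted wake would do instead.
targeted = gateway.wake("Done: built login page", session_key=origin)
```

In this model the untargeted event is queued for the main session, so the originating Telegram chat never sees it unless it happens to be the main session.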

D. local websocket fragility makes it worse

From external task runners, the local gateway websocket path can also fail with 1006 abnormal closure, so the fallback notification bridge is itself not reliable.
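
As a stopgap for the transport failure only, an external runner could retry the notification with backoff. The sketch below assumes the CLI exits non-zero when the gateway closes with 1006, which is not confirmed here; `notify_with_retry` and `run_cli` are hypothetical helpers, and only the command line itself comes from this report. Retrying does not address the routing gap in B and C.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_cli(argv: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(argv), check=False).returncode


def notify_with_retry(
    text: str,
    attempts: int = 4,
    base_delay: float = 0.5,
    runner: Callable[[Sequence[str]], int] = run_cli,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry the notification command with exponential backoff.

    Assumption: a non-zero exit code means the gateway call failed.
    """
    argv = ["openclaw", "system", "event", "--text", text, "--mode", "now"]
    for attempt in range(attempts):
        if runner(argv) == 0:
            return True
        # Back off before the next try, but not after the last one.
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return False
```

The `runner` and `sleep` parameters exist so the retry logic can be exercised without a running gateway.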

Why this matters

This creates a bad UX for background tasks:

  • task actually completes
  • OpenClaw may know something happened
  • user still gets no completion report
  • user has to manually ask "is it done?"

This is especially noticeable for:

  • Codex / ACP tasks launched from chat
  • background exec jobs
  • external orchestrators like ClawTeam

Expected behavior

At least one of these should be true:

  1. openclaw system event / wake supports an explicit sessionKey and reliably wakes the originating session
  2. async exec completion events preserve originating session context automatically
  3. there is a first-class completion notification path for background tasks that can deliver to the originating channel/session without depending on main-session heartbeat inference
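
Options 2 and 3 amount to recording the originating session when the task is launched and delivering to it on completion. A minimal sketch of that shape, with `CompletionRegistry`, `launch`, `complete`, and the session key format all hypothetical rather than existing OpenClaw APIs:

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class CompletionRegistry:
    """Hypothetical registry binding background tasks to their origin."""

    deliver: Callable[[str, str], None]  # (session_key, message) -> None
    origins: dict[str, str] = field(default_factory=dict)

    def launch(self, task_id: str, session_key: str) -> None:
        # Capture the originating session when the task starts.
        self.origins[task_id] = session_key

    def complete(self, task_id: str, summary: str) -> str:
        # Deliver to the recorded origin; an unknown task raises KeyError
        # rather than falling back to a default session.
        session_key = self.origins.pop(task_id)
        self.deliver(session_key, f"Done: {summary}")
        return session_key


sent: list[tuple[str, str]] = []
registry = CompletionRegistry(deliver=lambda key, msg: sent.append((key, msg)))
registry.launch("task-1", "telegram:dm:12345")
registry.complete("task-1", "built login page")
```

The point of the design is that the delivery target is fixed at launch time, so completion never depends on main-session heartbeat inference.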

Related work already in the repo

This seems closely related to:

Suggestion

I suspect the real fix is not just transport retry. The bigger gap is that system event / wake is currently used as if it were a completion callback, but it is really an internal wake/heartbeat mechanism.

So the best fix is probably one or both of:

  • explicit session targeting for wake/system-event entry points
  • a first-class completion notification mechanism for async/background tasks

If useful, I can provide a more detailed repro timeline and the exact local logs / Codex transcript snippets that showed:

  • successful task completion
  • attempted openclaw system event
  • failure with gateway closed (1006 abnormal closure)
  • no proactive Telegram completion report
