[Discussion] Production reliability gaps: silent empty replies, MCP timeout loops, and missing lifecycle/status UX #14797

@constansino

Summary

First, thank you for OpenClaw. We are actively using it in real multi-channel workflows (QQ + Telegram + MCP/browser tasks), and it is very capable.

This is a comprehensive field report from continuous production-like usage, combining several community posts and long troubleshooting sessions.

The goal is not to complain, but to provide actionable reliability improvements.


Community references consolidated

Related upstream items:


Part A — Silent empty replies in large sessions (already known, still critical)

Observed repeatedly:

  • very high usage.input (e.g. 260k–300k+)
  • usage.output = 0
  • assistant content = []
  • user sees "no reply" without a clear reason

Likely trigger in our case:

  • long sessions + multiple large toolResult payloads (web/search dumps) persisted into history

This part is covered by the direction taken in #14064 / #14157, and we support that fix.
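
To illustrate the user-facing side of this, here is a minimal sketch of a check that turns the symptom above into an explicit message. The `usage.input`, `usage.output`, and `content` fields follow the notation in this report; the function name and the `context_limit` parameter are our own assumptions, not OpenClaw APIs.

```python
from typing import Any, Mapping, Optional


def silent_empty_reason(response: Mapping[str, Any], context_limit: int) -> Optional[str]:
    """Return a user-facing explanation if the reply is silently empty, else None.

    `response` is a hypothetical dict shaped like the fields quoted in this report.
    """
    usage = response.get("usage", {})
    input_tokens = usage.get("input", 0)
    output_tokens = usage.get("output", 0)
    content = response.get("content") or []

    if output_tokens > 0 or content:
        return None  # a normal reply; nothing to report

    pct = 100 * input_tokens / context_limit if context_limit else 0
    return (
        f"No reply was generated (input {input_tokens} tokens, "
        f"about {pct:.0f}% of the {context_limit}-token limit). "
        "The session is probably too large; try compacting or starting a new session."
    )
```

A channel plugin could send the returned string instead of staying silent, so the user at least sees why nothing arrived.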


Part B — MCP/browser timeout loops need better failure diagnosis

From topics #60/#61 and direct observation:

  • the agent can repeatedly retry MCP/browser actions without realizing that the true blocker is on-screen state (e.g., a system permission dialog or password prompt)
  • the behavior becomes a "retry loop" instead of a diagnosis

Requested improvement

When an MCP/browser action times out or fails N times in a row:

  1. force one screenshot/state inspection step
  2. summarize visible blocker candidates (permission dialog, auth prompt, blocked modal, etc.)
  3. ask targeted user confirmation before repeating the same action

In short: timeout should trigger state inspection, not blind retry.
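
The three steps above can be sketched roughly as follows. Everything here is hypothetical: `action` stands in for any MCP/browser call, `inspect_screen` for a screenshot/state inspection step that returns blocker candidates, and `NeedsUserConfirmation` for whatever mechanism the agent uses to pause and ask the user.

```python
from typing import Callable, List


class NeedsUserConfirmation(Exception):
    """Raised when repeated timeouts suggest an on-screen blocker (hypothetical)."""

    def __init__(self, blockers: List[str]):
        super().__init__("action blocked; user confirmation required")
        self.blockers = blockers


def run_with_diagnosis(
    action: Callable[[], str],
    inspect_screen: Callable[[], List[str]],
    max_failures: int = 2,
) -> str:
    """Retry `action`, but after `max_failures` timeouts inspect state and stop."""
    failures = 0
    while True:
        try:
            return action()
        except TimeoutError:
            failures += 1
            if failures >= max_failures:
                # Step 1: force one inspection instead of another blind retry.
                blockers = inspect_screen()
                # Steps 2 and 3: surface the candidates and hand control to the user.
                raise NeedsUserConfirmation(blockers)
```

The key design choice is that the loop has a hard exit: once the failure budget is spent, the only way forward is through an inspection result the user can act on.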


Part C — Need proactive lifecycle notices (restart/disconnect/network changes)

From topic #62 and operational experience:

  • users interpret silence as model/channel failure
  • gateway/channel restarts or network interruptions currently feel "sudden"

Requested improvement

Add optional pre-event hooks / notifications:

  • before gateway restart
  • before channel reconnection/disconnect
  • after recovery ("back online")

This is especially important for chat-first deployments.
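
As a rough sketch of what such hooks might look like from a plugin's point of view (the class, the event names, and the handler signature are all our assumptions, not an existing OpenClaw interface):

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[str], None]


class LifecycleHooks:
    """Tiny in-process hook registry for lifecycle notices (hypothetical)."""

    EVENTS = ("before_restart", "before_disconnect", "recovered")

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def on(self, event: str, handler: Handler) -> None:
        if event not in self.EVENTS:
            raise ValueError(f"unknown lifecycle event: {event}")
        self._handlers[event].append(handler)

    def emit(self, event: str, detail: str = "") -> int:
        """Call every handler for `event`; return how many completed.

        A failing handler is skipped so that a broken notifier can never
        block the restart or reconnect it is announcing.
        """
        delivered = 0
        for handler in self._handlers.get(event, []):
            try:
                handler(detail)
                delivered += 1
            except Exception:
                continue
        return delivered
```

A channel plugin would register something like `hooks.on("before_restart", send_notice)` and the gateway would call `hooks.emit("before_restart", "config reload")` just before going down.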


Part D — Better user-visible runtime observability

A practical pain point: users cannot easily tell whether the agent is still working, blocked, or dead.

Requested improvement

Expose lightweight status that channels can render:

  • idle / thinking / tool-running / waiting-user / error
  • queued task count
  • last successful output timestamp
  • recent failure reason (short)
  • session context pressure indicator (e.g., 72%)

This reduces repeated "are you alive?" messages and unnecessary retries.
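
One possible shape for that status payload is sketched below. The field names and the one-line rendering are our assumptions; the states are the ones listed above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AgentState(str, Enum):
    IDLE = "idle"
    THINKING = "thinking"
    TOOL_RUNNING = "tool-running"
    WAITING_USER = "waiting-user"
    ERROR = "error"


@dataclass
class RuntimeStatus:
    """Lightweight status snapshot a channel could render (hypothetical)."""

    state: AgentState
    queued_tasks: int = 0
    last_output_at: Optional[str] = None  # e.g. an ISO 8601 timestamp
    last_failure: Optional[str] = None    # short reason only
    context_used_tokens: int = 0
    context_limit_tokens: int = 0

    @property
    def context_pressure_pct(self) -> int:
        if self.context_limit_tokens <= 0:
            return 0
        return round(100 * self.context_used_tokens / self.context_limit_tokens)

    def render(self) -> str:
        """Format the snapshot as one short line for a chat channel."""
        parts = [
            self.state.value,
            f"queue {self.queued_tasks}",
            f"context {self.context_pressure_pct}%",
        ]
        if self.last_output_at:
            parts.append(f"last output {self.last_output_at}")
        if self.last_failure:
            parts.append(f"last failure: {self.last_failure}")
        return " | ".join(parts)
```

For example, a snapshot in the `tool-running` state with 2 queued tasks and 144,000 of 200,000 tokens used would render as `tool-running | queue 2 | context 72%`.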


Part E — Guardrails for context growth (prevention, not only recovery)

Beyond silent-overflow detection, we strongly suggest prevention:

  1. preflight token budget check before model call
  2. auto-compact or block with clear guidance when over threshold
  3. cap/trim huge toolResult by default (summary + links), archive raw separately

Without this, long-running agent sessions eventually degrade in real usage.
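
A minimal sketch of guardrails 1 and 3, with the decision for guardrail 2 returned as a simple string. The 4-characters-per-token estimate, the 80% threshold, the 2000-character cap, and all function names are assumptions chosen for illustration; a real implementation would use the model's tokenizer and configurable limits.

```python
from typing import Dict, List, Tuple


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token.
    return (len(text) + 3) // 4


def trim_tool_result(text: str, max_chars: int = 2000) -> Tuple[str, bool]:
    """Cap a large toolResult; return (text, was_trimmed).

    Archiving the raw payload is left to the caller.
    """
    if len(text) <= max_chars:
        return text, False
    omitted = len(text) - max_chars
    note = f"\n[truncated {omitted} chars; raw output archived separately]"
    return text[:max_chars] + note, True


def preflight(messages: List[Dict[str, str]], limit: int, threshold: float = 0.8) -> str:
    """Return "ok" if the estimated input fits the budget, else "compact"."""
    total = sum(estimate_tokens(m["content"]) for m in messages)
    return "ok" if total <= limit * threshold else "compact"
```

On "compact", the caller would either auto-compact the history or block the call with clear guidance, rather than sending an oversized request and getting an empty reply back.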


Why this should be tracked as one umbrella issue

These are not isolated edge cases; they form one reliability chain:

  • tool/MCP loops increase context and user confusion
  • context grows silently
  • empty replies appear without clear guidance
  • restart/disconnect events are not proactively communicated
  • users cannot distinguish "busy" vs "broken"

Solving only one node (e.g., silent overflow detection) helps, but does not fully resolve production UX.


Suggested implementation order

  1. merge silent-overflow handling (length + output=0) for immediate pain relief
  2. add preflight token guard + big toolResult truncation policy
  3. add timeout-triggered screenshot/state diagnosis for MCP/browser flows
  4. add lifecycle pre-notification hooks (restart/disconnect/recover)
  5. add channel-friendly runtime status API

Offer to help

If useful, we can provide:

  • sanitized logs with token progression timeline
  • reproducible MCP timeout-loop traces
  • before/after behavior from plugin-level mitigations (QQ integration)

Thanks again for the project and for considering this comprehensive feedback.

Labels: enhancement