[Discussion] Production reliability gaps: silent empty replies, MCP timeout loops, and missing lifecycle/status UX #14797

## Summary

First, thank you for OpenClaw — we are actively using it in real multi-channel workflows (QQ + Telegram + MCP/browser tasks), and it is very capable.

This is a **comprehensive field report** from continuous production-like usage, combining several community posts and long troubleshooting sessions.

The goal is not to complain, but to provide actionable reliability improvements.

---

## Community references consolidated

- LinuxDo: https://linux.do/t/topic/1610098
- AIYA #60 (Chrome MCP gets trapped in repeated permission loops): https://aiya.de5.net/t/topic/60
- AIYA #61 (agent over-focuses MCP/skill path; weak "look-at-screen" behavior): https://aiya.de5.net/t/topic/61
- AIYA #62 (no proactive pre-restart/pre-disconnect notice): https://aiya.de5.net/t/topic/62
- AIYA #63 (long, detailed empty-reply root-cause investigation): https://aiya.de5.net/t/topic/63

Related upstream items:

- #14064 (silent empty replies in large session)
- #5771 (context overflow)
- #14157 (detect silent overflow branch)

---

## Part A — Silent empty replies in large sessions (already known, still critical)

Observed repeatedly:

- very high `usage.input` (e.g. 260k–300k+)
- `usage.output = 0`
- assistant `content = []`
- user sees "no reply" without a clear reason

Likely trigger in our case:

- long sessions + multiple large `toolResult` payloads (web/search dumps) persisted into history

This part is covered by the direction taken in #14064 and #14157, and we support that fix.
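
The symptom pattern above is easy to detect mechanically. Here is a minimal sketch, assuming a turn record shaped like the fields we observed (`usage.input`, `usage.output`, `content`). The `AssistantTurn` class, the function names, and the 200k threshold are all hypothetical and are not taken from the OpenClaw codebase.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class AssistantTurn:
    # Hypothetical record; the fields mirror usage.input, usage.output and content.
    usage_input: int
    usage_output: int
    content: List[Any] = field(default_factory=list)


def is_silent_empty_reply(turn: AssistantTurn, input_threshold: int = 200_000) -> bool:
    """True when a turn matches the observed pattern: huge input, zero output, no content."""
    return (
        turn.usage_input >= input_threshold  # assumed threshold, tune per model
        and turn.usage_output == 0
        and not turn.content
    )


def user_notice(turn: AssistantTurn) -> Optional[str]:
    """Return a short message for the channel instead of silence, or None if the turn is fine."""
    if is_silent_empty_reply(turn):
        return (
            f"No reply was generated (input was {turn.usage_input} tokens). "
            "The session is probably too large; please compact or start a new session."
        )
    return None
```

Even this small check would turn "no reply" into an explicit message the user can act on.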

---

## Part B — MCP/browser timeout loops need better failure diagnosis

From AIYA topics #60 and #61, and from direct observation:

- the agent can repeatedly retry MCP/browser actions without realizing that the true blocker is on-screen state (e.g., system permission dialog / password prompt)
- behavior becomes "retry loop" instead of diagnosis

### Requested improvement

When an MCP/browser action times out or fails N times in a row:

1. force one screenshot/state inspection step
2. summarize visible blocker candidates (permission dialog, auth prompt, blocked modal, etc.)
3. ask targeted user confirmation before repeating the same action

In short: **timeout should trigger state inspection, not blind retry**.
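
To make the requested flow concrete, here is a minimal sketch of a retry wrapper that stops after N timeouts, runs one inspection step, and hands the result to the user. It assumes the MCP/browser action and the screenshot/state inspection are plain callables; `run_with_diagnosis`, `NeedsUserConfirmation`, and `inspect_screen` are hypothetical names, not existing OpenClaw APIs.

```python
from typing import Callable, List


class NeedsUserConfirmation(Exception):
    """Raised instead of retrying again; carries the blocker candidates seen on screen."""

    def __init__(self, blockers: List[str]):
        super().__init__("Action keeps timing out; possible blockers: " + ", ".join(blockers))
        self.blockers = blockers


def run_with_diagnosis(
    action: Callable[[], str],
    inspect_screen: Callable[[], List[str]],
    max_failures: int = 3,
) -> str:
    """Run `action`; after `max_failures` timeouts, inspect state once and stop retrying."""
    failures = 0
    while True:
        try:
            return action()
        except TimeoutError:
            failures += 1
            if failures >= max_failures:
                # Steps 1 and 2: one forced inspection, summarized as blocker candidates.
                blockers = inspect_screen()
                # Step 3: hand control back so the user can confirm before any further retry.
                raise NeedsUserConfirmation(blockers)
```

The caller catches `NeedsUserConfirmation` and asks something targeted, such as "I see a system permission dialog, should I wait for you to approve it?", rather than repeating the same call.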

---

## Part C — Need proactive lifecycle notices (restart/disconnect/network changes)

From AIYA topic #62 and operational experience:

- users interpret silence as model/channel failure
- gateway/channel restarts or network interruptions currently feel "sudden"

### Requested improvement

Add optional pre-event hooks / notifications:

- before gateway restart
- before channel reconnection/disconnect
- after recovery ("back online")

This is especially important for chat-first deployments.
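
One possible shape for such hooks is a small event registry that channel plugins subscribe to. This is only a sketch under our own assumptions: `LifecycleEvent`, `LifecycleHooks`, and the event names are hypothetical and do not describe an existing OpenClaw interface.

```python
from collections import defaultdict
from enum import Enum
from typing import Callable, DefaultDict, List


class LifecycleEvent(Enum):
    BEFORE_RESTART = "before_restart"
    BEFORE_DISCONNECT = "before_disconnect"
    RECOVERED = "recovered"


Handler = Callable[[LifecycleEvent, str], None]


class LifecycleHooks:
    """Registry that lets channels announce lifecycle events to their users."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[LifecycleEvent, List[Handler]] = defaultdict(list)

    def on(self, event: LifecycleEvent, handler: Handler) -> None:
        self._handlers[event].append(handler)

    def emit(self, event: LifecycleEvent, detail: str = "") -> int:
        """Call every handler for `event`; return how many completed without error."""
        delivered = 0
        for handler in self._handlers[event]:
            try:
                handler(event, detail)
                delivered += 1
            except Exception:
                # A failing notifier must never block the restart or reconnect itself.
                continue
        return delivered
```

A Telegram or QQ plugin could register a handler that sends "Gateway restarting, back shortly" on `BEFORE_RESTART` and "Back online" on `RECOVERED`.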

---

## Part D — Better user-visible runtime observability

A practical pain point: users cannot easily tell whether the agent is still working, blocked, or dead.

### Requested improvement

Expose lightweight status that channels can render:

- `idle / thinking / tool-running / waiting-user / error`
- queued task count
- last successful output timestamp
- recent failure reason (short)
- session context pressure indicator (e.g., 72%)

This reduces repeated "are you alive?" messages and unnecessary retries.
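
As an illustration of how small this payload could be, here is a sketch of a status record carrying the fields listed above. `RuntimeStatus`, `AgentState`, and the field names are hypothetical; they only show one way the data could be serialized and rendered by a channel.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Optional


class AgentState(Enum):
    IDLE = "idle"
    THINKING = "thinking"
    TOOL_RUNNING = "tool-running"
    WAITING_USER = "waiting-user"
    ERROR = "error"


@dataclass
class RuntimeStatus:
    state: AgentState
    queued_tasks: int
    last_output_at: Optional[str]  # ISO 8601 timestamp of the last successful output
    last_failure: Optional[str]    # short reason for the most recent failure, if any
    context_pressure: float        # fraction of the context window in use, 0.0 to 1.0

    def to_json(self) -> str:
        """Serialize for a status endpoint; the enum is emitted as its string value."""
        data = asdict(self)
        data["state"] = self.state.value
        return json.dumps(data)

    def render(self) -> str:
        """One-line summary that a chat channel can show in reply to a status query."""
        return f"{self.state.value} | queue {self.queued_tasks} | context {self.context_pressure:.0%}"
```

A channel could answer an "are you alive?" message with `render()` directly, without involving the model at all.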

---

## Part E — Guardrails for context growth (prevention, not only recovery)

Beyond silent-overflow detection, we strongly suggest prevention:

1. preflight token budget check before model call
2. auto-compact or block with clear guidance when over threshold
3. cap/trim huge `toolResult` by default (summary + links), archive raw separately

Without this, long-running agent sessions eventually degrade in real usage.
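
The three guardrails can be sketched in a few lines. This is a rough illustration rather than a proposed implementation: the four-characters-per-token estimate is a crude heuristic, and `estimate_tokens`, `trim_tool_result`, `preflight`, the 80% threshold, and the `archive_ref` label are all assumptions of ours.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    # Crude heuristic (about 4 characters per token); a real tokenizer should replace this.
    return len(text) // 4


def trim_tool_result(payload: str, max_chars: int = 4000, archive_ref: str = "archive") -> str:
    """Keep the head of an oversized toolResult and point to where the raw output is stored."""
    if len(payload) <= max_chars:
        return payload
    dropped = len(payload) - max_chars
    return payload[:max_chars] + f"\n[truncated {dropped} chars; raw output stored at {archive_ref}]"


def preflight(messages: List[str], context_limit: int, threshold: float = 0.8) -> str:
    """Decide before the model call: 'ok', 'compact' (over threshold), or 'block' (over limit)."""
    used = sum(estimate_tokens(m) for m in messages)
    pressure = used / context_limit
    if pressure >= 1.0:
        return "block"
    if pressure >= threshold:
        return "compact"
    return "ok"
```

The `pressure` value computed here is the same number the context pressure indicator in Part D would display.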

---

## Why this should be tracked as one umbrella issue

These are not isolated edge cases; they form one reliability chain:

- tool/MCP loops increase context and user confusion
- context grows silently
- empty replies appear without clear guidance
- restart/disconnect events are not proactively communicated
- users cannot distinguish "busy" vs "broken"

Solving only one node (e.g., silent overflow detection) helps, but does not fully resolve production UX.

---

## Suggested implementation order

1. merge silent-overflow handling (`length + output=0`) — immediate pain relief
2. add preflight token guard + big toolResult truncation policy
3. add timeout-triggered screenshot/state diagnosis for MCP/browser flows
4. add lifecycle pre-notification hooks (restart/disconnect/recover)
5. add channel-friendly runtime status API

---

## Offer to help

If useful, we can provide:

- sanitized logs with token progression timeline
- reproducible MCP timeout-loop traces
- before/after behavior from plugin-level mitigations (QQ integration)

Thanks again for the project and for considering this comprehensive feedback.

