Summary
First, thank you for OpenClaw — we are actively using it in real multi-channel workflows (QQ + Telegram + MCP/browser tasks), and it is very capable.
This is a comprehensive field report from continuous production-like usage, combining several community posts and long troubleshooting sessions.
The goal is not to complain, but to provide actionable reliability improvements.
Community references consolidated
- LinuxDo: https://linux.do/t/topic/1610098
- AIYA #60 (Chrome MCP gets trapped in repeated permission loops): https://aiya.de5.net/t/topic/60
- AIYA #61 (agent over-focuses MCP/skill path; weak "look-at-screen" behavior): https://aiya.de5.net/t/topic/61
- AIYA #62 (no proactive pre-restart/pre-disconnect notice): https://aiya.de5.net/t/topic/62
- AIYA #63 (long, detailed empty-reply root-cause investigation): https://aiya.de5.net/t/topic/63
Related upstream items:
- [Bug]: Session exceeding context window produces silent empty replies — no compaction triggered #14064 (silent empty replies in large session)
- [Bug]: Context overflow error #5771 (context overflow)
- fix(agents): detect silent context overflow (stopReason=length, output=0) #14157 (detect silent overflow branch)
Part A — Silent empty replies in large sessions (already known, still critical)
Observed repeatedly:
- very high `usage.input` (e.g. 260k–300k+)
- `usage.output = 0`
- assistant `content = []`
- user sees "no reply" without a clear reason
Likely trigger in our case:
- long sessions + multiple large `toolResult` payloads (web/search dumps) persisted into history
This part is covered by the direction of #14064 / #14157, and we support that fix.
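For illustration, the symptom pattern above can be checked in a few lines. This is a minimal sketch, not the actual change in #14157: it assumes the model response is available as a plain dict carrying the `stopReason`, `usage.input`, `usage.output`, and `content` fields named in this report, and the function names are hypothetical.

```python
from typing import Any, Mapping, Optional


def is_silent_overflow(response: Mapping[str, Any]) -> bool:
    """True when a response matches the silent-overflow pattern described above."""
    usage = response.get("usage") or {}
    content = response.get("content") or []
    return (
        response.get("stopReason") == "length"
        and usage.get("output", 0) == 0
        and len(content) == 0
    )


def user_facing_notice(response: Mapping[str, Any]) -> Optional[str]:
    """Return a short explanation to show instead of an empty reply, or None."""
    if not is_silent_overflow(response):
        return None
    used = (response.get("usage") or {}).get("input", 0)
    return (
        f"No reply was generated: the session context is likely full "
        f"({used:,} input tokens). Compact the session or start a new one."
    )
```

Even before compaction is wired in, surfacing a notice like this would replace the unexplained "no reply" with something the user can act on.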
Part B — MCP/browser timeout loops need better failure diagnosis
From topic #60/#61 and direct observation:
- agent can repeatedly retry MCP/browser actions without realizing the true blocker is on-screen state (e.g., system permission dialog / password prompt)
- behavior becomes "retry loop" instead of diagnosis
Requested improvement
When an MCP/browser action times out or fails repeatedly (N times):
- force one screenshot/state inspection step
- summarize visible blocker candidates (permission dialog, auth prompt, blocked modal, etc.)
- ask targeted user confirmation before repeating the same action
In short: timeout should trigger state inspection, not blind retry.
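The requested behavior could look roughly like the wrapper below. It is a sketch under stated assumptions: `action` and `inspect_screen` are hypothetical callables (the second stands in for a screenshot/state inspection step that returns blocker candidates), and only `TimeoutError` is treated as a retryable failure.

```python
from typing import Any, Callable, Dict, List


def run_with_diagnosis(
    action: Callable[[], Any],
    inspect_screen: Callable[[], List[str]],
    max_failures: int = 2,
) -> Dict[str, Any]:
    """Run `action`; after `max_failures` timeouts, inspect state instead of retrying."""
    last_error = ""
    for _ in range(max_failures):
        try:
            return {"status": "ok", "result": action()}
        except TimeoutError as exc:
            last_error = str(exc)
    # Every attempt timed out: inspect the screen once and hand control to the user.
    return {
        "status": "needs-user",
        "failures": max_failures,
        "last_error": last_error,
        "blockers": inspect_screen(),
    }
```

The caller would turn a `needs-user` result into a targeted question ("I see a permission dialog; should I wait for you to approve it?") rather than issuing the same action again.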
Part C — Need proactive lifecycle notices (restart/disconnect/network changes)
From topic #62 and operational experience:
- users interpret silence as model/channel failure
- gateway/channel restarts or network interruptions currently feel "sudden"
Requested improvement
Add optional pre-event hooks / notifications:
- before gateway restart
- before channel reconnection/disconnect
- after recovery ("back online")
This is especially important for chat-first deployments.
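One possible shape for such hooks is a small registry that the gateway calls around lifecycle events. The event names and the `LifecycleHooks` class below are hypothetical and do not describe any existing OpenClaw API.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

# Hypothetical event names mirroring the three cases listed above.
EVENTS = ("before_restart", "before_disconnect", "after_recovery")

Handler = Callable[[str, str], None]


class LifecycleHooks:
    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def on(self, event: str, handler: Handler) -> None:
        """Register a handler (for example, a channel notifier) for one event."""
        if event not in EVENTS:
            raise ValueError(f"unknown lifecycle event: {event}")
        self._handlers[event].append(handler)

    def emit(self, event: str, reason: str = "") -> int:
        """Call every handler for `event`; return how many completed."""
        delivered = 0
        for handler in self._handlers[event]:
            try:
                handler(event, reason)
                delivered += 1
            except Exception:
                # A failing notifier must never block the restart itself.
                continue
        return delivered
```

A chat channel plugin would register a handler that posts "restarting, back in a moment" on `before_restart` and "back online" on `after_recovery`.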
Part D — Better user-visible runtime observability
A practical pain point: users cannot easily tell whether the agent is still working, blocked, or dead.
Requested improvement
Expose lightweight status that channels can render:
- `idle / thinking / tool-running / waiting-user / error`
- queued task count
- last successful output timestamp
- recent failure reason (short)
- session context pressure indicator (e.g., 72%)
This reduces repeated "are you alive?" messages and unnecessary retries.
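As a sketch of what such a status payload might contain, the dataclass below carries the fields listed above. All field names are illustrative assumptions, and the pressure indicator is simply used tokens divided by the context window.

```python
from dataclasses import dataclass
from typing import Optional

STATES = ("idle", "thinking", "tool-running", "waiting-user", "error")


@dataclass
class AgentStatus:
    state: str
    context_window_tokens: int
    context_used_tokens: int = 0
    queued_tasks: int = 0
    last_output_at: Optional[str] = None  # ISO 8601 timestamp of last successful output
    last_failure: Optional[str] = None    # short, user-readable failure reason

    def __post_init__(self) -> None:
        if self.state not in STATES:
            raise ValueError(f"unknown state: {self.state}")
        if self.context_window_tokens <= 0:
            raise ValueError("context_window_tokens must be positive")

    @property
    def context_pressure(self) -> int:
        """Context usage as a whole percentage, capped at 100."""
        ratio = self.context_used_tokens / self.context_window_tokens
        return min(100, round(100 * ratio))

    def render(self) -> str:
        """One-line summary a chat channel could show on request."""
        parts = [
            self.state,
            f"queued: {self.queued_tasks}",
            f"context: {self.context_pressure}%",
        ]
        if self.last_failure:
            parts.append(f"last failure: {self.last_failure}")
        return " | ".join(parts)
```

For example, `AgentStatus(state="tool-running", context_window_tokens=200000, context_used_tokens=144000, queued_tasks=2).render()` yields `tool-running | queued: 2 | context: 72%`.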
Part E — Guardrails for context growth (prevention, not only recovery)
Beyond silent-overflow detection, we strongly suggest prevention:
- preflight token budget check before model call
- auto-compact or block with clear guidance when over threshold
- cap/trim huge `toolResult` by default (summary + links), archive raw separately
Without this, long-running agent sessions eventually degrade in real usage.
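The two guardrails can be sketched as follows. Everything here is an assumption for illustration: the four-characters-per-token estimate is a crude stand-in for a real tokenizer, the 80% threshold and 2000-character cap are arbitrary defaults, and plain head truncation is used in place of the summary + links form suggested above.

```python
from typing import Iterable


def estimate_tokens(text: str) -> int:
    """Crude heuristic (about 4 characters per token); swap in a real tokenizer."""
    return len(text) // 4


def preflight(messages: Iterable[str], context_window: int, threshold: float = 0.8) -> str:
    """Return "compact" when the estimated prompt exceeds the budget, else "ok"."""
    used = sum(estimate_tokens(m) for m in messages)
    limit = int(context_window * threshold)
    return "compact" if used > limit else "ok"


def trim_tool_result(text: str, max_chars: int = 2000, archive_ref: str = "") -> str:
    """Keep the head of an oversized tool result and note where the raw copy lives."""
    if len(text) <= max_chars:
        return text
    omitted = len(text) - max_chars
    note = f"\n[truncated {omitted} chars"
    if archive_ref:
        note += f"; raw output archived at {archive_ref}"
    return text[:max_chars] + note + "]"
```

Running `trim_tool_result` when a tool result is persisted, and `preflight` before each model call, would keep history growth bounded and turn an eventual overflow into an explicit compaction step.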
Why this should be tracked as one umbrella issue
These are not isolated edge cases; they form one reliability chain:
- tool/MCP loops increase context and user confusion
- context grows silently
- empty replies appear without clear guidance
- restart/disconnect events are not proactively communicated
- users cannot distinguish "busy" vs "broken"
Solving only one node (e.g., silent overflow detection) helps, but does not fully resolve production UX.
Suggested implementation order
- merge silent-overflow handling (`length + output=0`): immediate pain relief
- add preflight token guard + big `toolResult` truncation policy
- add timeout-triggered screenshot/state diagnosis for MCP/browser flows
- add lifecycle pre-notification hooks (restart/disconnect/recover)
- add channel-friendly runtime status API
Offer to help
If useful, we can provide:
- sanitized logs with token progression timeline
- reproducible MCP timeout-loop traces
- before/after behavior from plugin-level mitigations (QQ integration)
Thanks again for the project and for considering this comprehensive feedback.