[Bug] Main session can brick on arcee/trinity-large-thinking: repeated connection errors, TUI spam, Telegram nonresponsive until delayed fallback #62847

@Adam-Researchh

Description

Summary

Switching the main session model to arcee/trinity-large-thinking can brick the session: the TUI repeatedly shows connection error, Telegram becomes nonresponsive, and the main lane stalls until fallback finally fires ~20-24s later. The same model can work in other contexts, which makes this a dangerous weak link in the main-session path.

This is not fixed by uncapping output tokens in config and restarting.

Severity

High. This degrades the primary interactive session path and effectively wedges both the local TUI and Telegram responsiveness whenever Trinity is selected as the main session model.

Environment

  • OpenClaw runtime: main agent session on macOS
  • Model: arcee/trinity-large-thinking (alias trinity)
  • Provider: arcee
  • Session type affected: main session
  • Observed date: 2026-04-07

What we already ruled out

We explicitly tested the obvious config variable and ruled it out:

  • Trinity config persists correctly
  • the output token cap was removed (uncapped) in config
  • the gateway restarted cleanly
  • the failure still persists

So this is not just a token-cap config issue.
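For reference, the uncapped-output test was along these lines (key names are illustrative only; the actual OpenClaw config schema may differ):

```json
{
  "models": {
    "arcee/trinity-large-thinking": {
      "alias": "trinity",
      "maxOutputTokens": null
    }
  }
}
```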

User-visible behavior

  1. Switch main session to Trinity
  2. TUI starts spamming connection error
  3. Telegram goes dark / nonresponsive
  4. Session appears bricked until fallback eventually succeeds on another model

This is not a graceful provider failure. It is a main-session stability failure.

Fresh log evidence (last ~15 min after uncap + restart)

From ~/.openclaw/logs/gateway.err.log:

2026-04-07T22:15:40.639-04:00 [agent] embedded run agent end: runId=c40546af-1962-4622-ac32-cff3b3006ba9 isError=true model=trinity-large-thinking provider=arcee error=LLM request failed: network connection error. rawError=Connection error.
2026-04-07T22:15:44.125-04:00 [agent] embedded run agent end: runId=c40546af-1962-4622-ac32-cff3b3006ba9 isError=true model=trinity-large-thinking provider=arcee error=LLM request failed: network connection error. rawError=Connection error.
2026-04-07T22:15:49.538-04:00 [agent] embedded run agent end: runId=c40546af-1962-4622-ac32-cff3b3006ba9 isError=true model=trinity-large-thinking provider=arcee error=LLM request failed: network connection error. rawError=Connection error.
2026-04-07T22:16:00.614-04:00 [agent] embedded run agent end: runId=c40546af-1962-4622-ac32-cff3b3006ba9 isError=true model=trinity-large-thinking provider=arcee error=LLM request failed: network connection error. rawError=Connection error.
2026-04-07T22:16:54.442-04:00 [model-fallback] model fallback decision: decision=candidate_failed requested=arcee/trinity-large-thinking candidate=arcee/trinity-large-thinking reason=overloaded next=openai-codex/gpt-5.4
2026-04-07T22:17:06.013-04:00 [model-fallback] model fallback decision: decision=candidate_succeeded requested=arcee/trinity-large-thinking candidate=openai-codex/gpt-5.4 reason=unknown next=none

2026-04-07T22:17:10.473-04:00 [agent] embedded run agent end: runId=7b00bf0c-c380-43a3-ab31-93cef91d2346 isError=true model=trinity-large-thinking provider=arcee error=LLM request failed: network connection error. rawError=Connection error.
2026-04-07T22:17:13.901-04:00 [agent] embedded run agent end: runId=7b00bf0c-c380-43a3-ab31-93cef91d2346 isError=true model=trinity-large-thinking provider=arcee error=LLM request failed: network connection error. rawError=Connection error.
2026-04-07T22:17:19.323-04:00 [agent] embedded run agent end: runId=7b00bf0c-c380-43a3-ab31-93cef91d2346 isError=true model=trinity-large-thinking provider=arcee error=LLM request failed: network connection error. rawError=Connection error.
2026-04-07T22:17:28.686-04:00 [agent] embedded run agent end: runId=7b00bf0c-c380-43a3-ab31-93cef91d2346 isError=true model=trinity-large-thinking provider=arcee error=LLM request failed: network connection error. rawError=Connection error.
2026-04-07T22:17:29.704-04:00 [agent] embedded run failover decision: runId=7b00bf0c-c380-43a3-ab31-93cef91d2346 stage=assistant decision=fallback_model reason=timeout provider=arcee/trinity-large-thinking profile=-
2026-04-07T22:17:29.705-04:00 [diagnostic] lane task error: lane=main durationMs=23552 error="FailoverError: LLM request failed: network connection error."
2026-04-07T22:17:29.705-04:00 [diagnostic] lane task error: lane=session:agent:main:main durationMs=23553 error="FailoverError: LLM request failed: network connection error."
2026-04-07T22:17:29.706-04:00 [model-fallback] model fallback decision: decision=candidate_failed requested=arcee/trinity-large-thinking candidate=arcee/trinity-large-thinking reason=timeout next=openai-codex/gpt-5.4
2026-04-07T22:17:39.723-04:00 [model-fallback] model fallback decision: decision=candidate_succeeded requested=arcee/trinity-large-thinking candidate=openai-codex/gpt-5.4 reason=unknown next=none

Important detail

Earlier failures also presented as Internal server error, then later as repeated network connection error / timeout behavior. So the failure mode appears to have shifted, but the main-session brick remains.

Example earlier same-day evidence:

2026-04-07T20:40:19.906-04:00 [agent] embedded run failover decision: runId=9edf7806-201c-4d5f-a565-10a70c454af2 stage=assistant decision=fallback_model reason=timeout provider=arcee/trinity-large-thinking profile=-
2026-04-07T20:42:25.974-04:00 [agent] embedded run failover decision: runId=7218d66b-dc0c-415d-8753-7e90d777cf2a stage=assistant decision=fallback_model reason=timeout provider=arcee/trinity-large-thinking profile=-

Why this is bad

The current behavior does not fail fast and recover cleanly. Instead it:

  • retries repeatedly in the main interactive lane,
  • surfaces repeated connection errors to the TUI,
  • starves responsiveness on Telegram,
  • and only later falls back.

That means one unstable model/provider pairing can effectively poison the main session UX.

Strong suspicion / likely failure area

One or more of these is still wrong in the main-session path:

  1. main-session handling of Trinity/Arcee failures is too sticky and does not fail fast,
  2. a hidden runtime cap or request shaping difference still exists in the main lane,
  3. Trinity response handling in the main lane differs from the subagent lane,
  4. provider transport errors are not being isolated from the user-facing session loop.
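Suspicion (4) is the kind of thing a hard per-request deadline around the provider call would rule out. A minimal sketch, with hypothetical names (this is not the OpenClaw API): treat any transport-level error as a quick-retry-then-fallback signal, so the interactive lane never burns 20+ seconds before the router can react.

```python
import time

class TransportError(Exception):
    """Connection-level failure (distinct from a model/content error)."""

def call_with_deadline(provider_call, deadline_s=5.0, max_attempts=2):
    """Call the provider, but give up fast: a couple of quick attempts
    inside a hard deadline, then tell the caller to fall back.
    Returns (result, None) on success or (None, reason) on failure."""
    start = time.monotonic()
    for _ in range(max_attempts):
        if time.monotonic() - start > deadline_s:
            return None, "timeout"
        try:
            return provider_call(), None
        except TransportError:
            continue  # transient; one more quick try, never past the deadline
    return None, "transport_error"
```

With a wrapper like this, the four back-to-back `Connection error` retries in the logs above would collapse into one fast `transport_error` signal to the fallback router.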

The key point: OpenClaw should not allow a model switch to brick the primary session experience.

Expected behavior

If Trinity/Arcee is unhealthy for a main session request, OpenClaw should:

  • fail fast,
  • mark the candidate unhealthy,
  • fall back immediately,
  • keep the TUI responsive,
  • keep Telegram responsive,
  • and stop surfacing repeated connection-error spam.
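The "mark the candidate unhealthy" step could be as simple as a cooldown cache consulted before routing each turn. A hedged sketch (all names hypothetical; this is not OpenClaw's actual router), using an injectable clock so the behavior is testable:

```python
import time

class CandidateHealth:
    """Track recently failed model candidates so the router can skip
    them immediately instead of retrying in the interactive lane."""

    def __init__(self, cooldown_s=60.0, clock=time.monotonic):
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._failed_at = {}  # candidate name -> timestamp of last failure

    def mark_unhealthy(self, candidate):
        self._failed_at[candidate] = self._clock()

    def is_healthy(self, candidate):
        failed = self._failed_at.get(candidate)
        if failed is None:
            return True
        return (self._clock() - failed) >= self._cooldown_s

    def pick(self, candidates):
        """Return the first healthy candidate; fall back to the last
        one so the session is never left with no model at all."""
        for c in candidates:
            if self.is_healthy(c):
                return c
        return candidates[-1]
```

With something like this, after the first timeout on arcee/trinity-large-thinking the router would jump straight to the fallback model for the cooldown window, instead of re-probing the broken candidate on every turn.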

Repro steps

  1. Configure arcee/trinity-large-thinking
  2. Switch the main session model to Trinity
  3. Send a normal main-session prompt
  4. Observe repeated connection error in TUI and stalled Telegram responsiveness
  5. Wait ~20-24s for eventual fallback to another model

Request

Please treat this as a stability bug in the main-session lane, not a cosmetic provider hiccup. A broken model/provider should degrade gracefully, not wedge the user’s primary session surfaces.
