[Bug] Main session can brick on arcee/trinity-large-thinking: repeated connection errors, TUI spam, Telegram nonresponsive until delayed fallback #62847

@Adam-Researchh

Description

Summary

Switching the main session model to arcee/trinity-large-thinking can brick the session: the TUI repeatedly shows connection error, Telegram becomes nonresponsive, and the main lane stalls until fallback finally fires ~20-24s later. The same model can work in other contexts, which makes this a dangerous weak link in the main-session path.

This is not fixed by uncapping output tokens in config and restarting.

Severity

High. This degrades the primary interactive session path and effectively wedges both the local TUI and Telegram responsiveness whenever Trinity is selected as the main session model.

Environment

  • OpenClaw runtime: main agent session on macOS
  • Model: arcee/trinity-large-thinking (alias trinity)
  • Provider: arcee
  • Session type affected: main session
  • Observed date: 2026-04-07

What we already ruled out

We explicitly tested the obvious config variable and ruled it out:

  • Trinity config persists correctly
  • the output token cap was removed (uncapped) in config
  • the gateway restarted cleanly
  • the failure still persists

So this is not just a token-cap config issue.
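For reference, the uncapped-output test was along these lines (key names are illustrative only; the actual OpenClaw config schema may differ):

```json
{
  "models": {
    "arcee/trinity-large-thinking": {
      "alias": "trinity",
      "maxOutputTokens": null
    }
  }
}
```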

User-visible behavior

  1. Switch main session to Trinity
  2. TUI starts spamming connection error
  3. Telegram goes dark / nonresponsive
  4. Session appears bricked until fallback eventually succeeds on another model

This is not a graceful provider failure. It is a main-session stability failure.

Fresh log evidence (last ~15 min after uncap + restart)

From ~/.openclaw/logs/gateway.err.log:

2026-04-07T22:15:40.639-04:00 [agent] embedded run agent end: runId=c40546af-1962-4622-ac32-cff3b3006ba9 isError=true model=trinity-large-thinking provider=arcee error=LLM request failed: network connection error. rawError=Connection error.
2026-04-07T22:15:44.125-04:00 [agent] embedded run agent end: runId=c40546af-1962-4622-ac32-cff3b3006ba9 isError=true model=trinity-large-thinking provider=arcee error=LLM request failed: network connection error. rawError=Connection error.
2026-04-07T22:15:49.538-04:00 [agent] embedded run agent end: runId=c40546af-1962-4622-ac32-cff3b3006ba9 isError=true model=trinity-large-thinking provider=arcee error=LLM request failed: network connection error. rawError=Connection error.
2026-04-07T22:16:00.614-04:00 [agent] embedded run agent end: runId=c40546af-1962-4622-ac32-cff3b3006ba9 isError=true model=trinity-large-thinking provider=arcee error=LLM request failed: network connection error. rawError=Connection error.
2026-04-07T22:16:54.442-04:00 [model-fallback] model fallback decision: decision=candidate_failed requested=arcee/trinity-large-thinking candidate=arcee/trinity-large-thinking reason=overloaded next=openai-codex/gpt-5.4
2026-04-07T22:17:06.013-04:00 [model-fallback] model fallback decision: decision=candidate_succeeded requested=arcee/trinity-large-thinking candidate=openai-codex/gpt-5.4 reason=unknown next=none

2026-04-07T22:17:10.473-04:00 [agent] embedded run agent end: runId=7b00bf0c-c380-43a3-ab31-93cef91d2346 isError=true model=trinity-large-thinking provider=arcee error=LLM request failed: network connection error. rawError=Connection error.
2026-04-07T22:17:13.901-04:00 [agent] embedded run agent end: runId=7b00bf0c-c380-43a3-ab31-93cef91d2346 isError=true model=trinity-large-thinking provider=arcee error=LLM request failed: network connection error. rawError=Connection error.
2026-04-07T22:17:19.323-04:00 [agent] embedded run agent end: runId=7b00bf0c-c380-43a3-ab31-93cef91d2346 isError=true model=trinity-large-thinking provider=arcee error=LLM request failed: network connection error. rawError=Connection error.
2026-04-07T22:17:28.686-04:00 [agent] embedded run agent end: runId=7b00bf0c-c380-43a3-ab31-93cef91d2346 isError=true model=trinity-large-thinking provider=arcee error=LLM request failed: network connection error. rawError=Connection error.
2026-04-07T22:17:29.704-04:00 [agent] embedded run failover decision: runId=7b00bf0c-c380-43a3-ab31-93cef91d2346 stage=assistant decision=fallback_model reason=timeout provider=arcee/trinity-large-thinking profile=-
2026-04-07T22:17:29.705-04:00 [diagnostic] lane task error: lane=main durationMs=23552 error="FailoverError: LLM request failed: network connection error."
2026-04-07T22:17:29.705-04:00 [diagnostic] lane task error: lane=session:agent:main:main durationMs=23553 error="FailoverError: LLM request failed: network connection error."
2026-04-07T22:17:29.706-04:00 [model-fallback] model fallback decision: decision=candidate_failed requested=arcee/trinity-large-thinking candidate=arcee/trinity-large-thinking reason=timeout next=openai-codex/gpt-5.4
2026-04-07T22:17:39.723-04:00 [model-fallback] model fallback decision: decision=candidate_succeeded requested=arcee/trinity-large-thinking candidate=openai-codex/gpt-5.4 reason=unknown next=none

Important detail

Earlier failures also presented as Internal server error, then later as repeated network connection error / timeout behavior. So the failure mode appears to have shifted, but the main-session brick remains.

Example earlier same-day evidence:

2026-04-07T20:40:19.906-04:00 [agent] embedded run failover decision: runId=9edf7806-201c-4d5f-a565-10a70c454af2 stage=assistant decision=fallback_model reason=timeout provider=arcee/trinity-large-thinking profile=-
2026-04-07T20:42:25.974-04:00 [agent] embedded run failover decision: runId=7218d66b-dc0c-415d-8753-7e90d777cf2a stage=assistant decision=fallback_model reason=timeout provider=arcee/trinity-large-thinking profile=-

Why this is bad

The current behavior does not fail fast and recover cleanly. Instead it:

  • retries repeatedly in the main interactive lane,
  • surfaces repeated connection errors to the TUI,
  • starves responsiveness on Telegram,
  • and only later falls back.

That means one unstable model/provider pairing can effectively poison the main session UX.

Strong suspicion / likely failure area

One or more of these is still wrong in the main-session path:

  1. main-session handling of Trinity/Arcee failures is too sticky and does not fail fast,
  2. a hidden runtime cap or request shaping difference still exists in the main lane,
  3. Trinity response handling in the main lane differs from the subagent lane,
  4. provider transport errors are not being isolated from the user-facing session loop.
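Suspicion (4) is the kind of thing a hard per-request deadline around the provider call would rule out. A minimal sketch, with hypothetical names (this is not the OpenClaw API): treat any transport-level error as a quick-retry-then-fallback signal, so the interactive lane never burns 20+ seconds before the router can react.

```python
import time

class TransportError(Exception):
    """Connection-level failure (distinct from a model/content error)."""

def call_with_deadline(provider_call, deadline_s=5.0, max_attempts=2):
    """Call the provider, but give up fast: a couple of quick attempts
    inside a hard deadline, then tell the caller to fall back.
    Returns (result, None) on success or (None, reason) on failure."""
    start = time.monotonic()
    for _ in range(max_attempts):
        if time.monotonic() - start > deadline_s:
            return None, "timeout"
        try:
            return provider_call(), None
        except TransportError:
            continue  # transient; one more quick try, never past the deadline
    return None, "transport_error"
```

With a wrapper like this, the four back-to-back `Connection error` retries in the logs above would collapse into one fast `transport_error` signal to the fallback router.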

The key point: OpenClaw should not allow a model switch to brick the primary session experience.

Expected behavior

If Trinity/Arcee is unhealthy for a main session request, OpenClaw should:

  • fail fast,
  • mark the candidate unhealthy,
  • fall back immediately,
  • keep the TUI responsive,
  • keep Telegram responsive,
  • and stop surfacing repeated connection-error spam.
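The "mark the candidate unhealthy" step could be as simple as a cooldown cache consulted before routing each turn. A hedged sketch (all names hypothetical; this is not OpenClaw's actual router), using an injectable clock so the behavior is testable:

```python
import time

class CandidateHealth:
    """Track recently failed model candidates so the router can skip
    them immediately instead of retrying in the interactive lane."""

    def __init__(self, cooldown_s=60.0, clock=time.monotonic):
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._failed_at = {}  # candidate name -> timestamp of last failure

    def mark_unhealthy(self, candidate):
        self._failed_at[candidate] = self._clock()

    def is_healthy(self, candidate):
        failed = self._failed_at.get(candidate)
        if failed is None:
            return True
        return (self._clock() - failed) >= self._cooldown_s

    def pick(self, candidates):
        """Return the first healthy candidate; fall back to the last
        one so the session is never left with no model at all."""
        for c in candidates:
            if self.is_healthy(c):
                return c
        return candidates[-1]
```

With something like this, after the first timeout on arcee/trinity-large-thinking the router would jump straight to the fallback model for the cooldown window, instead of re-probing the broken candidate on every turn.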

Repro steps

  1. Configure arcee/trinity-large-thinking
  2. Switch the main session model to Trinity
  3. Send a normal main-session prompt
  4. Observe repeated connection error in TUI and stalled Telegram responsiveness
  5. Wait ~20-24s for eventual fallback to another model

Request

Please treat this as a stability bug in the main-session lane, not a cosmetic provider hiccup. A broken model/provider should degrade gracefully, not wedge the user’s primary session surfaces.
