Bug: Compaction failure leaves session in permanent failed state with no automatic recovery
Summary
When session compaction fails (for any reason: timeout, API error, quota exceeded, model not supporting reasoning, etc.), the session enters status=failed permanently. There is no watchdog or automatic recovery mechanism — the channel becomes completely unresponsive (messages show as read but never get a reply), and the only workaround is manual intervention.
Steps to reproduce
- Have a session that grows large enough to trigger compaction (~500KB+ JSONL)
- Compaction fails (e.g., model returns error, or times out)
- Session status becomes "failed"
- All subsequent messages to that channel receive no response
- No automatic escalation, no fallback session, no user notification
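The failure mode described above can be illustrated with a minimal sketch (hypothetical `Session` class and field names — the real gateway's session model may differ):

```python
from dataclasses import dataclass, field

@dataclass
class Session:
    """Minimal stand-in for a gateway session."""
    id: str
    status: str = "active"  # "active" | "failed"
    replies: list = field(default_factory=list)

    def handle_message(self, text: str):
        # Once a failed compaction flips the status, every subsequent
        # message is silently dropped -- this is the reported bug.
        if self.status == "failed":
            return None
        reply = f"ack: {text}"
        self.replies.append(reply)
        return reply

s = Session(id="e1bc0eb2")
assert s.handle_message("hello") == "ack: hello"
s.status = "failed"                            # compaction failure flips the flag...
assert s.handle_message("still there?") is None  # ...and the channel goes silent
```

Nothing ever resets `status`, so without external intervention the session stays dead forever.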
Observed log
Session e1bc0eb2 at 04:11-04:14:
- Compaction triggered on large session
- Fallback model returned 400 (reasoning required but disabled in system prompt)
- Multiple retry attempts → 429 rate limit
- Session grew even larger from error messages
- Compaction failed → status=failed → no further responses
- Auto-recovery via external heartbeat watchdog was the only way out
Expected behavior
Compaction failure should have automatic escalation:
- Try alternative compaction model
- If all models fail, create a new session automatically and notify the user
- Never leave a session permanently dead with no response
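The escalation chain above could be sketched roughly as follows (a sketch only: `compact`, `create_new_session`, and `notify_user` are assumed hooks, not real gateway APIs):

```python
def compact_with_escalation(session, models, create_new_session, notify_user):
    """Try each compaction model in order; if every model fails,
    rotate the session instead of leaving the channel dead."""
    for model in models:
        try:
            return model.compact(session)  # success: compacted session
        except Exception:
            continue                       # escalate to the next model
    # All models failed: never leave the session permanently unresponsive.
    fresh = create_new_session(session["channel"])
    notify_user(session["channel"],
                "Compaction failed on all models; a new session was started. "
                "Your previous history was backed up.")
    return fresh
```

The key property is the unconditional fallback at the end: the function always returns a usable session, so a compaction failure can degrade the experience but never silence the channel.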
Impact
- High: Users experience "read but no reply" with no explanation
- Perceived data loss: users don't know their history was preserved in .bak.recovered.* files
- Depends on external heartbeat watchdog to recover — not a proper fix
References
- status field values ("failed", "timeout", "done") mislead agents into spawning duplicate sessions #64103
Suggested fix
Add a compaction-failure watchdog in the gateway that:
- Detects when compaction has failed
- Automatically creates a new session for the channel
- Optionally preserves a backup of the old session file
- Notifies the user that a new session was started
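A single pass of such a watchdog might look like this (hypothetical hook names throughout — `rotate_session`, `backup`, and `notify` stand in for whatever the gateway actually exposes); the gateway would run it on a timer, e.g. every 30 seconds:

```python
def sweep_failed_sessions(sessions, rotate_session, backup, notify):
    """One watchdog pass: back up, rotate, and notify for every session
    stuck in status='failed'. Returns (old_id, new_id) pairs recovered."""
    recovered = []
    for session in sessions:
        if session.status == "failed":
            backup(session)  # preserve the old JSONL, e.g. as *.bak.recovered.*
            fresh = rotate_session(session.channel)
            notify(session.channel,
                   f"Previous session {session.id} failed during compaction; "
                   f"started new session {fresh.id}. History was backed up.")
            recovered.append((session.id, fresh.id))
    return recovered
```

Making this part of the gateway itself (rather than an external heartbeat watchdog) keeps recovery self-contained and guarantees the user is told what happened.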