Summary
We need a repeatable evaluation harness that shows whether our context-management approach actually helps on long coding sessions, especially compared with a simpler "summarize old messages" baseline.
This issue is about building that evidence in a way that other agents can run and compare over time.
Scope
- Add long-session eval scenarios that stress:
  - heavy tool use
  - long back-and-forth coding sessions
  - resume/fork flows
  - compaction during active work
  - recovery after tool errors or interruptions
- Add a simple baseline compaction mode for comparison in evals
- Measure continuity and usefulness, for example:
  - does the agent remember the right files and goals?
  - does it recover cleanly after context refresh?
  - does it keep track of recent tool work?
- Produce a human-readable report that makes regressions easy to spot
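
As a starting point for the baseline mode and the continuity measurements listed above, here is a minimal sketch. Every name in it (`Message`, `baseline_compact`, `continuity_score`) is hypothetical and not taken from `crates/tui/src/compaction.rs` or `crates/tui/src/eval.rs`. It assumes the baseline keeps the most recent messages verbatim and folds older ones into one summary message. Truncation stands in for a model-written summary so the result is deterministic.

```rust
/// Hypothetical message type; the real session types may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    pub role: String,
    pub text: String,
}

/// Baseline compaction (assumed design): keep the last `keep_recent`
/// messages verbatim and collapse everything older into one summary message.
pub fn baseline_compact(history: &[Message], keep_recent: usize) -> Vec<Message> {
    if history.len() <= keep_recent {
        return history.to_vec();
    }
    let split = history.len() - keep_recent;
    let (old, recent) = history.split_at(split);

    // Stand-in for a model-written summary: the first 40 characters of each
    // old message, so the eval stays deterministic.
    let summary = old
        .iter()
        .map(|m| {
            let head: String = m.text.chars().take(40).collect();
            format!("{}: {}", m.role, head)
        })
        .collect::<Vec<_>>()
        .join(" | ");

    let mut out = Vec::with_capacity(keep_recent + 1);
    out.push(Message {
        role: "system".to_string(),
        text: format!("[summary of {} earlier messages] {}", old.len(), summary),
    });
    out.extend_from_slice(recent);
    out
}

/// Continuity score: the fraction of expected facts (file paths, goals)
/// that still appear somewhere in the compacted context.
pub fn continuity_score(context: &[Message], expected: &[&str]) -> f64 {
    if expected.is_empty() {
        return 1.0;
    }
    let hits = expected
        .iter()
        .filter(|fact| context.iter().any(|m| m.text.contains(**fact)))
        .count();
    hits as f64 / expected.len() as f64
}
```

A scenario could run the same scripted history through this baseline and through the real compaction path, then compare the two scores. Substring matching is a crude proxy for "remembers the right files and goals" and would likely need a stronger check in practice.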
Acceptance Criteria
- Eval scenarios are automated and reproducible
- There is at least one baseline path to compare against
- Results are summarized in plain language, not just raw counters
- The repo includes guidance for running the evals locally and in CI
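
To illustrate "plain language, not just raw counters", the sketch below turns per-scenario scores into one verdict line each plus a closing summary. `ScenarioResult`, `render_report`, and the `tolerance` parameter are assumptions made for illustration, not existing harness APIs. Scores are assumed to be fractions between 0 and 1.

```rust
/// Hypothetical per-scenario result: one score for the baseline path
/// and one for the context-management path under test.
pub struct ScenarioResult {
    pub name: String,
    pub baseline: f64,
    pub candidate: f64,
}

/// Render a plain-language report. A scenario counts as a regression when
/// the candidate trails the baseline by more than `tolerance`.
pub fn render_report(results: &[ScenarioResult], tolerance: f64) -> String {
    let mut lines = Vec::new();
    let mut regressions = 0;
    for r in results {
        let delta = r.candidate - r.baseline;
        let verdict = if delta < -tolerance {
            regressions += 1;
            "REGRESSION"
        } else if delta > tolerance {
            "improved"
        } else {
            "unchanged"
        };
        lines.push(format!(
            "{}: {} (baseline {:.0}%, candidate {:.0}%)",
            r.name,
            verdict,
            r.baseline * 100.0,
            r.candidate * 100.0
        ));
    }
    lines.push(format!(
        "{} of {} scenarios regressed against the baseline",
        regressions,
        results.len()
    ));
    lines.join("\n")
}
```

In CI, a job could fail whenever the report contains a `REGRESSION` line, which keeps the pass/fail signal and the human-readable summary in the same output.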
Starting Points
- crates/tui/src/eval.rs
- crates/tui/tests/eval_harness.rs
- docs/capacity_controller.md
- crates/tui/src/compaction.rs