Roadmap: build long-session evals for coherence and context handling

## Summary
We need a repeatable evaluation harness that shows whether our context-management approach actually helps on long coding sessions, especially compared with a simpler "summarize old messages" baseline.

This issue is about building that evidence in a way that other agents can run and compare over time.

## Scope
- Add long-session eval scenarios that stress:
  - heavy tool use
  - long back-and-forth coding sessions
  - resume/fork flows
  - compaction during active work
  - recovery after tool errors or interruptions
- Add a simple baseline compaction mode for comparison in evals
- Measure continuity and usefulness, for example:
  - does the agent remember the right files and goals?
  - does it recover cleanly after context refresh?
  - does it keep track of recent tool work?
- Produce a human-readable report that makes regressions easy to spot
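
To make the baseline and the continuity checks above concrete, here is a minimal sketch in Rust. Every name in it (`Message`, `baseline_compact`, `continuity_score`) is hypothetical and is not taken from `crates/tui/src/eval.rs` or `crates/tui/src/compaction.rs`. The "summary" is a naive per-message truncation so the sketch stays deterministic; a real baseline would presumably ask a model to summarize instead.

```rust
/// One turn in a recorded session.
/// Hypothetical shape for illustration, not the type used in the repo.
#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    pub role: String,
    pub text: String,
}

/// Baseline compaction: keep the last `keep_recent` messages verbatim and
/// collapse everything older into one summary message.
/// Assumption: truncating each old message to `max_chars_per_msg` characters
/// stands in for a real summarizer.
pub fn baseline_compact(
    history: &[Message],
    keep_recent: usize,
    max_chars_per_msg: usize,
) -> Vec<Message> {
    // Nothing to compact if the whole history fits in the "recent" window.
    if history.len() <= keep_recent {
        return history.to_vec();
    }
    let (old, recent) = history.split_at(history.len() - keep_recent);

    let summary = old
        .iter()
        .map(|m| {
            let head: String = m.text.chars().take(max_chars_per_msg).collect();
            format!("{}: {}", m.role, head)
        })
        .collect::<Vec<_>>()
        .join("\n");

    let mut out = Vec::with_capacity(recent.len() + 1);
    out.push(Message {
        role: "system".to_string(),
        text: format!("Summary of earlier messages:\n{}", summary),
    });
    out.extend_from_slice(recent);
    out
}

/// Fraction of expected facts (file paths, goals, recent tool results) that
/// still appear somewhere in the context after compaction.
/// Returns 1.0 when there is nothing to check.
pub fn continuity_score(context: &[Message], expected_facts: &[&str]) -> f64 {
    if expected_facts.is_empty() {
        return 1.0;
    }
    let retained = expected_facts
        .iter()
        .filter(|fact| context.iter().any(|m| m.text.contains(**fact)))
        .count();
    retained as f64 / expected_facts.len() as f64
}
```

A scenario could run the same recorded session through this baseline and through the real compaction path, then compare the two scores. Substring matching is a crude proxy for "remembers the right files and goals"; it is used here only because it is cheap and reproducible.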

## Acceptance Criteria
- Eval scenarios are automated and reproducible
- There is at least one baseline path to compare against
- Results are summarized in plain language, not just raw counters
- The repo includes guidance for running the evals locally and in CI
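
As one possible reading of "plain language, not just raw counters", the sketch below turns per-scenario scores into one sentence each, plus a headline regression count. The types and function names (`ScenarioResult`, `describe`, `render_report`) and the tolerance parameter are assumptions made for illustration, not an existing API in this repo.

```rust
/// Scores for one eval scenario, each in the range 0.0 to 1.0.
/// Hypothetical structure: `baseline` is the simple compaction mode,
/// `candidate` is the context-management approach under test.
pub struct ScenarioResult {
    pub name: String,
    pub baseline: f64,
    pub candidate: f64,
}

/// True when the candidate scores worse than the baseline by more than
/// `tolerance`.
pub fn is_regression(result: &ScenarioResult, tolerance: f64) -> bool {
    result.candidate - result.baseline < -tolerance
}

/// One plain-language sentence for a single scenario.
pub fn describe(result: &ScenarioResult, tolerance: f64) -> String {
    let delta = result.candidate - result.baseline;
    let verdict = if delta > tolerance {
        "improved over"
    } else if is_regression(result, tolerance) {
        "REGRESSED against"
    } else {
        "matched"
    };
    format!(
        "{}: candidate {} baseline ({:.0}% vs {:.0}% of expected facts retained)",
        result.name,
        verdict,
        result.candidate * 100.0,
        result.baseline * 100.0
    )
}

/// Full report: a headline with the regression count, then one line per
/// scenario, so regressions are visible without reading raw numbers.
pub fn render_report(results: &[ScenarioResult], tolerance: f64) -> String {
    let regressions = results
        .iter()
        .filter(|r| is_regression(r, tolerance))
        .count();
    let mut lines = vec![format!(
        "{} of {} scenarios regressed",
        regressions,
        results.len()
    )];
    lines.extend(results.iter().map(|r| describe(r, tolerance)));
    lines.join("\n")
}
```

A CI job could fail the build when the headline count is non-zero and attach the rendered text as the job summary. The tolerance exists because scores on long sessions are likely to be noisy; the right value is something the evals themselves would need to establish.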

## Starting Points
- `crates/tui/src/eval.rs`
- `crates/tui/tests/eval_harness.rs`
- `docs/capacity_controller.md`
- `crates/tui/src/compaction.rs`

