Roadmap: build long-session evals for coherence and context handling #7

@Hmbown

Summary

We need a repeatable evaluation harness that shows whether our context-management approach actually helps in long coding sessions, especially compared with a simpler "summarize old messages" baseline.

This issue is about building that evidence in a way that other agents can run and compare over time.

Scope

  • Add long-session eval scenarios that stress:
    • heavy tool use
    • long back-and-forth coding sessions
    • resume/fork flows
    • compaction during active work
    • recovery after tool errors or interruptions
  • Add a simple baseline compaction mode for comparison in evals
  • Measure continuity and usefulness, for example:
    • does the agent remember the right files and goals?
    • does it recover cleanly after context refresh?
    • does it keep track of recent tool work?
  • Produce a human-readable report that makes regressions easy to spot
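
To make the baseline and one continuity measurement concrete, here is a minimal sketch. Every name in it (`Message`, `baseline_compact`, `recall_score`) is hypothetical and not taken from `crates/tui/src/compaction.rs` or `crates/tui/src/eval.rs`. The summary is a placeholder string, where a real baseline would presumably ask a model to summarize the old messages.

```rust
/// Hypothetical message type for illustration; not the repo's real type.
#[derive(Debug, Clone, PartialEq)]
struct Message {
    role: String,
    text: String,
}

/// Baseline compaction (assumed behavior): keep the last `keep_recent`
/// messages verbatim and collapse everything older into one summary message.
/// The summary text is a stub standing in for a model-generated summary.
fn baseline_compact(history: &[Message], keep_recent: usize) -> Vec<Message> {
    if history.len() <= keep_recent {
        // Nothing old enough to compact.
        return history.to_vec();
    }
    let split = history.len() - keep_recent;
    let (old, recent) = history.split_at(split);
    let summary = Message {
        role: "system".to_string(),
        text: format!("[summary of {} earlier messages]", old.len()),
    };
    let mut out = Vec::with_capacity(recent.len() + 1);
    out.push(summary);
    out.extend_from_slice(recent);
    out
}

/// Continuity check: the fraction of expected items (file paths, goals)
/// that still appear somewhere in the context after compaction.
fn recall_score(context: &[Message], expected: &[&str]) -> f64 {
    if expected.is_empty() {
        return 1.0;
    }
    let found = expected
        .iter()
        .filter(|item| context.iter().any(|m| m.text.contains(**item)))
        .count();
    found as f64 / expected.len() as f64
}
```

In an eval scenario, the same `recall_score` call could be run against the context produced by each compaction mode, so the two paths are compared on identical expectations. Substring matching is a deliberately crude proxy. It only checks that a file or goal is still mentioned, not that the agent acts on it.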

Acceptance Criteria

  • Eval scenarios are automated and reproducible
  • There is at least one baseline path to compare against
  • Results are summarized in plain language, not just raw counters
  • The repo includes guidance for running the evals locally and in CI
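
The "plain language, not just raw counters" criterion could be met with something like the following sketch, which turns per-scenario scores into sentences and flags regressions against the baseline. The `ScenarioResult` type, the 0.0 to 1.0 score range, and the `tolerance` noise margin are all assumptions made for illustration, not existing parts of the harness.

```rust
/// Hypothetical per-scenario scores, each assumed to lie in 0.0..=1.0.
struct ScenarioResult {
    name: String,
    baseline: f64,
    candidate: f64,
}

/// Describe one scenario in a sentence. `tolerance` is an assumed noise
/// margin: differences smaller than it are reported as "about the same".
fn describe(r: &ScenarioResult, tolerance: f64) -> String {
    let delta = r.candidate - r.baseline;
    let verdict = if delta < -tolerance {
        "REGRESSION: worse than baseline"
    } else if delta > tolerance {
        "better than baseline"
    } else {
        "about the same as baseline"
    };
    format!(
        "{}: {} (candidate {:.0}%, baseline {:.0}%)",
        r.name,
        verdict,
        r.candidate * 100.0,
        r.baseline * 100.0
    )
}

/// Render the full report: one line per scenario plus a regression count.
fn render_report(results: &[ScenarioResult], tolerance: f64) -> String {
    let mut lines: Vec<String> = results.iter().map(|r| describe(r, tolerance)).collect();
    let regressions = results
        .iter()
        .filter(|r| r.candidate - r.baseline < -tolerance)
        .count();
    lines.push(format!(
        "{} of {} scenarios regressed",
        regressions,
        results.len()
    ));
    lines.join("\n")
}
```

Putting the word REGRESSION at a fixed position in the line keeps the report easy to scan by eye and easy to search for in CI logs. The final count gives a one-line summary that a CI job could also use to decide whether to fail.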

Starting Points

  • crates/tui/src/eval.rs
  • crates/tui/tests/eval_harness.rs
  • docs/capacity_controller.md
  • crates/tui/src/compaction.rs

Metadata

    Labels

    enhancement (New feature or request), help wanted (Extra attention is needed)
