Problem
The codebase has zero benchmarks. This is notable for a project where:
- Token estimation runs on every turn to decide whether to compact context — an inaccurate estimate causes unnecessary compaction or context overflow
- Stream parsing processes every SSE chunk from a 1M-token context window in real time
- Context compaction runs the LLM summarization API call and must complete before the next turn
- Tool result truncation runs on every tool execution for large-output tools
Without benchmarks, performance regressions are invisible until a user reports "the TUI feels sluggish on large sessions."
What to benchmark
1. Token estimation accuracy
Compare estimate_tokens() / estimate_input_tokens_conservative() against a real tokenizer (e.g., tiktoken-rs with the DeepSeek tokenizer). Measure:
- Mean absolute percentage error on representative message corpora
- Wall-clock time per estimation (should be <1ms)
- False-positive and false-negative rates for the "should compact?" decision
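The accuracy metrics above can be computed from (estimated, actual) token-count pairs. The following is a minimal sketch, not the project's code: `mape`, `decision_errors`, and `DecisionErrors` are hypothetical names, and it assumes the "should compact?" decision is a simple comparison against a token threshold. The real benchmark would fill the pairs by running estimate_tokens() and the reference tokenizer over the same corpus.

```rust
/// Error counts for the "should compact?" decision (hypothetical type).
#[derive(Debug, Default, PartialEq)]
pub struct DecisionErrors {
    /// Estimate crossed the threshold, the real count did not.
    pub false_positives: usize,
    /// Real count crossed the threshold, the estimate did not.
    pub false_negatives: usize,
}

/// Mean absolute percentage error over (estimated, actual) pairs.
/// Pairs with `actual == 0` are skipped because the ratio is undefined.
pub fn mape(pairs: &[(usize, usize)]) -> Option<f64> {
    let mut sum = 0.0;
    let mut counted = 0usize;
    for &(estimated, actual) in pairs {
        if actual == 0 {
            continue;
        }
        sum += (estimated as f64 - actual as f64).abs() / actual as f64;
        counted += 1;
    }
    if counted == 0 {
        None
    } else {
        Some(100.0 * sum / counted as f64)
    }
}

/// Counts wrong compaction decisions, assuming compaction triggers
/// when a token count is at or above `threshold`.
pub fn decision_errors(pairs: &[(usize, usize)], threshold: usize) -> DecisionErrors {
    let mut errors = DecisionErrors::default();
    for &(estimated, actual) in pairs {
        match (estimated >= threshold, actual >= threshold) {
            (true, false) => errors.false_positives += 1,
            (false, true) => errors.false_negatives += 1,
            _ => {}
        }
    }
    errors
}
```

A false negative is the costlier error here, since it corresponds to a context overflow rather than an unnecessary compaction, so the two rates are worth reporting separately.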
2. Stream parsing throughput
Feed a large canned SSE stream (100K+ events simulating a long V4 thinking turn) through the stream parsing loop. Measure:
- Events processed per second
- Memory allocation profile
- Tail latency (p99 event processing time)
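As a rough illustration of the throughput and tail-latency measurements, here is a standard-library-only sketch. Everything in it is an assumption made for illustration: `canned_stream`, `parse_event`, `measure`, and `percentile` are hypothetical, the event payload is invented, and `parse_event` merely stands in for the real stream parsing loop. A criterion benchmark would replace the hand-rolled timing.

```rust
use std::time::{Duration, Instant};

/// Builds a canned SSE stream of `n` events. The payload shape is made up.
pub fn canned_stream(n: usize) -> String {
    let mut stream = String::with_capacity(n * 32);
    for i in 0..n {
        stream.push_str("data: {\"delta\":\"tok");
        stream.push_str(&i.to_string());
        stream.push_str("\"}\n\n");
    }
    stream
}

/// Stand-in for the real parser: returns the payload of the first `data:` line.
pub fn parse_event(raw: &str) -> Option<&str> {
    raw.lines().find_map(|line| line.strip_prefix("data: "))
}

/// Nearest-rank percentile over an ascending-sorted slice (`pct` in 0..=100).
pub fn percentile(sorted: &[Duration], pct: usize) -> Duration {
    if sorted.is_empty() {
        return Duration::ZERO;
    }
    let rank = (sorted.len() * pct + 99) / 100; // ceil(n * pct / 100)
    sorted[rank.clamp(1, sorted.len()) - 1]
}

pub struct Report {
    pub events: usize,
    pub total: Duration,
    pub p99: Duration,
}

impl Report {
    pub fn events_per_sec(&self) -> f64 {
        self.events as f64 / self.total.as_secs_f64()
    }
}

/// Parses every event in `stream`, timing each one individually.
pub fn measure(stream: &str) -> Report {
    let mut samples = Vec::new();
    let start = Instant::now();
    for raw in stream.split("\n\n").filter(|event| !event.is_empty()) {
        let event_start = Instant::now();
        // black_box keeps the optimizer from discarding the parse.
        std::hint::black_box(parse_event(raw));
        samples.push(event_start.elapsed());
    }
    let total = start.elapsed();
    samples.sort();
    Report { events: samples.len(), total, p99: percentile(&samples, 99) }
}
```

Per-event `Instant` calls add overhead of their own, so the total time from this loop understates real throughput; measuring throughput and tail latency in separate passes avoids that.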
3. Context compaction end-to-end latency
With a mock LLM client, measure the wall-clock time from "compaction triggered" to "session messages replaced" for varying message counts (100, 500, 1000, 5000).
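One way to isolate compaction overhead from network time is to put the summarization call behind a trait and substitute a mock. The sketch below assumes a synchronous interface and a session modeled as a `Vec<String>`; `Summarizer`, `MockSummarizer`, `compact`, and `time_compaction` are hypothetical names, and the real client is likely async with a richer message type.

```rust
use std::time::{Duration, Instant};

/// Hypothetical abstraction over the LLM summarization call.
pub trait Summarizer {
    fn summarize(&self, messages: &[String]) -> String;
}

/// Mock client: returns a canned summary with no network round trip.
pub struct MockSummarizer;

impl Summarizer for MockSummarizer {
    fn summarize(&self, messages: &[String]) -> String {
        format!("summary of {} messages", messages.len())
    }
}

/// Replaces everything except the last `keep_recent` messages with one summary.
pub fn compact(session: &mut Vec<String>, keep_recent: usize, client: &dyn Summarizer) {
    if session.len() <= keep_recent {
        return;
    }
    let split = session.len() - keep_recent;
    let summary = client.summarize(&session[..split]);
    let recent = session.split_off(split);
    session.clear();
    session.push(summary);
    session.extend(recent);
}

/// Times one compaction from trigger to "session messages replaced".
/// Returns the post-compaction message count and the elapsed time.
pub fn time_compaction(count: usize, keep_recent: usize) -> (usize, Duration) {
    let mut session: Vec<String> = (0..count).map(|i| format!("message {i}")).collect();
    let start = Instant::now();
    compact(&mut session, keep_recent, &MockSummarizer);
    (session.len(), start.elapsed())
}
```

Calling `time_compaction` with 100, 500, 1000, and 5000 messages would cover the counts listed above. Because the mock returns immediately, the numbers reflect only the local bookkeeping around the API call.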
4. Tool result truncation
Benchmark compact_tool_result_for_context() and summarize_text_head_tail() on outputs of varying sizes (1KB to 10MB). These run on every tool execution.
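To show the shape such a size sweep could take, here is a sketch using only the standard library. The signatures of the real functions are not given here, so `head_tail` is a hypothetical stand-in for summarize_text_head_tail(), and `time_truncation` is an illustrative helper; a criterion benchmark group parameterized by input size would be the more robust version.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the real truncation: keeps roughly `head` bytes
/// from the start and `tail` bytes from the end, never splitting a UTF-8 char.
pub fn head_tail(text: &str, head: usize, tail: usize) -> String {
    if text.len() <= head + tail {
        return text.to_string();
    }
    let mut head_end = head;
    while !text.is_char_boundary(head_end) {
        head_end -= 1;
    }
    let mut tail_start = text.len() - tail;
    while !text.is_char_boundary(tail_start) {
        tail_start += 1;
    }
    format!(
        "{}\n[... {} bytes omitted ...]\n{}",
        &text[..head_end],
        tail_start - head_end,
        &text[tail_start..]
    )
}

/// Times one truncation per input size and returns (size, elapsed) pairs.
pub fn time_truncation(sizes: &[usize], head: usize, tail: usize) -> Vec<(usize, Duration)> {
    sizes
        .iter()
        .map(|&size| {
            let input = "x".repeat(size);
            let start = Instant::now();
            let output = head_tail(&input, head, tail);
            // black_box keeps the optimizer from discarding the result.
            std::hint::black_box(&output);
            (size, start.elapsed())
        })
        .collect()
}
```

Passing sizes such as `1 << 10`, `1 << 20`, and `10 << 20` spans the 1KB to 10MB range. A single timed run per size is noisy, so this is only useful for spotting gross scaling problems such as accidental quadratic behavior.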
Proposal
Add a benches/ directory at the workspace root with criterion benchmarks, each declared as a [[bench]] entry in Cargo.toml with harness = false (criterion supplies its own harness). Run them in CI as a non-blocking optional job; they can't gate merges because meaningful comparisons need a stable baseline machine.
Use:
[dev-dependencies]
criterion = { version = "0.5", features = ["html_reports"] }
Acceptance criteria
- At least 3 benchmark groups exist and pass cargo bench
- Token estimation benchmark includes a comparison against a real tokenizer
- Stream parsing benchmark processes >100K events without per-event throughput degrading over the course of the run
- Results are documented in a benches/README.md