Skip to content

Add criterion benchmarks for critical paths: token estimation, stream parsing, context compaction #230

@Hmbown

Description

@Hmbown

Problem

The codebase has zero benchmarks. This is notable for a project where:

  • Token estimation runs on every turn to decide whether to compact context — an inaccurate estimate causes unnecessary compaction or context overflow
  • Stream parsing processes every SSE chunk from a 1M-token context window in real time
  • Context compaction runs the LLM summarization API call and must complete before the next turn
  • Tool result truncation runs on every tool execution for large-output tools

Without benchmarks, performance regressions are invisible until a user reports "the TUI feels sluggish on large sessions."

What to benchmark

1. Token estimation accuracy

Compare estimate_tokens() / estimate_input_tokens_conservative() against a real tokenizer (e.g., tiktoken-rs with the DeepSeek tokenizer). Measure:

  • Mean absolute percentage error on representative message corpuses
  • Wall-clock time per estimation (should be <1ms)
  • False-positive and false-negative rates for the "should compact?" decision

2. Stream parsing throughput

Feed a large canned SSE stream (100K+ events simulating a long V4 thinking turn) through the stream parsing loop. Measure:

  • Events processed per second
  • Memory allocation profile
  • Tail latency (p99 event processing time)

3. Context compaction end-to-end latency

With a mock LLM client, measure the wall-clock time from "compaction triggered" to "session messages replaced" for varying message counts (100, 500, 1000, 5000).

4. Tool result truncation

Benchmark compact_tool_result_for_context() and summarize_text_head_tail() on outputs of varying sizes (1KB to 10MB). These run on every tool execution.

Proposal

Add a benches/ directory at the workspace root with criterion benchmarks, gated behind a [[bench]] in Cargo.toml. Run them in CI as a non-blocking optional job (they can't gate merges because they need a stable baseline machine).

Use:

[dev-dependencies]
criterion = { version = "0.5", features = ["html_reports"] }

Acceptance criteria

  • At least 3 benchmark groups exist and pass cargo bench
  • Token estimation benchmark includes a comparison against a real tokenizer
  • Stream parsing benchmark processes >100K events without slowdown
  • Results are documented in a benches/README.md

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions