Add criterion benchmarks for critical paths: token estimation, stream parsing, context compaction #230

## Problem

The codebase has zero benchmarks. This is notable for a project where:
- **Token estimation** runs on every turn to decide whether to compact context — an inaccurate estimate causes unnecessary compaction or context overflow
- **Stream parsing** processes every SSE chunk of a model response in real time, in sessions that can approach a 1M-token context window
- **Context compaction** runs the LLM summarization API call and must complete before the next turn
- **Tool result truncation** runs on every tool execution for large-output tools

Without benchmarks, performance regressions are invisible until a user reports "the TUI feels sluggish on large sessions."

## What to benchmark

### 1. Token estimation accuracy
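Compare `estimate_tokens()` / `estimate_input_tokens_conservative()` against a real tokenizer (e.g., the Hugging Face `tokenizers` crate loaded with DeepSeek's `tokenizer.json`; `tiktoken-rs` bundles only OpenAI encodings, so it would give a proxy count rather than ground truth). Measure:
- Mean absolute percentage error on representative message corpora
- Wall-clock time per estimation (should be < 1 ms)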
- False-positive and false-negative rates for the "should compact?" decision
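
The accuracy metrics above can be computed independently of any particular tokenizer. The following is a minimal sketch, not the project's code: `Sample`, `mape`, and `decision_error_rates` are hypothetical names, and it assumes the "should compact?" decision is a simple `count >= threshold` check. Here a false positive is an estimate that triggers compaction when the real count would not, and a false negative is the reverse (the case that risks context overflow).

```rust
/// One corpus sample: the heuristic estimate and the real tokenizer's count.
/// (Hypothetical type, for illustration only.)
#[derive(Clone, Copy)]
pub struct Sample {
    pub estimated: usize,
    pub actual: usize,
}

/// Mean absolute percentage error, in percent.
/// Samples with `actual == 0` are skipped to avoid dividing by zero.
pub fn mape(samples: &[Sample]) -> Option<f64> {
    let mut sum = 0.0;
    let mut n = 0usize;
    for s in samples.iter().filter(|s| s.actual > 0) {
        sum += (s.estimated as f64 - s.actual as f64).abs() / s.actual as f64;
        n += 1;
    }
    if n == 0 { None } else { Some(100.0 * sum / n as f64) }
}

/// Returns `(false_positive_rate, false_negative_rate)` for the compaction
/// decision, assuming the decision is `count >= threshold`.
pub fn decision_error_rates(samples: &[Sample], threshold: usize) -> (f64, f64) {
    let (mut fp, mut fneg, mut negatives, mut positives) = (0usize, 0usize, 0usize, 0usize);
    for s in samples {
        let predicted = s.estimated >= threshold;
        let truth = s.actual >= threshold;
        if truth { positives += 1 } else { negatives += 1 }
        match (predicted, truth) {
            (true, false) => fp += 1,   // compacted unnecessarily
            (false, true) => fneg += 1, // missed a needed compaction
            _ => {}
        }
    }
    let rate = |num: usize, den: usize| if den == 0 { 0.0 } else { num as f64 / den as f64 };
    (rate(fp, negatives), rate(fneg, positives))
}
```

In a real benchmark, `estimated` would come from `estimate_tokens()` and `actual` from the reference tokenizer run over the same messages.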

### 2. Stream parsing throughput
Feed a large canned SSE stream (100K+ events simulating a long V4 thinking turn) through the stream parsing loop. Measure:
- Events processed per second
- Memory allocation profile
- Tail latency (p99 event processing time)
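
As a rough illustration of how the canned stream and the measurements above might fit together, here is a std-only sketch. Everything in it is an assumption: the payload shape, `canned_sse_stream`, `parse_events` (a trivial stand-in for the real parsing loop), and the nearest-rank `percentile` helper are all hypothetical.

```rust
use std::time::{Duration, Instant};

/// Build a canned SSE stream of `n` events. The JSON payload is made up.
pub fn canned_sse_stream(n: usize) -> String {
    let mut out = String::with_capacity(n * 32);
    for i in 0..n {
        out.push_str("data: {\"delta\":\"tok");
        out.push_str(&i.to_string());
        out.push_str("\"}\n\n");
    }
    out
}

/// Stand-in parser: splits on blank lines and yields each `data: ` payload.
pub fn parse_events(stream: &str) -> impl Iterator<Item = &str> {
    stream
        .split("\n\n")
        .filter_map(|block| block.strip_prefix("data: "))
}

/// Nearest-rank percentile for `p` in (0, 100]. Sorts `samples` in place.
pub fn percentile(samples: &mut [Duration], p: f64) -> Option<Duration> {
    if samples.is_empty() {
        return None;
    }
    samples.sort_unstable();
    let rank = ((p / 100.0) * samples.len() as f64).ceil() as usize;
    Some(samples[rank.clamp(1, samples.len()) - 1])
}

/// Parse the whole stream; return (event count, events/sec, p99 per-event time).
pub fn measure(stream: &str) -> (usize, f64, Option<Duration>) {
    let mut latencies = Vec::new();
    let start = Instant::now();
    let mut last = start;
    for payload in parse_events(stream) {
        std::hint::black_box(payload); // keep the parse from being optimized away
        let now = Instant::now();
        latencies.push(now - last);
        last = now;
    }
    let total = start.elapsed().as_secs_f64();
    let n = latencies.len();
    let rate = if total > 0.0 { n as f64 / total } else { f64::INFINITY };
    (n, rate, percentile(&mut latencies, 99.0))
}
```

Calling `Instant::now()` per event adds its own overhead, so the p99 figure is only indicative. Under criterion, the parse loop would instead go inside the closure passed to `Bencher::iter`, with allocation profiling handled by a separate tool.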

### 3. Context compaction end-to-end latency
With a mock LLM client, measure the wall-clock time from "compaction triggered" to "session messages replaced" for varying message counts (100, 500, 1000, 5000).
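
One way to structure this measurement is sketched below. It is a simplified, synchronous illustration under stated assumptions: the `Summarizer` trait, `MockSummarizer`, and `compact` are hypothetical stand-ins (the real client is likely async and the real compaction policy will differ), and messages are modeled as plain strings.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the LLM client interface.
pub trait Summarizer {
    fn summarize(&self, messages: &[String]) -> String;
}

/// Mock client: returns a canned summary immediately, so the timing covers
/// only local work, not network or model latency.
pub struct MockSummarizer;

impl Summarizer for MockSummarizer {
    fn summarize(&self, messages: &[String]) -> String {
        format!("summary of {} messages", messages.len())
    }
}

/// Hypothetical compaction: replace everything except the last `keep_recent`
/// messages with a single summary message.
pub fn compact(messages: &mut Vec<String>, keep_recent: usize, client: &dyn Summarizer) {
    if messages.len() <= keep_recent {
        return;
    }
    let split = messages.len() - keep_recent;
    let summary = client.summarize(&messages[..split]);
    let recent = messages.split_off(split);
    messages.clear();
    messages.push(summary);
    messages.extend(recent);
}

/// Time one compaction per session size, from trigger to messages replaced.
pub fn time_compaction(sizes: &[usize], keep_recent: usize) -> Vec<(usize, Duration)> {
    sizes
        .iter()
        .map(|&n| {
            let mut session: Vec<String> = (0..n).map(|i| format!("message {i}")).collect();
            let start = Instant::now();
            compact(&mut session, keep_recent, &MockSummarizer);
            (n, start.elapsed())
        })
        .collect()
}
```

For example, `time_compaction(&[100, 500, 1000, 5000], 10)` yields one timing per message count. A single run per size is noisy; criterion's repeated sampling would replace the manual `Instant` calls in an actual bench.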

### 4. Tool result truncation
Benchmark `compact_tool_result_for_context()` and `summarize_text_head_tail()` on outputs of varying sizes (1KB to 10MB). These run on every tool execution.
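
A sketch of the input sweep follows. The real signatures of `compact_tool_result_for_context()` and `summarize_text_head_tail()` are not shown in this issue, so `head_tail` below is a hypothetical stand-in that keeps the first and last bytes of the text (snapped to UTF-8 character boundaries); `synthetic_output` and `time_truncation` are likewise illustrative names.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a head/tail summarizer: keeps roughly the first
/// `head` and last `tail` bytes, never splitting a UTF-8 character.
pub fn head_tail(text: &str, head: usize, tail: usize) -> String {
    if text.len() <= head + tail {
        return text.to_string();
    }
    let mut h = head;
    while !text.is_char_boundary(h) {
        h -= 1; // shrink the head to the previous boundary
    }
    let mut t = text.len() - tail;
    while !text.is_char_boundary(t) {
        t += 1; // shrink the tail to the next boundary
    }
    format!("{}\n[... {} bytes omitted ...]\n{}", &text[..h], t - h, &text[t..])
}

/// Build exactly `size` bytes of ASCII filler resembling tool output.
pub fn synthetic_output(size: usize) -> String {
    const LINE: &str = "line of tool output\n";
    let mut s = LINE.repeat(size / LINE.len() + 1);
    s.truncate(size); // safe: the filler is pure ASCII
    s
}

/// Time one truncation per input size.
pub fn time_truncation(sizes: &[usize]) -> Vec<(usize, Duration)> {
    sizes
        .iter()
        .map(|&size| {
            let input = synthetic_output(size);
            let start = Instant::now();
            std::hint::black_box(head_tail(&input, 2048, 2048));
            (size, start.elapsed())
        })
        .collect()
}
```

A sweep such as `time_truncation(&[1 << 10, 10 << 10, 100 << 10, 1 << 20, 10 << 20])` covers 1KB to 10MB. If truncation time grows with input size even though the output size is fixed, that points to an avoidable full scan or copy.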

## Proposal

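Add a `benches/` directory at the workspace root with criterion benchmarks, each registered as a `[[bench]]` target with `harness = false` in `Cargo.toml` (criterion supplies its own `main`, so the default libtest harness must be disabled). Run them in CI as a non-blocking, optional job: they can't gate merges because meaningful comparisons need a stable baseline machine.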

Use:
```toml
[dev-dependencies]
criterion = { version = "0.5", features = ["html_reports"] }
```

## Acceptance criteria

- At least 3 benchmark groups exist and pass `cargo bench`
- Token estimation benchmark includes a comparison against a real tokenizer
- Stream parsing benchmark processes >100K events with per-event time staying roughly constant as the stream grows
- Results are documented in `benches/README.md`
