Improve token estimation accuracy: replace chars/3 heuristic with a real tokenizer #232

## Problem

The engine uses `estimate_text_tokens_conservative()`, which divides the character count by 3, rounding up:

```rust
fn estimate_text_tokens_conservative(text: &str) -> usize {
    text.chars().count().div_ceil(3)
}
```

`estimate_tokens()` is called from the compaction module, while `estimate_input_tokens_conservative()` multiplies its result by 1.5 as a safety factor:

```rust
fn estimate_input_tokens_conservative(messages: &[Message], system: Option<&SystemPrompt>) -> usize {
    let message_tokens = estimate_tokens(messages).saturating_mul(3).div_ceil(2);
    // ...
}
```
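To make the combined effect concrete, here is a minimal sketch of the two steps applied to a single string. `apply_safety_factor` and `padded_estimate` are hypothetical names introduced only for illustration; the real `estimate_tokens()` works over messages and is not shown in this issue.

```rust
/// Mirrors the chars/3 heuristic quoted above.
fn estimate_text_tokens_conservative(text: &str) -> usize {
    text.chars().count().div_ceil(3)
}

/// Hypothetical helper: applies the 1.5x safety factor with the same integer
/// arithmetic as the excerpt (multiply by 3, then divide by 2 rounding up).
fn apply_safety_factor(tokens: usize) -> usize {
    tokens.saturating_mul(3).div_ceil(2)
}

/// Hypothetical helper: end-to-end padded estimate for one string.
fn padded_estimate(text: &str) -> usize {
    apply_safety_factor(estimate_text_tokens_conservative(text))
}
```

For a 300-character string this gives 100 tokens before padding and 150 after, so the effective budget is roughly one token per two characters regardless of what the text contains.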

This matters because:

1. **False positives trigger unnecessary compaction** — if the estimate is too high, the engine compacts context prematurely, losing information and wasting an LLM call.

2. **False negatives cause context overflow** — if the estimate is too low, the API returns HTTP 400 with a context-length error, triggering emergency compaction that may fail.

3. **Capacity controller decisions are based on these estimates** — the `CapacityController` uses `estimated_input_tokens()` to decide whether to defer, compact, or proceed.

4. **Cycle advancement** uses token estimates to decide when to checkpoint-restart.
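The sketch below illustrates how an estimation error flips a capacity decision. It is not the real `CapacityController`: the `CapacityAction` enum, the `decide` function, and the 80% / 100% thresholds are all assumptions made up for this example.

```rust
/// Hypothetical outcome of a capacity check.
#[derive(Debug, PartialEq, Eq)]
enum CapacityAction {
    Proceed,
    Compact,
    Defer,
}

/// Hypothetical decision rule: compact at 80% of the context limit,
/// defer once the estimate reaches the limit itself.
fn decide(estimated_input_tokens: usize, context_limit: usize) -> CapacityAction {
    let compact_at = context_limit / 5 * 4;
    if estimated_input_tokens >= context_limit {
        CapacityAction::Defer
    } else if estimated_input_tokens >= compact_at {
        CapacityAction::Compact
    } else {
        CapacityAction::Proceed
    }
}
```

With a 100,000-token limit, a request whose true size is 70,000 tokens should proceed, but an estimate of 85,000 triggers compaction. In the other direction, a true size of 105,000 estimated as 75,000 proceeds and overflows at the API.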

The chars/3 heuristic is particularly inaccurate for:
- Code (variable names, operators, indentation — widely varying token density)
- JSON payloads (structured data with many punctuation tokens)
- Non-English text (a single CJK character often encodes to multiple tokens, so chars/3 underestimates)

## Proposal

Integrate a lightweight tokenizer that matches DeepSeek's tokenizer. Options:

**Option A: `tiktoken-rs`.** A Rust crate that implements OpenAI's tiktoken encodings. DeepSeek V4 likely uses a BPE tokenizer similar to GPT-4's `cl100k_base`, so this should be much closer than chars/3, but it is only as accurate as that similarity. Adds a new dependency.

**Option B: HuggingFace `tokenizers` crate.** Can load a DeepSeek-specific `tokenizer.json`. Most accurate, because it uses the model's actual vocabulary, but the heaviest dependency.

**Option C: Pure-Rust BPE implementation.** Implement a minimal BPE tokenizer against DeepSeek's published tokenizer config. No C dependency, but a maintenance burden.

**Option D: Hybrid approach.** Keep chars/3 as the fast path, but periodically run a real tokenizer on a sample to calibrate the chars-per-token ratio. Recalibrate every N turns.
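A minimal sketch of how Option D could look, assuming the real tokenizer is passed in as a closure. `CalibratedEstimator`, its fields, and `on_turn` are hypothetical names, not existing engine types.

```rust
/// Hypothetical estimator for Option D: a chars-per-token ratio that starts
/// at 3.0 and is recalibrated from a real tokenizer every N turns.
struct CalibratedEstimator {
    chars_per_token: f64,
    recalibrate_every: usize,
    turns_since_calibration: usize,
}

impl CalibratedEstimator {
    fn new(recalibrate_every: usize) -> Self {
        Self {
            chars_per_token: 3.0,
            // Guard against N = 0, which would otherwise never make sense.
            recalibrate_every: recalibrate_every.max(1),
            turns_since_calibration: 0,
        }
    }

    /// Fast path: character count divided by the current ratio, rounded up.
    fn estimate(&self, text: &str) -> usize {
        let chars = text.chars().count() as f64;
        (chars / self.chars_per_token).ceil() as usize
    }

    /// Call once per turn. On every N-th call, runs `count_tokens` (the real
    /// tokenizer) on `sample` and updates the ratio. Returns true if a
    /// recalibration pass ran on this call.
    fn on_turn<F>(&mut self, sample: &str, count_tokens: F) -> bool
    where
        F: FnOnce(&str) -> usize,
    {
        self.turns_since_calibration += 1;
        if self.turns_since_calibration < self.recalibrate_every {
            return false;
        }
        self.turns_since_calibration = 0;
        let chars = sample.chars().count();
        let tokens = count_tokens(sample);
        // Skip degenerate samples so the ratio never becomes zero or infinite.
        if chars > 0 && tokens > 0 {
            self.chars_per_token = chars as f64 / tokens as f64;
        }
        true
    }
}
```

The trade-off is that the ratio reflects whatever was sampled last, so a conversation that switches from prose to JSON or CJK text stays miscalibrated until the next pass.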

## Recommended approach

Start with Option A (`tiktoken-rs` with `cl100k_base`) behind a feature flag so it's opt-in. Keep the existing chars/3 heuristic as the fallback when the feature is disabled or the tokenizer is unavailable. Add a benchmark comparing the accuracy of the two.

Add to `[features]` in the workspace `Cargo.toml`, with `tiktoken-rs` declared as an optional dependency:
```toml
tokenizer = ["tiktoken-rs"]
```
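A sketch of how the feature gate and fallback could be wired, assuming the `tokenizer` feature enables an optional `tiktoken-rs` dependency. `estimate_text_tokens` is a hypothetical wrapper name; `cl100k_base()` and `encode_with_special_tokens()` are the `tiktoken-rs` calls this sketch relies on, and the encoder is cached per thread as one possible choice.

```rust
/// Fallback: the existing chars/3 heuristic.
fn estimate_text_tokens_conservative(text: &str) -> usize {
    text.chars().count().div_ceil(3)
}

/// Hypothetical wrapper used when the `tokenizer` feature is enabled.
#[cfg(feature = "tokenizer")]
fn estimate_text_tokens(text: &str) -> usize {
    use tiktoken_rs::CoreBPE;

    thread_local! {
        // Built once per thread; None if the encoder could not be constructed.
        static BPE: Option<CoreBPE> = tiktoken_rs::cl100k_base().ok();
    }

    BPE.with(|bpe| match bpe {
        // Count tokens with the cl100k_base encoding (an approximation of
        // DeepSeek's tokenizer, not an exact match).
        Some(bpe) => bpe.encode_with_special_tokens(text).len(),
        // Tokenizer unavailable: fall back to chars/3.
        None => estimate_text_tokens_conservative(text),
    })
}

/// Same entry point when the feature is disabled: always use chars/3.
#[cfg(not(feature = "tokenizer"))]
fn estimate_text_tokens(text: &str) -> usize {
    estimate_text_tokens_conservative(text)
}
```

Callers use `estimate_text_tokens()` in both configurations, so call sites do not need their own `cfg` attributes.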

## Acceptance criteria

- Token estimation uses a real tokenizer when the `tokenizer` feature is enabled
- Estimation error is <5% relative to the true token count on a representative corpus of 100+ messages
- The chars/3 fallback still works when the feature is disabled
- A benchmark compares the two approaches (see issue #230)
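One way the <5% criterion could be measured is mean relative error over (estimated, actual) pairs, as sketched below. This reading of the criterion is an assumption, and `relative_error` and `mean_relative_error` are hypothetical helper names; the source of the reference counts (for example, usage figures returned by the API) is left open.

```rust
/// Relative error of one estimate against a reference token count.
fn relative_error(estimated: usize, actual: usize) -> f64 {
    if actual == 0 {
        // An empty reference is only matched exactly by an empty estimate.
        return if estimated == 0 { 0.0 } else { f64::INFINITY };
    }
    (estimated as f64 - actual as f64).abs() / actual as f64
}

/// Mean relative error over (estimated, actual) pairs; None for an empty corpus.
fn mean_relative_error(samples: &[(usize, usize)]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let total: f64 = samples.iter().map(|&(e, a)| relative_error(e, a)).sum();
    Some(total / samples.len() as f64)
}
```

Because the absolute value is taken per sample, overestimates and underestimates do not cancel out. A benchmark might report the two directions separately as well, since they have different costs.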
