Problem
The engine uses estimate_text_tokens_conservative() which divides character count by 3:
fn estimate_text_tokens_conservative(text: &str) -> usize {
text.chars().count().div_ceil(3)
}
And estimate_tokens() is called from the compaction module, while estimate_input_tokens_conservative() multiplies the result by 1.5× as a safety factor:
fn estimate_input_tokens_conservative(messages: &[Message], system: Option<&SystemPrompt>) -> usize {
let message_tokens = estimate_tokens(messages).saturating_mul(3).div_ceil(2);
// ...
}
This matters because:
-
False positives trigger unnecessary compaction — if the estimate is too high, the engine compacts context prematurely, losing information and wasting an LLM call.
-
False negatives cause context overflow — if the estimate is too low, the API returns HTTP 400 with a context-length error, triggering emergency compaction that may fail.
-
Capacity controller decisions are based on these estimates — the CapacityController uses estimated_input_tokens() to decide whether to defer, compact, or proceed.
-
Cycle advancement uses token estimates to decide when to checkpoint-restart.
The chars/3 heuristic is particularly inaccurate for:
- Code (variable names, operators, indentation — widely varying token density)
- JSON payloads (structured data with many punctuation tokens)
- Non-English text (CJK characters are often 1 character = multiple tokens)
Proposal
Integrate a lightweight tokenizer that matches DeepSeek's tokenizer. Options:
Option A: tiktoken-rs — Rust bindings to OpenAI's tiktoken. DeepSeek V4 likely uses a BPE tokenizer similar to GPT-4's cl100k_base. This would be the most accurate but adds a C dependency.
Option B: HuggingFace tokenizers crate — can load a DeepSeek-specific tokenizer.json. Most accurate but heaviest dependency.
Option C: Pure-Rust BPE implementation — implement a minimal BPE tokenizer against DeepSeek's published tokenizer config. No C dependency, but maintenance burden.
Option D: Hybrid approach — keep chars/3 as a fast path but run a tokenizer sample periodically to calibrate the ratio. Recalibrate every N turns.
Recommended approach
Start with Option A (tiktoken-rs with cl100k_base) behind a feature flag so it's opt-in. Run the existing chars/3 heuristic as a fallback when the tokenizer is unavailable. Add a benchmark comparing accuracy.
Add to [features] in the workspace Cargo.toml:
tokenizer = ["tiktoken-rs"]
Acceptance criteria
Problem
The engine uses
estimate_text_tokens_conservative()which divides character count by 3:And
estimate_tokens()is called from the compaction module, whileestimate_input_tokens_conservative()multiplies the result by 1.5× as a safety factor:This matters because:
False positives trigger unnecessary compaction — if the estimate is too high, the engine compacts context prematurely, losing information and wasting an LLM call.
False negatives cause context overflow — if the estimate is too low, the API returns HTTP 400 with a context-length error, triggering emergency compaction that may fail.
Capacity controller decisions are based on these estimates — the
CapacityControllerusesestimated_input_tokens()to decide whether to defer, compact, or proceed.Cycle advancement uses token estimates to decide when to checkpoint-restart.
The chars/3 heuristic is particularly inaccurate for:
Proposal
Integrate a lightweight tokenizer that matches DeepSeek's tokenizer. Options:
Option A:
tiktoken-rs— Rust bindings to OpenAI's tiktoken. DeepSeek V4 likely uses a BPE tokenizer similar to GPT-4'scl100k_base. This would be the most accurate but adds a C dependency.Option B: HuggingFace
tokenizerscrate — can load a DeepSeek-specifictokenizer.json. Most accurate but heaviest dependency.Option C: Pure-Rust BPE implementation — implement a minimal BPE tokenizer against DeepSeek's published tokenizer config. No C dependency, but maintenance burden.
Option D: Hybrid approach — keep chars/3 as a fast path but run a tokenizer sample periodically to calibrate the ratio. Recalibrate every N turns.
Recommended approach
Start with Option A (
tiktoken-rswithcl100k_base) behind a feature flag so it's opt-in. Run the existing chars/3 heuristic as a fallback when the tokenizer is unavailable. Add a benchmark comparing accuracy.Add to
[features]in the workspaceCargo.toml:Acceptance criteria
tokenizerfeature is enabled