Improve token estimation accuracy: replace chars/3 heuristic with a real tokenizer #232

@Hmbown

Problem

The engine uses estimate_text_tokens_conservative(), which divides the character count by 3:

fn estimate_text_tokens_conservative(text: &str) -> usize {
    text.chars().count().div_ceil(3)
}

estimate_tokens() is called from the compaction module, and estimate_input_tokens_conservative() multiplies its result by 1.5 (rounding up) as a safety factor:

fn estimate_input_tokens_conservative(messages: &[Message], system: Option<&SystemPrompt>) -> usize {
    let message_tokens = estimate_tokens(messages).saturating_mul(3).div_ceil(2);
    // ...
}
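To make the combined effect concrete, here is a small worked example of the arithmetic. It assumes estimate_tokens() reduces to the chars/3 count for plain text (its body is not shown in this issue), and the helper names are made up for illustration:

```rust
/// Same arithmetic as estimate_text_tokens_conservative(), applied to a
/// bare character count. Hypothetical helper, for illustration only.
fn chars_to_tokens(chars: usize) -> usize {
    chars.div_ceil(3)
}

/// Same 1.5x padding as estimate_input_tokens_conservative().
/// Hypothetical helper, for illustration only.
fn with_safety_factor(tokens: usize) -> usize {
    tokens.saturating_mul(3).div_ceil(2)
}

fn main() {
    let chars = 10_000;
    let base = chars_to_tokens(chars); // 3334
    let padded = with_safety_factor(base); // 5001
    println!("{chars} chars -> {base} tokens -> {padded} padded tokens");
}
```

Under that assumption the padded input estimate works out to roughly one token per two characters, regardless of what the text contains.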

This matters because:

  1. False positives trigger unnecessary compaction — if the estimate is too high, the engine compacts context prematurely, losing information and wasting an LLM call.

  2. False negatives cause context overflow — if the estimate is too low, the API returns HTTP 400 with a context-length error, triggering emergency compaction that may fail.

  3. Capacity controller decisions are based on these estimates — the CapacityController uses estimated_input_tokens() to decide whether to defer, compact, or proceed.

  4. Cycle advancement uses token estimates to decide when to checkpoint-restart.

The chars/3 heuristic is particularly inaccurate for:

  • Code (variable names, operators, indentation — widely varying token density)
  • JSON payloads (structured data with many punctuation tokens)
  • Non-English text (a single CJK character is often encoded as multiple tokens)
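A quick way to see the blind spot: the heuristic only looks at character count, so three very different 30-character inputs all get the same estimate. The sample strings below are made up for illustration, and no real token counts are claimed for them:

```rust
fn estimate_text_tokens_conservative(text: &str) -> usize {
    text.chars().count().div_ceil(3)
}

fn main() {
    // 15 repetitions of a two-character CJK word = 30 characters.
    let cjk = "令牌".repeat(15);
    let samples = [
        ("prose", "tokens are counted per message"),
        ("json", r#"{"a":[1,2,{"b":null}],"c":"d"}"#),
        ("cjk", cjk.as_str()),
    ];
    for (label, text) in samples {
        println!(
            "{label}: {} chars -> {} estimated tokens",
            text.chars().count(),
            estimate_text_tokens_conservative(text)
        );
    }
}
```

Each sample is 30 characters, so each is estimated at 10 tokens. A real tokenizer would very likely assign them different counts, which is the gap this issue is about.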

Proposal

Integrate a lightweight tokenizer that matches DeepSeek's tokenizer. Options:

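Option A: tiktoken-rs — a Rust library that implements OpenAI's tiktoken encodings. DeepSeek V4 likely uses a BPE tokenizer similar to GPT-4's cl100k_base, so this should be much closer than chars/3, though not exact unless the vocabularies match. As far as I can tell the crate is pure Rust, so it should not add a C dependency.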

Option B: HuggingFace tokenizers crate — can load a DeepSeek-specific tokenizer.json. Most accurate but heaviest dependency.

Option C: Pure-Rust BPE implementation — implement a minimal BPE tokenizer against DeepSeek's published tokenizer config. No C dependency, but maintenance burden.

Option D: Hybrid approach — keep chars/3 as a fast path but run a tokenizer sample periodically to calibrate the ratio. Recalibrate every N turns.
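Option D could look roughly like the sketch below. RatioCalibrator, its fields, and the stand-in tokenizer are all hypothetical names for illustration. Nothing like this exists in the engine today, and a real version would pass an actual tokenizer as the counting function:

```rust
/// Hypothetical calibrator for Option D: keeps a chars-per-token ratio
/// and refreshes it from a real tokenizer every N turns.
struct RatioCalibrator {
    chars_per_token: f64,
    recalibrate_every: usize,
    turns_since_calibration: usize,
}

impl RatioCalibrator {
    fn new(recalibrate_every: usize) -> Self {
        Self {
            chars_per_token: 3.0, // start from the current chars/3 heuristic
            recalibrate_every,
            turns_since_calibration: 0,
        }
    }

    /// Fast path: no tokenizer call, just the calibrated ratio.
    fn estimate(&self, text: &str) -> usize {
        (text.chars().count() as f64 / self.chars_per_token).ceil() as usize
    }

    /// Call once per turn. Runs `count_tokens` on `sample` only every N turns.
    fn on_turn(&mut self, sample: &str, count_tokens: impl Fn(&str) -> usize) {
        self.turns_since_calibration += 1;
        if self.turns_since_calibration < self.recalibrate_every {
            return;
        }
        self.turns_since_calibration = 0;
        let chars = sample.chars().count();
        let tokens = count_tokens(sample);
        // Ignore empty samples so the ratio never becomes zero or infinite.
        if chars > 0 && tokens > 0 {
            self.chars_per_token = chars as f64 / tokens as f64;
        }
    }
}

/// Stand-in for a real tokenizer: pretends every 2 characters are 1 token.
fn stand_in_tokenizer(text: &str) -> usize {
    text.chars().count().div_ceil(2)
}

fn main() {
    let mut calibrator = RatioCalibrator::new(2);
    let sample = "x".repeat(100);
    println!("before: {}", calibrator.estimate("0123456789"));
    calibrator.on_turn(&sample, stand_in_tokenizer); // turn 1: skipped
    calibrator.on_turn(&sample, stand_in_tokenizer); // turn 2: recalibrates
    println!("after: {}", calibrator.estimate("0123456789"));
}
```

One open design question with this approach is which text to sample: a ratio calibrated on prose will still be off for a turn that is mostly JSON or code.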

Recommended approach

Start with Option A (tiktoken-rs with cl100k_base) behind a feature flag so it's opt-in. Run the existing chars/3 heuristic as a fallback when the tokenizer is unavailable. Add a benchmark comparing accuracy.
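For the benchmark, one possible metric is signed relative error, since the sign separates the two failure modes described above. This is only a sketch: the function names are hypothetical, and the numbers in main are placeholders, not measurements:

```rust
/// Signed relative error of an estimate against a reference token count.
/// Positive = overestimate (risk of premature compaction),
/// negative = underestimate (risk of context overflow).
fn signed_relative_error(estimated: usize, actual: usize) -> f64 {
    (estimated as f64 - actual as f64) / actual as f64
}

/// Returns (mean absolute relative error, worst underestimate) over
/// (estimated, actual) pairs. Pairs with actual == 0 are skipped.
fn summarize(pairs: &[(usize, usize)]) -> (f64, f64) {
    let errors: Vec<f64> = pairs
        .iter()
        .filter(|(_, actual)| *actual > 0)
        .map(|&(estimated, actual)| signed_relative_error(estimated, actual))
        .collect();
    if errors.is_empty() {
        return (0.0, 0.0);
    }
    let mean_abs = errors.iter().map(|e| e.abs()).sum::<f64>() / errors.len() as f64;
    // 0.0 means no sample was underestimated.
    let worst_under = errors.iter().copied().fold(0.0, f64::min);
    (mean_abs, worst_under)
}

fn main() {
    // Placeholder (estimated, actual) pairs, not real measurements.
    let pairs = [(150, 100), (80, 100)];
    let (mean_abs, worst_under) = summarize(&pairs);
    println!("mean abs error: {mean_abs:.2}, worst underestimate: {worst_under:.2}");
}
```

Reporting the worst underestimate separately matters here because a single large underestimate is what leads to the HTTP 400 path, even if the average error looks fine.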

Add to [features] in the workspace Cargo.toml, with tiktoken-rs declared as an optional dependency:

tokenizer = ["tiktoken-rs"]
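With that feature in place, the call site could be gated roughly as follows. This is a minimal sketch, not a proposed final API: count_with_tokenizer() and estimate_text_tokens() are hypothetical names, and it assumes cl100k_base is an acceptable stand-in for DeepSeek's vocabulary. The only tiktoken-rs calls used are cl100k_base() and encode_with_special_tokens():

```rust
/// Fallback: the existing chars/3 heuristic.
fn estimate_text_tokens_conservative(text: &str) -> usize {
    text.chars().count().div_ceil(3)
}

/// Tokenizer-backed count, compiled only with `--features tokenizer`.
/// Hypothetical helper; returns None if the encoding fails to load.
#[cfg(feature = "tokenizer")]
fn count_with_tokenizer(text: &str) -> Option<usize> {
    // Rebuilding the encoder on every call is wasteful; a real version
    // would cache it (for example in a std::sync::OnceLock).
    let bpe = tiktoken_rs::cl100k_base().ok()?;
    Some(bpe.encode_with_special_tokens(text).len())
}

/// Without the feature, the tokenizer is always "unavailable".
#[cfg(not(feature = "tokenizer"))]
fn count_with_tokenizer(_text: &str) -> Option<usize> {
    None
}

/// Hypothetical entry point: prefer the tokenizer, fall back to chars/3.
fn estimate_text_tokens(text: &str) -> usize {
    count_with_tokenizer(text).unwrap_or_else(|| estimate_text_tokens_conservative(text))
}

fn main() {
    let text = "fn main() { println!(\"hello\"); }";
    println!("estimated tokens: {}", estimate_text_tokens(text));
}
```

Built without the feature, this behaves exactly like the current heuristic, so the flag can stay off by default until the benchmark shows the tokenizer is worth the dependency.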

Acceptance criteria
