Skip to content

BPE tokenizer uses O(n²) greedy rescan instead of priority queue #378

@noahgift

Description

@noahgift

Five Whys

  1. Why is QLoRA training stuck for 25+ minutes before first epoch? Because prepare_samples() pre-tokenizes 15,494 samples sequentially on CPU.
  2. Why does pre-tokenization take so long? Because bpe() in qwen2bpe_tokenizer.rs:341-374 uses O(n²) algorithm per word.
  3. Why is the algorithm O(n²)? Because every iteration rescans the entire token list to find the lowest-rank merge, then uses Vec::splice() to shift the array.
  4. Why does it rescan instead of using a priority queue? Because the original implementation was a naïve textbook BPE without the standard optimization.
  5. Why wasn't this caught earlier? Because our benchmark (1.8M tokens/sec on 265-token words) masked the quadratic blowup — the bottleneck only appears at scale (15K samples × multiple words).

Root Cause

Three algorithmic defects in bpe():

Defect Current Reference (HuggingFace tokenizers)
Merge selection O(n) full rescan per merge O(log n) BinaryHeap pop
Pair lookup key HashMap<(String, String), usize> — 2 heap allocs per probe HashMap<(u32, u32), (u32, u32)> — zero alloc
Merge application Vec::splice() — O(n) array shift Doubly-linked list pointer update — O(1)

Fix

Port the HuggingFace tokenizers word.rs algorithm:

  1. Add merge_id_map: HashMap<(u32, u32), (u32, u32)> (integer pair → (rank, merged_id))
  2. Rewrite bpe() with doubly-linked symbol list + BinaryHeap min-heap
  3. Lazy validation of stale queue entries (same as HF approach)

Acceptance Criteria

  • All existing BPE tests pass (FALSIFY-BPE-001..009, unit tests, HF loading tests)
  • bench_bpe shows measurable throughput improvement
  • Same encode output for Qwen3 tokenizer (correctness preserved)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions