Five Whys
- Why is QLoRA training stuck for 25+ minutes before first epoch? Because
prepare_samples() pre-tokenizes 15,494 samples sequentially on CPU.
- Why does pre-tokenization take so long? Because
bpe() in qwen2bpe_tokenizer.rs:341-374 uses O(n²) algorithm per word.
- Why is the algorithm O(n²)? Because every iteration rescans the entire token list to find the lowest-rank merge, then uses
Vec::splice() to shift the array.
- Why does it rescan instead of using a priority queue? Because the original implementation was a naïve textbook BPE without the standard optimization.
- Why wasn't this caught earlier? Because our benchmark (1.8M tokens/sec on 265-token words) masked the quadratic blowup — the bottleneck only appears at scale (15K samples × multiple words).
Root Cause
Three algorithmic defects in bpe():
| Defect |
Current |
Reference (HuggingFace tokenizers) |
| Merge selection |
O(n) full rescan per merge |
O(log n) BinaryHeap pop |
| Pair lookup key |
HashMap<(String, String), usize> — 2 heap allocs per probe |
HashMap<(u32, u32), (u32, u32)> — zero alloc |
| Merge application |
Vec::splice() — O(n) array shift |
Doubly-linked list pointer update — O(1) |
Fix
Port the HuggingFace tokenizers word.rs algorithm:
- Add
merge_id_map: HashMap<(u32, u32), (u32, u32)> (integer pair → (rank, merged_id))
- Rewrite
bpe() with doubly-linked symbol list + BinaryHeap min-heap
- Lazy validation of stale queue entries (same as HF approach)
Acceptance Criteria
Five Whys
prepare_samples()pre-tokenizes 15,494 samples sequentially on CPU.bpe()inqwen2bpe_tokenizer.rs:341-374uses O(n²) algorithm per word.Vec::splice()to shift the array.Root Cause
Three algorithmic defects in
bpe():tokenizers)HashMap<(String, String), usize>— 2 heap allocs per probeHashMap<(u32, u32), (u32, u32)>— zero allocVec::splice()— O(n) array shiftFix
Port the HuggingFace
tokenizersword.rs algorithm:merge_id_map: HashMap<(u32, u32), (u32, u32)>(integer pair → (rank, merged_id))bpe()with doubly-linked symbol list + BinaryHeap min-heapAcceptance Criteria