BPE tokenizer uses O(n²) greedy rescan instead of priority queue

## Five Whys

1. **Why is QLoRA training stuck for 25+ minutes before first epoch?** Because `prepare_samples()` pre-tokenizes 15,494 samples sequentially on CPU.
2. **Why does pre-tokenization take so long?** Because `bpe()` in `qwen2bpe_tokenizer.rs:341-374` uses O(n²) algorithm per word.
3. **Why is the algorithm O(n²)?** Because every iteration rescans the entire token list to find the lowest-rank merge, then uses `Vec::splice()` to shift the array.
4. **Why does it rescan instead of using a priority queue?** Because the original implementation was a naïve textbook BPE without the standard optimization.
5. **Why wasn't this caught earlier?** Because our benchmark (1.8M tokens/sec on 265-token words) masked the quadratic blowup — the bottleneck only appears at scale (15K samples × multiple words).

## Root Cause

Three algorithmic defects in `bpe()`:

| Defect | Current | Reference (HuggingFace `tokenizers`) |
|--------|---------|--------------------------------------|
| Merge selection | O(n) full rescan per merge | O(log n) BinaryHeap pop |
| Pair lookup key | `HashMap<(String, String), usize>` — 2 heap allocs per probe | `HashMap<(u32, u32), (u32, u32)>` — zero alloc |
| Merge application | `Vec::splice()` — O(n) array shift | Doubly-linked list pointer update — O(1) |

## Fix

Port the HuggingFace `tokenizers` word.rs algorithm:
1. Add `merge_id_map: HashMap<(u32, u32), (u32, u32)>` (integer pair → (rank, merged_id))
2. Rewrite `bpe()` with doubly-linked symbol list + BinaryHeap min-heap
3. Lazy validation of stale queue entries (same as HF approach)

## Acceptance Criteria

- [ ] All existing BPE tests pass (FALSIFY-BPE-001..009, unit tests, HF loading tests)
- [ ] bench_bpe shows measurable throughput improvement
- [ ] Same encode output for Qwen3 tokenizer (correctness preserved)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BPE tokenizer uses O(n²) greedy rescan instead of priority queue #378

Five Whys

Root Cause

Fix

Acceptance Criteria

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Defect	Current	Reference (HuggingFace `tokenizers`)
Merge selection	O(n) full rescan per merge	O(log n) BinaryHeap pop
Pair lookup key	`HashMap<(String, String), usize>` — 2 heap allocs per probe	`HashMap<(u32, u32), (u32, u32)>` — zero alloc
Merge application	`Vec::splice()` — O(n) array shift	Doubly-linked list pointer update — O(1)

BPE tokenizer uses O(n²) greedy rescan instead of priority queue #378

Description

Five Whys

Root Cause

Fix

Acceptance Criteria

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions