Gravity Tokenizer: 1.0321 BPB via ablation leverage vocabulary optimization #755
dcrow85 wants to merge 2 commits into openai:main from
Conversation
…zation

Replaces 659/765 merge tokens by structural importance scoring. Vanilla 12L 384d transformer, no architectural novelties. 3-seed mean: 1.0321 (std 0.0011). All artifacts under 16 MB, all runs under 600 s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- The horizontal lensing hypothesis was tested and killed (56% RoPE artifact)
- Replaced with the depth efficiency law (p=0.00005, length-matched)
- Added Qwen 2.5-72B frontier probe results (80 layers, same physics)
- Link to full probe data and DEPTH_EFFICIENCY.md writeup
- Honest framing: reported what survived the controls and what didn't

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This is another one that I struggled to find an issue with for a while, but upon closer inspection of the tokenization, the reported …

Also, @NoesisGenesis, how does this one fit into your 4 categories?

Some additional details from Opus: the gravity tokenizer lacks a standalone … Since BPB = …
All four information-theoretic conditions are satisfied. This submission just requires correct BPB computation.
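Since the reviewer's point is that only the BPB computation needs correcting, it may help to spell out the standard bits-per-byte normalization being referred to. The sketch below is the textbook formula, not the competition's actual `eval_val()` code; the function name and signature are mine:

```python
import math

def val_bpb(mean_loss_nats: float, total_tokens: int, total_bytes: int) -> float:
    """Bits-per-byte from a per-token cross-entropy loss measured in nats.

    Standard normalization (a sketch, not the competition's code):
    convert nats -> bits, then multiply by tokens_per_byte so tokenizers
    with different compression ratios are compared on equal footing.
    """
    tokens_per_byte = total_tokens / total_bytes
    return (mean_loss_nats / math.log(2)) * tokens_per_byte
```

Note the direction of the penalty discussed later in the PR: at roughly 1.05 bytes/token the multiplier is about 0.95 tokens/byte, versus about 0.41 tokens/byte at 2.45 bytes/token, so the less-compressive tokenizer pays a larger multiplier and must win on per-token loss.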
Summary
3-Seed Results
Approach
At 1024 vocabulary tokens, every merge slot matters. Standard BPE allocates by frequency. The Gravity Tokenizer allocates by ablation leverage — the downstream loss increase when a token is shattered back to bytes. This is a measurement of structural importance, not frequency.
The scoring pipeline uses a frozen GPT-2 reference model to measure each candidate token's leverage across 100 FineWeb contexts. The top 765 candidates by gravity score replace the BPE merge tokens. The vocabulary size stays exactly 1024.
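The ablation-leverage measurement described above can be sketched as follows. This is a minimal stand-in, not the PR's actual harness: `eval_loss` and `shatter` are hypothetical hooks for the frozen GPT-2 reference model and the byte-shattering step, and the real pipeline averages over 100 FineWeb contexts:

```python
from typing import Callable, Iterable, List

def gravity_score(
    token_id: int,
    contexts: Iterable[List[int]],
    eval_loss: Callable[[List[int]], float],
    shatter: Callable[[List[int], int], List[int]],
) -> float:
    """Ablation leverage of one token: the mean loss increase when every
    occurrence of `token_id` is shattered back to byte tokens.

    eval_loss: frozen reference model's mean loss on a token sequence.
    shatter:   rewrites a sequence with `token_id` expanded to bytes.
    Both are hypothetical stand-ins for the PR's GPT-2 scoring harness.
    """
    deltas = []
    for ctx in contexts:
        if token_id not in ctx:
            continue  # token never fires here; no leverage signal
        deltas.append(eval_loss(shatter(ctx, token_id)) - eval_loss(ctx))
    return sum(deltas) / len(deltas) if deltas else 0.0
```

A token whose removal barely moves the loss scores near zero regardless of how frequent it is, which is exactly the frequency-versus-structure distinction the PR draws.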
Tokenizer Correctness
The `val_bpb` calculation uses the competition's own `build_sentencepiece_luts()` and `eval_val()` functions with zero modifications. The gravity tokenizer's lower compression ratio (1.05 vs 2.45 bytes/token) results in a higher `tokens_per_byte` multiplier, which penalizes the gravity tokenizer. The improvement is entirely in per-token prediction quality. Detailed correctness documentation is included in `tokenizer_scrutiny_doc.md`.

Setup

```bash
bash setup.sh  # Downloads stock FineWeb + retokenizes with gravity vocabulary
```

`train_gpt.py` is the unmodified competition baseline. All config via env vars.

Test plan
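Once candidates are scored, the vocabulary swap described under Approach reduces to a top-k selection. The sketch below uses illustrative sizes: 1024 total slots and 765 merge slots come from the PR, while the 259 base-token count is my inference from 1024 − 765, not something the PR states:

```python
from typing import Dict, List

def build_gravity_vocab(
    base_tokens: List[bytes],
    candidates: Dict[bytes, float],
    vocab_size: int = 1024,
) -> List[bytes]:
    """Fill the merge slots with the highest-gravity candidates.

    base_tokens: byte-level tokens that are always kept (assumed 259 here).
    candidates:  candidate token -> gravity (ablation-leverage) score.
    The vocabulary size stays exactly `vocab_size`.
    """
    n_merge_slots = vocab_size - len(base_tokens)
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    return base_tokens + ranked[:n_merge_slots]
```

With 1024 slots and 259 base tokens this keeps exactly 765 merge tokens, matching the slot count reported in the PR.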
🤖 Generated with Claude Code