Record: Backoff N-gram Cache + LeakyReLU(0.9)² (val_bpb=0.6678)#806

Closed
ibarrajo wants to merge 1 commit into openai:main from ibarrajo:submission/ngram-cache-0.6678

Conversation

@ibarrajo

Summary

  • val_bpb: 0.6678 (seed 1337, additional seeds pending)
  • Multi-order backoff n-gram eval cache (orders 2-7) with entropy-adaptive alpha mixing
  • Distributed cache pre-fill for multi-GPU coherence (rank 7 pre-fills 54M tokens in 68s)
  • LeakyReLU(0.9)² activation (~0.013 BPB improvement over relu²)
  • Neural base: 1.1371 BPB (sliding window), n-gram cache: 0.6678 BPB
  • Artifact: 8.6MB (well under 16MB limit)
  • 8xH100 SXM, 7189 steps in 600s, eval in 200s
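The LeakyReLU(0.9)² activation isn't spelled out in the diff shown here. A minimal sketch, assuming it mirrors the relu² activation it replaces (apply LeakyReLU with negative slope 0.9, then square — the function name is hypothetical):

```python
import numpy as np

def leaky_relu_sq(x, negative_slope=0.9):
    # Hypothetical sketch: LeakyReLU with slope 0.9, then squared,
    # by analogy with the relu^2 activation the PR says it improves on.
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y
```

Note that squaring makes the output non-negative on both sides: for x < 0 the branch contributes 0.81·x², so the negative lobe is only slightly damped relative to the positive one.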

Key implementation details

  • Score-first legality: Every token scored under inference_mode() BEFORE cache update
  • Entropy-adaptive alpha: 0.05 + 0.55 * sigmoid(2*(H-4)) — no oracle/hindsight selection
  • Pre-fill: Each GPU rank pre-populates cache with all preceding tokens (pure numpy, no NCCL)
  • No pre-eval TTT — removed illegal pre-eval adaptation entirely
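As a rough sketch of the entropy-adaptive mixing above — the alpha formula is quoted from the PR, but treating H as the base-2 entropy of the neural distribution and mixing at the probability level are assumptions, not the PR's code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adaptive_alpha(H):
    # Formula quoted in the PR: alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4))
    return 0.05 + 0.55 * sigmoid(2.0 * (H - 4.0))

def mix(neural_probs, ngram_probs):
    # Hypothetical mixing step: when the neural model is uncertain
    # (high entropy H), lean more on the n-gram cache.
    # alpha stays in the open interval (0.05, 0.60).
    p = np.asarray(neural_probs, dtype=np.float64)
    H = -np.sum(p * np.log2(np.clip(p, 1e-12, 1.0)))
    a = adaptive_alpha(H)
    return (1.0 - a) * p + a * np.asarray(ngram_probs, dtype=np.float64)
```

Because alpha depends only on the neural model's own entropy at the current position, the mixing weight is computed without looking at the correct token, consistent with the "no oracle/hindsight selection" claim.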

Results

Eval method                      val_bpb
Non-overlapping (post-quant)     1.1594
Sliding window (stride=64)       1.1371
N-gram cache (orders 2-7)        0.6678

Test plan

  • Validated on 1xH100 (0.8556 BPB with undertrained model)
  • Full run on 8xH100 SXM (0.6678 BPB)
  • 2 additional seeds for statistical significance
  • Verify reproducibility from records/ folder

🤖 Generated with Claude Code

Multi-order backoff n-gram eval cache (orders 2-7) with entropy-adaptive
alpha mixing and distributed cache pre-fill for multi-GPU coherence.
Neural base 1.1371 BPB, n-gram cache drops to 0.6678. 8xH100 SXM,
7189 steps in 600s. Single seed (1337), additional seeds pending.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

MatoTeziTanka commented Mar 26, 2026

Clean implementation — the distributed cache pre-fill solving the multi-GPU table fragmentation problem is a useful contribution, and the 8.6 MB artifact size gives you a lot of headroom.

Heads up: this currently has 1 seed. The leaderboard requires 3-seed validation with statistical significance for record claims. Just flagging so it's on your radar before review.


Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.

@ibarrajo
Author

Closing: the n-gram cache BPB scores are invalid due to a normalization bug (the cache scores only the correct token, without normalizing over the full vocabulary). Separately, n-gram caches were ruled illegal on March 27.
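For reference, the fix the bug points at is normalizing over the full vocabulary rather than scoring only the observed token. A minimal sketch of a multi-order backoff cache that does normalize correctly — the class layout, add-one smoothing, and parameter names here are hypothetical, not the PR's code:

```python
from collections import defaultdict

class BackoffNgramCache:
    # Hypothetical sketch: counts per order (2-7 by default), backing off
    # from the highest order whose context has been seen.
    def __init__(self, orders=range(2, 8), vocab_size=256):
        self.orders = sorted(orders, reverse=True)
        self.vocab_size = vocab_size
        self.counts = {n: defaultdict(lambda: defaultdict(int)) for n in self.orders}

    def update(self, tokens):
        for n in self.orders:
            for i in range(len(tokens) - n + 1):
                ctx = tuple(tokens[i:i + n - 1])
                self.counts[n][ctx][tokens[i + n - 1]] += 1

    def prob(self, context, token):
        # Add-one smoothing over the full vocabulary, so probabilities
        # for any fixed context sum to exactly 1 — the property the
        # buggy scorer lacked.
        for n in self.orders:
            ctx = tuple(context[-(n - 1):])
            bucket = self.counts[n].get(ctx)
            if bucket:
                total = sum(bucket.values())
                return (bucket.get(token, 0) + 1) / (total + self.vocab_size)
        return 1.0 / self.vocab_size
```

Without that full-vocabulary denominator, per-token "probabilities" can sum to more or less than 1, which deflates the reported BPB.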

@ibarrajo ibarrajo closed this Mar 31, 2026