
Record: VRL + Full GPTQ + 5-gram Cache + Hidden-State kNN-LM (3-seed mean val_bpb=1.0970)#738

Closed
gowtham0992 wants to merge 1 commit into openai:main from gowtham0992:submission/VRL-FullGPTQ-NgramKNN-1.0970

Conversation


@gowtham0992 gowtham0992 commented Mar 25, 2026

Summary

3-seed mean val_bpb: 1.0970 (std 0.0006) | ≤15.74 MB | 8×H100 SXM, 598s training | No TTT

Key Innovations

Hidden-State kNN-LM (novel — first in this competition): stores 512-dim hidden states from already-scored tokens in a GPU ring buffer. For uncertain tokens, it finds the k=32 nearest neighbors by L2 distance and builds a non-parametric distribution via an RBF kernel, following Khandelwal et al. 2019 (ICLR 2020). This captures semantic repetition that n-grams miss, adding a further -0.007 BPB on top of the n-gram cache.
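
The mixing step described above can be sketched as follows. This is an illustrative reconstruction, not the PR's actual code: the function name `knn_lm_mix`, the temperature `tau`, and the fixed mixing weight `lam` are assumptions (the PR scales the weight adaptively by model uncertainty).

```python
# Hedged sketch of hidden-state kNN-LM interpolation (Khandelwal et al., ICLR 2020).
# All names, shapes, and default values here are illustrative assumptions.
import torch

def knn_lm_mix(hidden, keys, values, logits, k=32, tau=1.0, lam=0.25):
    """Mix model logits with a non-parametric kNN distribution.

    hidden: (d,)   query hidden state for the current token
    keys:   (N, d) ring buffer of hidden states from already-scored tokens
    values: (N,)   token id that followed each stored hidden state
    logits: (V,)   model logits for the current position
    """
    d2 = torch.cdist(hidden[None], keys).squeeze(0) ** 2    # squared L2 distances to buffer
    knn = torch.topk(-d2, k)                                # indices of the k nearest neighbors
    w = torch.softmax(-d2[knn.indices] / tau, dim=0)        # RBF-kernel weights over neighbors
    p_knn = torch.zeros_like(logits)
    p_knn.scatter_add_(0, values[knn.indices], w)           # accumulate neighbor votes per token
    p_model = torch.softmax(logits, dim=0)
    return (1 - lam) * p_model + lam * p_knn                # interpolated next-token distribution
```

Because the ring buffer holds only already-scored tokens, the lookup stays strictly backward-looking, consistent with the compliance notes below.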

Online 5-gram cache with adaptive lambda: backward-looking n-gram cache with backoff and a pre-committed confidence gate (no safety gate / oracle selection, per the #677 ruling). The adaptive lambda scales the mixing weight by model uncertainty.
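
A minimal sketch of the cache and the adaptive lambda, under stated assumptions: the class name, the backoff rule (longest matching context wins), and the `lam_max` cap are illustrative, not the PR's pre-committed values.

```python
# Hedged sketch of a backward-looking 5-gram cache with backoff and an
# uncertainty-scaled mixing weight. Names and constants are assumptions.
from collections import defaultdict

class NgramCache:
    def __init__(self, max_order=5):
        self.max_order = max_order
        # counts[n][context_tuple][next_token] = occurrence count
        self.counts = {n: defaultdict(lambda: defaultdict(int))
                       for n in range(1, max_order + 1)}

    def update(self, tokens):
        """Record contexts from tokens already scored (strictly backward-looking)."""
        for n in range(1, self.max_order + 1):
            for i in range(n, len(tokens)):
                self.counts[n][tuple(tokens[i - n:i])][tokens[i]] += 1

    def predict(self, context):
        """Back off from the longest matching context to shorter ones."""
        for n in range(self.max_order, 0, -1):
            ctx = tuple(context[-n:])
            if len(ctx) == n and ctx in self.counts[n]:
                dist = self.counts[n][ctx]
                total = sum(dist.values())
                return {tok: c / total for tok, c in dist.items()}
        return {}

def adaptive_lambda(entropy, max_entropy, lam_max=0.3):
    """More model uncertainty -> more weight on the cache distribution."""
    return lam_max * min(1.0, entropy / max_entropy)
```

The pre-committed confidence gate would then apply the cache only when `adaptive_lambda` (or an equivalent entropy threshold) exceeds a value fixed before evaluation, avoiding any oracle-style per-token selection.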

Results

| Seed | Post-cache bpb | Artifact |
|------|----------------|----------|
| 42   | 1.0976         | 15.68 MB |
| 1337 | 1.0965         | 15.74 MB |
| 2024 | 1.0970         | 15.55 MB |
| Mean | 1.0970         |          |

vs SOTA (1.1194): improvement = 0.0224 bpb

Compliance (per #677)

  • GPTQ calibration inside 600s training budget (total_train_time ~598s in all logs)
  • No safety gate / oracle selection — pre-committed confidence gate
  • No training data accessed at eval time
  • N-gram + kNN caches strictly backward-looking
  • All artifacts under 16MB, all eval under 600s

Reproduction

```shell
pip install sentencepiece zstandard
pip install flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291 --no-deps
python3 data/cached_challenge_fineweb.py --variant sp1024
SEED=42 python3 -m torch.distributed.run --nproc_per_node=8 train_gpt.py
```



## Credits

- kNN-LM: Khandelwal et al. 2019 (ICLR 2020)
- 5-gram cache concept: PR #659 by @deanbrr (our implementation uses pre-committed gate)
- VRL: arxiv:2410.17897, PR #569 by @gowtham0992
- Full GPTQ: IST-DASLab/gptq (ICLR 2023)
- LeakyReLU²: PR #493 by @parinzee
- Base stack: PR #414 by @signalrush

@gowtham0992 gowtham0992 force-pushed the submission/VRL-FullGPTQ-NgramKNN-1.0970 branch from e2b0749 to c458cef Compare March 25, 2026 16:34
MichaelMcCulloch pushed a commit to MichaelMcCulloch/parameter-golf that referenced this pull request Mar 29, 2026
- Submission train_gpt.py with all 32 techniques from the execution plan,
  each gated by environment variables (disabled by default)
- Optuna-based search framework with validate mode (per-technique smoke test)
  and search mode (TPE over joint technique + model size space)
- Ablation infrastructure (ablation.py, shell scripts) for tracking experiments
- PR source files for reference (openai#505, openai#569, openai#576, openai#727, openai#738)
- Execution plan document

Techniques span architecture (activations, HybridNorm, SmearGate, DiffAttn,
PoPE, WaveletGPT, VGA, XSA), training (EMA, SWA, QAT, MTP), quantization
(variable bit-width, OptRot, GPTQ, pruning, entropy coding), and eval-time
(TTT-LoRA, n-gram cache, kNN-LM, TurboQuant KV compression).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>