Parameter Golf Entry: GPTQ-lite Quantization

Entry for the OpenAI Parameter Golf competition: train the best language model that fits in 16MB and trains in 10 minutes on 8xH100 SXM, scored by bits-per-byte (bpb) on the FineWeb validation set.

Current Best: 1.1257 bpb (sliding window, standard track, rank ~3)

Submission: PR #379

Approach

Built on PR #374's SOTA stack (1.1246 bpb) with a novel post-training quantization improvement.

Novel Contribution: GPTQ-lite

Standard int6 quantization clips at the row-wise absolute maximum. GPTQ-lite instead searches five clip percentiles per weight matrix (100%, 99.9%, 99.5%, 99%, 98%) and keeps the one minimizing the reconstruction error ||W - dequant(quant(W))||. This reduces quantization degradation at zero training cost: no new parameters, and only ~10s of extra quantization time.
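A minimal NumPy sketch of the search described above (illustrative only, not the competition code; the actual script operates on PyTorch weight tensors, and the function names here are invented):

```python
import numpy as np

PERCENTILES = [1.0, 0.999, 0.995, 0.99, 0.98]  # clip candidates
QMAX = 31  # symmetric int6 grid: integers in [-31, 31]

def quant_dequant(w, clip):
    """Row-wise symmetric int6 quantize/dequantize at a given clip value."""
    scale = np.maximum(clip, 1e-8) / QMAX          # one scale per row
    q = np.clip(np.round(w / scale), -QMAX, QMAX)  # snap to int6 grid
    return q * scale                               # back to float

def gptq_lite(w):
    """Pick the clip percentile minimizing ||W - dequant(quant(W))||_F."""
    best_err, best = np.inf, w
    for p in PERCENTILES:
        # per-row clip threshold at the p-th percentile of |w|
        clip = np.quantile(np.abs(w), p, axis=1, keepdims=True)
        w_hat = quant_dequant(w, clip)
        err = np.linalg.norm(w - w_hat)
        if err < best_err:
            best_err, best = err, w_hat
    return best
```

Because the plain abs-max clip (p = 100%) is always in the candidate set, the search can never do worse than the standard scheme; it only helps when outlier weights inflate the row scale.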

Architecture Stack

| Technique | Source | Effect |
|---|---|---|
| 11 layers, dim=512, MLP3x | Community consensus | More depth + wider MLP |
| Int6 QAT with STE | PR #135 | Better compression; MLP3x fits in 16MB |
| Tight SWA (scale < 0.2, every 50 steps) | PR #374 | Zero-penalty weight averaging |
| XSA (Exclusive Self Attention) on last 4 layers | PR #287 | Removes self-value redundancy |
| Partial RoPE (16/64 dims) | PR #315 | Position-free attention on 75% of dims |
| LN scale (1/sqrt(layer+1)) | PR #315 | Dampens deeper layers |
| Late QAT (enabled at lr_scale < 0.1) | PR #315 | 3x less quantization degradation |
| Value embedding (shared, layers 9-10) | PR #374 | Re-injects token identity in deep layers |
| zstd-22 compression | PR #86 | ~15-20% smaller artifact than zlib-9 |
| Muon optimizer, WD=0.04 | Community consensus | Better weight distributions |
| SmearGate + BigramHash(2048) | PR #135 | Token blending + bigram context |
| FA3 (FlashAttention 3, Hopper) | Built from source | ~85 ms/step on H100 |
| Sliding-window eval (stride=64) | PR #56 | Better context per scored token |
| GPTQ-lite clip search | Novel | Per-layer optimal quantization |
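The int6 QAT row above relies on a straight-through estimator (STE): the forward pass sees quantized weights while gradients bypass the non-differentiable rounding. A hedged PyTorch sketch with illustrative names (not the PR #135 implementation):

```python
import torch

QMAX = 31  # symmetric int6 grid: integers in [-31, 31]

def fake_quant_int6(w: torch.Tensor) -> torch.Tensor:
    """Fake-quantize to int6 with a straight-through estimator.

    Forward: w is rounded onto the int6 grid (row-wise abs-max scale).
    Backward: rounding is treated as identity, so gradients flow to
    the full-precision weights unchanged.
    """
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / QMAX
    w_q = (w / scale).round().clamp(-QMAX, QMAX) * scale
    return w + (w_q - w).detach()  # STE: quantized value, identity gradient
```

During QAT the forward pass substitutes `fake_quant_int6(w)` for `w`, so the trained weights adapt to the grid they will be deployed on.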

Run Log

All experiments ran on 8xH100 SXM via Modal; all bpb figures use sliding-window evaluation (stride=64).

| Run | Config | bpb | Steps | ms/step | Notes |
|---|---|---|---|---|---|
| baseline | 9L vanilla | 1.2272 | 12351 | ~49 | Starting point |
| int6_mlp3x | 9L + int6/QAT/MLP3x | 1.1657 | 8325 | ~72 | First competitive result |
| resformer_off (NGC 25.01) | PR #287 base, SDPA fallback | 1.1362 | 5975 | 100 | FA3 unavailable |
| resformer_off (NGC 25.03) | PR #287 base, FA2 | 1.1315 | 6060 | 99 | Better torch.compile |
| diffattn (FA2) | PR #315 + DiffAttn, 4 FA calls | 1.1552 | 3882 | 154 | Great per-step quality, too slow |
| diffattn (FA3) | PR #315 + DiffAttn, FA3 | 1.2312 | 1863 | 322 | FA3 kernel overhead made it worse |
| sdttt_gptq | PR #374 + SDTTT + GPTQ, FA3 | 1.1260 | 6701 | 89.6 | SDTTT hurt (-0.0003), GPTQ helped |
| gptq_only | PR #374 + GPTQ, FA3 | 1.1257 | 6733 | 89.1 | Best result |
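The sliding-window bpb reported above can be sketched as follows. This assumes the common stride-based evaluation scheme (the window advances by `stride` tokens and only newly covered tokens are scored); the exact PR #56 implementation and window size may differ:

```python
import math

def sliding_window_spans(n_tokens: int, window: int = 1024, stride: int = 64):
    """Plan a stride-based sliding-window evaluation.

    Returns (context_start, end, scored_start) triples: the model sees
    tokens [context_start, end) but the loss counts only [scored_start,
    end), so after the first window every scored token has at least
    window - stride tokens of preceding context.
    """
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

def bpb(total_nll_nats: float, total_bytes: int) -> float:
    """Bits-per-byte from the summed negative log-likelihood (in nats)."""
    return total_nll_nats / (math.log(2) * total_bytes)
```

The scored spans tile the sequence exactly once, so summing the NLL over them and dividing by ln(2) times the byte count gives bpb directly.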

Key Lessons Learned

What worked

  • GPTQ-lite: Per-layer clip percentile search reduces quantization degradation. Simple, fast, effective.
  • FA3 from source: Building flash_attn_interface from the hopper/ subdirectory of Dao-AILab/flash-attention. Required for competitive step times.
  • Leaderboard scraping: Built leaderboard.py to track 200+ PRs and extract technique frequency. Critical for strategy.
  • Trusting community data: Other participants' validated results (3-seed) are reliable data points. Don't re-run what's already proven.

What didn't work

  • Differential Attention (4-call): 0.046 bpb better per step, but 4 FA calls = 154-322ms/step. Net negative. The architecture is right but the implementation is too expensive.
  • Even/odd head differential: Zero-cost DiffAttn variant. Crashed 3 times with FA3 + torch.compile + DDP incompatibilities.
  • Self-distillation TTT: KL-divergence adaptation post-training. Slightly negative (-0.0003 bpb). The KL constraint was too weak to improve on the already well-trained model.
  • ResFormer value residual: Zero params, zero cost, but neutral at 26M scale. May help at larger scales.
  • Depth recurrence: 50+ competition attempts by various participants, best result 0.044 behind SOTA. Unique layers beat shared layers at 16MB.
  • NorMuon: Training instability without warmup. Noisy initial EMA + weight decay = destructive feedback loop.
  • Too many changes at once: Stacking 6 changes simultaneously regressed badly. Always bisect.

Key insights

  • Quantization is the biggest untapped gap: SOTA loses 0.007 bpb to int6 quantization (~30% of total improvement over baseline). Most competition focus is on training quality.
  • Eval-time compute is unconstrained: The 10-min limit applies to training only. Post-training optimizations (GPTQ, TTT, SWA averaging) have unlimited time.
  • FA3 vs FA2 matters: ~15% step time difference = ~0.005 bpb from more training steps alone.
  • The competition converges fast: With 200+ participants, simple technique combinations are found within hours. Novel contributions need to be non-obvious.

Quick Start

See RUN_GUIDE.md for running on your own 8xH100 node.

```shell
# Clone and run
git clone https://github.com/dannywillowliu-uchi/parameter-golf-entry.git
cd parameter-golf-entry
GPTQ_ENABLED=1 torchrun --standalone --nproc_per_node=8 pr374_train_gpt.py
```

Files

| File | Purpose |
|---|---|
| pr374_train_gpt.py | Main training script (PR #374 base + GPTQ-lite + SDTTT) |
| sweep.py | Experiment orchestration (configs, Modal submission, logging) |
| modal_runner.py | Modal cloud runner (NGC 25.03, FA3 build, 8xH100) |
| leaderboard.py | Live leaderboard scraper with technique-frequency analysis |
| results.jsonl | All experiment results with full stdout/stderr |
| train_gpt.py | Earlier modified training script (int6, QAT, custom kernels) |
| kernels/ | Custom Triton kernels (fused RMSNorm, cross-entropy, FP8) |
| RUN_GUIDE.md | Setup guide for running on external GPU nodes |

Competition

  • Competition: OpenAI Parameter Golf (March 18 - April 30, 2026)
  • Constraint: 16MB artifact, 10 min on 8xH100 SXM
  • Metric: Bits-per-byte (bpb) on FineWeb validation set
  • Leaderboard: parameter-golf.github.io
  • Current SOTA: 1.1246 bpb (PR #374)
  • Our position: 1.1257 bpb, rank ~3 (PR #379)

