Entry for the OpenAI Parameter Golf competition: train the best language model that fits in 16MB and trains in 10 minutes on 8xH100 SXM, scored by bits-per-byte (bpb) on the FineWeb validation set.
Submission: PR #379
Built on PR #374's SOTA stack (1.1246 bpb) with a novel post-training quantization improvement.
Standard int6 quantization uses row-wise absolute max for clipping. GPTQ-lite searches 5 clip percentiles per weight matrix (100%, 99.9%, 99.5%, 99%, 98%) and selects the one minimizing reconstruction error ||W - dequant(quant(W))||. This reduces quantization degradation at zero training cost. Zero new parameters, ~10s extra quantization time.
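The clip search can be sketched in a few lines. This is a minimal NumPy sketch of the idea, not the competition code: the helper names (`quantize_int6`, `gptq_lite_clip_search`) and the symmetric ±31 int6 range are assumptions for illustration.

```python
import numpy as np

def quantize_int6(w, clip):
    """Row-wise symmetric int6 quantization at the given per-row clip values."""
    scale = np.maximum(clip, 1e-8) / 31.0      # assumed symmetric range [-31, 31]
    q = np.clip(np.round(w / scale), -31, 31)  # quantize
    return q * scale                           # dequantize (reconstruction)

def gptq_lite_clip_search(w, percentiles=(100.0, 99.9, 99.5, 99.0, 98.0)):
    """Pick the clip percentile minimizing ||W - dequant(quant(W))||_F."""
    absw = np.abs(w)
    best = None
    for p in percentiles:
        clip = np.percentile(absw, p, axis=1, keepdims=True)  # row-wise clip
        w_hat = quantize_int6(w, clip)
        err = np.linalg.norm(w - w_hat)
        if best is None or err < best[0]:
            best = (err, p, w_hat)
    return best  # (reconstruction error, chosen percentile, reconstructed W)
```

Because 100% (plain abs-max) is in the candidate set, the search can only match or improve on standard row-wise quantization for every weight matrix.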
| Technique | Source | Effect |
|---|---|---|
| 11 layers, dim=512, MLP3x | Community consensus | More depth + wider MLP |
| Int6 QAT with STE | PR #135 | Better compression, MLP3x fits in 16MB |
| Tight SWA (scale<0.2, every 50 steps) | PR #374 | Zero-penalty weight averaging |
| XSA (Exclusive Self Attention) on last 4 layers | PR #287 | Removes self-value redundancy |
| Partial RoPE (16/64 dims) | PR #315 | Position-free attention on 75% of dims |
| LN Scale (1/sqrt(layer+1)) | PR #315 | Dampens deeper layers |
| Late QAT (enable at lr_scale < 0.1) | PR #315 | 3x less quantization degradation |
| Value Embedding (shared, layers 9-10) | PR #374 | Re-injects token identity in deep layers |
| zstd-22 compression | PR #86 | ~15-20% smaller artifact than zlib-9 |
| Muon optimizer, WD=0.04 | Community consensus | Better weight distributions |
| SmearGate + BigramHash(2048) | PR #135 | Token blending + bigram context |
| FA3 (FlashAttention 3, Hopper) | Built from source | ~85ms/step on H100 |
| Sliding window eval (stride=64) | PR #56 | Better context per scored token |
| GPTQ-lite clip search | Novel | Per-layer optimal quantization |
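To make one of the table rows concrete, partial RoPE (16/64 dims) rotates only the first slice of each head dimension and leaves the rest position-free. The sketch below is illustrative, assuming the common half/half pairing convention within the rotary slice; `partial_rope` is a hypothetical helper, not the actual training-script function.

```python
import numpy as np

def partial_rope(q, theta=10000.0, rope_dims=16):
    """Apply rotary embeddings to the first `rope_dims` of head_dim;
    the remaining dims stay position-free. q: (seq, head_dim)."""
    seq, head_dim = q.shape
    half = rope_dims // 2
    inv_freq = theta ** (-np.arange(half) / half)   # per-pair frequencies
    ang = np.arange(seq)[:, None] * inv_freq        # (seq, half) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = q[:, :half], q[:, half:rope_dims]      # paired rotary coordinates
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, q[:, rope_dims:]], axis=1)
```

The rotation is norm-preserving and position 0 is left unchanged, so only relative position information enters the rotated 16 dims.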
All experiments ran on 8xH100 SXM via Modal. bpb is reported with sliding-window eval (stride=64).
| Run | Config | bpb | Steps | ms/step | Notes |
|---|---|---|---|---|---|
| baseline | 9L vanilla | 1.2272 | 12351 | ~49 | Starting point |
| int6_mlp3x | 9L + int6/QAT/MLP3x | 1.1657 | 8325 | ~72 | First competitive result |
| resformer_off (NGC 25.01) | PR#287 base, SDPA fallback | 1.1362 | 5975 | 100 | FA3 unavailable |
| resformer_off (NGC 25.03) | PR#287 base, FA2 | 1.1315 | 6060 | 99 | Better torch.compile |
| diffattn (FA2) | PR#315 + DiffAttn, 4 FA calls | 1.1552 | 3882 | 154 | Per-step quality great, too slow |
| diffattn (FA3) | PR#315 + DiffAttn, FA3 | 1.2312 | 1863 | 322 | FA3 kernel overhead worse |
| sdttt_gptq | PR#374 + SDTTT + GPTQ, FA3 | 1.1260 | 6701 | 89.6 | SDTTT hurt (-0.0003), GPTQ helped |
| gptq_only | PR#374 + GPTQ, FA3 | 1.1257 | 6733 | 89.1 | Best result |
- GPTQ-lite: Per-layer clip percentile search reduces quantization degradation. Simple, fast, effective.
- FA3 from source: Built `flash_attn_interface` from the `hopper/` subdirectory of Dao-AILab/flash-attention. Required for competitive step times.
- Leaderboard scraping: Built `leaderboard.py` to track 200+ PRs and extract technique frequency. Critical for strategy.
- Trusting community data: Other participants' validated results (3-seed) are reliable data points. Don't re-run what's already proven.
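The frequency-analysis part of that scraping can be sketched as below. The keyword list and the `technique_frequency` helper are illustrative assumptions, not the actual `leaderboard.py` internals.

```python
from collections import Counter
import re

# Hypothetical sketch: given scraped PR titles, count how many PRs
# mention each known technique keyword.
TECHNIQUES = ("int6", "qat", "swa", "rope", "muon", "fa3", "gptq")

def technique_frequency(pr_titles):
    counts = Counter()
    for title in pr_titles:
        tokens = set(re.findall(r"[a-z0-9]+", title.lower()))
        for tech in TECHNIQUES:
            if tech in tokens:
                counts[tech] += 1
    return counts
```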
- Differential Attention (4-call): 0.046 bpb better per step, but 4 FA calls = 154-322ms/step. Net negative. The architecture is right but the implementation is too expensive.
- Even/odd head differential: Zero-cost DiffAttn variant. Crashed 3 times with FA3 + torch.compile + DDP incompatibilities.
- Self-distillation TTT: KL-divergence adaptation post-training. Slightly negative (-0.0003 bpb). The KL constraint was too weak to improve on the already well-trained model.
- ResFormer value residual: Zero params, zero cost, but neutral at 26M scale. May help at larger scales.
- Depth recurrence: 50+ competition attempts by various participants, best result 0.044 behind SOTA. Unique layers beat shared layers at 16MB.
- NorMuon: Training instability without warmup. Noisy initial EMA + weight decay = destructive feedback loop.
- Too many changes at once: Stacking 6 changes simultaneously regressed badly. Always bisect.
- Quantization is the biggest untapped gap: SOTA loses 0.007 bpb to int6 quantization (~30% of total improvement over baseline). Most competition focus is on training quality.
- Eval-time compute is unconstrained: The 10-min limit applies to training only. Post-training optimizations (GPTQ, TTT, SWA averaging) have unlimited time.
- FA3 vs FA2 matters: ~15% step time difference = ~0.005 bpb from more training steps alone.
- The competition converges fast: With 200+ participants, simple technique combinations are found within hours. Novel contributions need to be non-obvious.
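The arithmetic behind the FA3-vs-FA2 lesson can be checked directly from the results table; `steps_in_budget` is an illustrative helper, and the ms/step figures are the table's measured values.

```python
# Back-of-envelope: a step-time gap at a fixed 10-minute budget
# translates directly into extra optimizer steps.
BUDGET_S = 10 * 60                    # training wall-clock budget, seconds

def steps_in_budget(ms_per_step):
    return int(BUDGET_S * 1000 / ms_per_step)

fa3_steps = steps_in_budget(89.1)     # gptq_only run (FA3)
fa2_steps = steps_in_budget(99.0)     # resformer_off run (FA2)
extra = fa3_steps - fa2_steps         # several hundred extra training steps
```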
See RUN_GUIDE.md for running on your own 8xH100 node.
```bash
# Clone and run
git clone https://github.com/dannywillowliu-uchi/parameter-golf-entry.git
cd parameter-golf-entry
GPTQ_ENABLED=1 torchrun --standalone --nproc_per_node=8 pr374_train_gpt.py
```

| File | Purpose |
|---|---|
| `pr374_train_gpt.py` | Main training script (PR #374 base + GPTQ-lite + SDTTT) |
| `sweep.py` | Experiment orchestration (configs, Modal submission, logging) |
| `modal_runner.py` | Modal cloud runner (NGC 25.03, FA3 build, 8xH100) |
| `leaderboard.py` | Live leaderboard scraper with technique frequency analysis |
| `results.jsonl` | All experiment results with full stdout/stderr |
| `train_gpt.py` | Our earlier modified training script (int6, QAT, kernels) |
| `kernels/` | Custom Triton kernels (fused RMSNorm, cross-entropy, FP8) |
| `RUN_GUIDE.md` | Setup guide for running on external GPU nodes |
- Competition: OpenAI Parameter Golf (March 18 - April 30, 2026)
- Constraint: 16MB artifact, 10 min on 8xH100 SXM
- Metric: Bits-per-byte (bpb) on FineWeb validation set
- Leaderboard: parameter-golf.github.io
- Current SOTA: 1.1246 bpb (PR #374)
- Our position: 1.1257 bpb, rank ~3 (PR #379)