
11L 512d Int8+Zlib Baseline (val_bpb 1.2135, 3-seed) #858

Open
nickferrantelive wants to merge 99 commits into openai:main from nickferrantelive:submission/2026-03-26_20M_Int8Zlib_Baseline


@nickferrantelive

Record: 11L 512d Int8+Zlib Baseline

val_bpb: 1.2135 (3-seed mean) | 15.54 MB (mean) | 8xH100 SXM, 599s

Summary

Baseline train_gpt.py with NUM_LAYERS=11 (up from the default 9). All other hyperparameters are stock defaults. This submission demonstrates the baseline architecture properly scaled with additional depth on 8xH100 SXM hardware.

Changes from Naive Baseline

| Change | Baseline | This PR | Impact |
| --- | --- | --- | --- |
| Layers | 9 | 11 | +2 layers (20.7M vs ~17M params) |
| Everything else | Default | Default | No other changes |
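The ~3.7M-parameter delta from the two extra layers can be sanity-checked from the dimensions listed in the Architecture section. This is a rough sketch that assumes no biases and ignores norm parameters; it is not code from train_gpt.py:

```python
# Hedged estimate of the per-layer parameter count, assuming the
# dimensions stated in this PR (512d, 8 heads / 4 KV heads, 2x MLP)
# and ignoring biases and normalization parameters.
D_MODEL = 512
N_HEADS, N_KV_HEADS = 8, 4
HEAD_DIM = D_MODEL // N_HEADS        # 64
KV_DIM = N_KV_HEADS * HEAD_DIM       # 256 (GQA: K/V narrower than Q)
MLP_HIDDEN = 2 * D_MODEL             # 1024 (2x expansion)

attn = (D_MODEL * D_MODEL            # Q projection
        + 2 * D_MODEL * KV_DIM       # K and V projections
        + D_MODEL * D_MODEL)         # output projection
mlp = 2 * D_MODEL * MLP_HIDDEN       # up + down projections
per_layer = attn + mlp

print(f"per layer: {per_layer:,}; +2 layers: {2 * per_layer:,}")
# ~1.84M params/layer, so +2 layers is ~3.67M -- consistent with
# the 20.7M vs ~17M figure in the table above.
```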

Results (3 seeds, 8xH100 SXM)

| Seed | Steps | val_loss | val_bpb | Artifact |
| --- | --- | --- | --- | --- |
| 1337 | 11,181 | 2.0484 | 1.2132 | 15.54 MB |
| 42 | 11,185 | 2.0490 | 1.2135 | 15.54 MB |
| 2025 | 11,182 | 2.0493 | 1.2137 | 15.54 MB |

Mean: 1.2135 | Std: 0.0003
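For readers comparing val_loss against val_bpb: the standard conversion divides the nats-per-token cross-entropy by ln(2) and rescales by the validation set's tokens-per-byte ratio. A minimal sketch, with placeholder counts since the actual token and byte totals of the val shard are not stated in this PR:

```python
import math

def bits_per_byte(val_loss_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) to bits per byte:
    loss / ln(2) gives bits/token; scale by tokens-per-byte."""
    return (val_loss_nats / math.log(2)) * (n_tokens / n_bytes)

# Hypothetical counts for illustration only -- substitute the real
# token/byte totals of the validation shard:
print(bits_per_byte(2.0484, n_tokens=41_000, n_bytes=100_000))
```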

Architecture

  • 11 transformer layers, 512-dim, 8 heads (4 KV heads, GQA)
  • 2x MLP expansion (1024 hidden)
  • U-Net skip connections (5 encoder, 6 decoder)
  • Tied embeddings, logit softcap=30.0
  • Vocab size 1024 (SentencePiece BPE)
  • Muon optimizer, int8+zlib quantization
  • Total artifact: 15,541,950 bytes (well under 16MB cap)
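The "int8+zlib" step above can be sketched as symmetric per-row quantization (one float32 scale per weight row) followed by zlib on the packed bytes. Function names here are illustrative, not taken from train_gpt.py:

```python
import zlib
import numpy as np

def quantize_int8_per_row(w: np.ndarray):
    """Symmetric per-row int8 quantization: one float32 scale per row."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0              # avoid div-by-zero on all-zero rows
    q = np.round(w / scale).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

def pack(q: np.ndarray, scale: np.ndarray) -> bytes:
    """Serialize int8 weights + per-row scales, then zlib-compress."""
    return zlib.compress(q.tobytes() + scale.tobytes(), level=9)

# Round-trip on a dummy weight matrix: error is bounded by half a
# quantization step per row, and the artifact is far smaller than raw fp32.
w = np.random.default_rng(0).normal(size=(1024, 512)).astype(np.float32)
q, s = quantize_int8_per_row(w)
blob = pack(q, s)
print(len(blob), "compressed vs", w.nbytes, "raw float32 bytes")
```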

Run Command

```shell
NUM_LAYERS=11 SEED=1337 \
DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Notes

This is a non-SOTA submission demonstrating baseline scaling. We have several novel techniques in development (TTT, GPTQ-lite, SmearGate, BigramHash, PolarQuant, hybrid RNN-attention architectures) that we plan to submit as improved records.

Checklist

  • README.md with detailed explanation
  • submission.json with metadata
  • Training logs for 3 seeds
  • train_gpt.py (stock baseline, compiles and runs)
  • All seeds under 16MB and under 10 minutes on 8xH100 SXM

Nick Ferrante added 30 commits March 24, 2026 22:58
…1; add step12 GPTQ-lite mixed int6/int8 + zstd quantization
Nick Ferrante added 29 commits March 26, 2026 00:36
…ted to 8xH100. M3 projects best (1.24), M1 Codec is natural A+B hybrid
…s): full component inventory, compatibility matrix, budget analysis, and 7 strategic questions for multi-LLM analysis
- 11 transformer layers (up from baseline 9), 512d, 8 heads, 4 KV heads
- U-Net skip connections, Muon optimizer, tied embeddings
- Int8 per-row quantization + zlib compression
- 3-seed verification: 1.2132, 1.2135, 1.2137 (std=0.0003)
- All seeds under 16MB (15.54MB), under 10min (599s) on 8xH100 SXM
nickferrantelive pushed a commit to nickferrantelive/parameter-golf that referenced this pull request Mar 26, 2026
