SubSixteen v2: Int6 QAT + MLP 3x + SWA + Sliding Window (val_bpb 1.1708) #69

Open
TevBenji wants to merge 4 commits into openai:main from TevBenji:subsixteen-submission

Conversation

@TevBenji TevBenji commented Mar 19, 2026

SubSixteen v2: Int6 QAT + MLP 3x + SWA + Sliding Window

val_bpb: 1.1708 | Artifact: 14,603,588 bytes (under 16MB)

Results

| Metric | Value |
| --- | --- |
| Post-quant val_bpb (sliding window) | 1.1708 |
| Pre-quant val_bpb (step 9722) | 1.1732 |
| Quantization penalty | +0.002 BPB |
| Artifact size | 14,603,588 bytes |
| Training steps | 9,722 (wallclock-limited) |
| Step avg | 61.72 ms |
| Peak memory | 10,138 MiB |
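As a quick sanity check on the size budget (assuming the 16 MB cap is the binary 16 MiB = 16,777,216 bytes), the reported artifact leaves roughly 2 MiB of headroom:

```python
CAP_BYTES = 16 * 1024 * 1024   # 16 MiB cap (assumed binary interpretation)
ARTIFACT_BYTES = 14_603_588    # reported artifact size

headroom = CAP_BYTES - ARTIFACT_BYTES
assert ARTIFACT_BYTES < CAP_BYTES
print(f"headroom: {headroom} bytes ({headroom / 2**20:.2f} MiB)")
```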

Architecture

  • 9-layer GPT, 512-dim, GQA (8 heads, 4 KV heads), U-Net skip connections
  • MLP 3x expansion (hidden=1536), relu² activation
  • 1024 vocab SentencePiece BPE, tied embeddings
  • Overtone SVD init, phase-transition resid_mix, NTK-aware RoPE
  • Logit soft-capping (tanh, cap=30)
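The logit soft-cap above can be sketched as follows (a minimal scalar version; the model would apply it elementwise to the logit tensor, and the function name here is illustrative):

```python
import math

def soft_cap(logit, cap=30.0):
    """Squash a logit smoothly into (-cap, cap) via tanh.
    Near zero, tanh(x) ~ x, so small logits pass through almost unchanged;
    extreme logits saturate at +/- cap, bounding the loss gradient."""
    return cap * math.tanh(logit / cap)
```

Small logits are nearly identity-mapped, while outliers are clamped smoothly rather than with a hard cutoff.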

Key Techniques (v2)

  1. STE fake-int6 QAT: CastedLinear weights fake-quantized to [-31,31] during forward via Straight-Through Estimator. Model learns distributions that survive 6-bit post-training quantization.
  2. MLP 3x expansion: Hidden dim 1536 (up from 1024), enabled by the ~4MB of artifact space that int6 frees up. Layer count drops from 10 to 9 to stay under 16MB.
  3. SWA (Stochastic Weight Averaging): 16 checkpoints collected every 200 steps during warmdown, averaged for export.
  4. zstd-22 compression: Better compression ratio than zlib-9 for the quantized artifact.
  5. Sliding window eval: stride=64, seq_len=4096 for near-full-context scoring.
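A minimal sketch of the per-row int6 fake-quantization from technique 1. Names are illustrative, not the PR's actual code; in the real QAT forward pass the round/clamp would be wrapped in a straight-through estimator (e.g. `w + (q - w).detach()` in PyTorch) so gradients bypass the non-differentiable rounding:

```python
def fake_quant_int6_row(row):
    """Quantize one weight row to the int6 grid [-31, 31] and back.
    Per-row absmax scaling: the largest |weight| maps to +/-31."""
    scale = max(abs(w) for w in row) / 31.0
    if scale == 0.0:               # all-zero row: nothing to quantize
        return list(row)
    q = [max(-31, min(31, round(w / scale))) for w in row]
    return [qi * scale for qi in q]  # dequantized values the forward pass sees
```

Because the model trains against these dequantized values, its weight distributions adapt to the 6-bit grid, which is consistent with the tiny +0.002 BPB post-quantization penalty reported above.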

Optimizer

  • Muon (momentum=0.99, lr=0.02) for matrix params with Newton-Schulz orthogonalization
  • AdamW for embeddings (lr=0.03) and scalars (lr=0.02)
  • Warmdown: 3000 iters, wallclock-aware schedule
  • Momentum warmup: 0.92 → 0.99 over 1500 steps
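The momentum warmup above is presumably a simple linear ramp; a sketch under that assumption (the function name and ramp shape are guesses, not the PR's code):

```python
def muon_momentum(step, warmup_steps=1500, start=0.92, end=0.99):
    """Linearly ramp Muon momentum from `start` to `end` over
    `warmup_steps` optimizer steps, then hold at `end`."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```

Starting with lower momentum keeps early, noisy gradients from dominating the update direction before the loss landscape settles.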

Run command

```
pip install zstandard
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Trained on 8xH100 SXM, 600s wallclock cap.
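The sliding-window evaluation (technique 5) scores a long sequence by advancing a fixed-size context in small steps, so every scored token sees close to a full 4096-token left context. A sketch of just the window layout (positions only, no model; names are illustrative):

```python
def sliding_eval_windows(n_tokens, seq_len=4096, stride=64):
    """Return (window_start, first_scored, end) triples.
    The first window scores all of its tokens; each later window scores
    only the tokens not yet covered (at most `stride` of them), so every
    scored token has at least seq_len - stride tokens of left context."""
    windows = []
    scored_to = 0
    start = 0
    while scored_to < n_tokens:
        end = min(start + seq_len, n_tokens)
        windows.append((start, scored_to, end))  # score tokens [scored_to, end)
        scored_to = end
        start += stride
    return windows
```

With stride=64 this costs roughly seq_len/stride = 64x more forward passes than a single chunked pass, which is why it is an eval-time technique only.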

phaesoo added a commit to phaesoo/parameter-golf that referenced this pull request Mar 19, 2026
openai#77, openai#78)

Analyzed techniques, ablations, and individual BPB contributions.
Key finding: sliding window eval (~0.034) and int6+wider MLP (~0.029)
are the dominant validated techniques. Several promising combinations
remain untested across submissions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@TevBenji TevBenji changed the title [WIP] SubSixteen: Ternary QAT + Depth Recurrence + TTT (val_bpb pending) SubSixteen: Sliding Window + FP16 Embed + 10L Flat + Muon WD + Overtone Init (val_bpb 1.1764) Mar 20, 2026
@TevBenji TevBenji marked this pull request as ready for review March 20, 2026 02:47

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7977ed06eb


@TevBenji TevBenji changed the title SubSixteen: Sliding Window + FP16 Embed + 10L Flat + Muon WD + Overtone Init (val_bpb 1.1764) SubSixteen v2: Int6 QAT + MLP 3x + SWA + Sliding Window (val_bpb 1.1708) Mar 20, 2026
9-layer GPT, 512-dim, GQA (8h/4kv), MLP 3x (hidden=1536)
Int6 per-row QAT with STE fake-quantization during training
SWA: 16 checkpoints averaged during warmdown
zstd-22 compression, sliding window eval (stride=64, seq_len=4096)
Artifact: 14,603,588 bytes | 9,722 steps at 61.72ms/step on 8xH100
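The SWA export described in this commit message (16 checkpoints averaged during warmdown) amounts to a uniform average of parameter snapshots. A minimal sketch with plain dicts of lists standing in for state dicts (illustrative only; the real code would average tensors):

```python
def swa_average(checkpoints):
    """Uniformly average a list of parameter snapshots (name -> list of floats)."""
    n = len(checkpoints)
    return {
        name: [sum(ckpt[name][i] for ckpt in checkpoints) / n
               for i in range(len(checkpoints[0][name]))]
        for name in checkpoints[0]
    }
```

Averaging late-training snapshots tends to land in a flatter region of the loss surface than any single checkpoint, which is the usual motivation for SWA before export.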
@TevBenji TevBenji force-pushed the subsixteen-submission branch from d01ef1e to 5e64abd Compare March 20, 2026 03:43
- Add .hypothesis/ directory to ignored files
- Ignore hypothesis test framework cache artifacts
- Maintain consistency with existing cache exclusions
…al support

- Add modal_train_retrocache.py for distributed training on Modal cloud platform
- Add run_train.sh and runpod_setup.sh for local and RunPod execution
- Add test_retrocache.py with RetroCache validation tests
- Update v38_TightSWA_RetroCache record with RetroCache implementation details
- Configure RetroCache hyperparameters (32 topk, 24 beta, 0.35 lambda_max)
- Support both cached and non-cached evaluation modes via environment flags
- Implement symlink management for persistent volume data mounting
- Add streaming command execution for real-time training output visibility
