SubSixteen v2: Int6 QAT + MLP 3x + SWA + Sliding Window (val_bpb 1.1708) #69

Open
TevBenji wants to merge 4 commits into openai:main from TevBenji:subsixteen-submission

Conversation

@TevBenji TevBenji commented Mar 19, 2026

SubSixteen v2: Int6 QAT + MLP 3x + SWA + Sliding Window

val_bpb: 1.1708 | Artifact: 14,603,588 bytes (under 16MB)

Results

| Metric | Value |
| --- | --- |
| Post-quant val_bpb (sliding window) | 1.1708 |
| Pre-quant val_bpb (step 9722) | 1.1732 |
| Quantization penalty | +0.002 BPB |
| Artifact size | 14,603,588 bytes |
| Training steps | 9,722 (wallclock-limited) |
| Step avg | 61.72 ms |
| Peak memory | 10,138 MiB |
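As a quick sanity check on the size budget (assuming the 16 MB cap is the binary 16 MiB = 16,777,216 bytes), the reported artifact leaves roughly 2 MiB of headroom:

```python
CAP_BYTES = 16 * 1024 * 1024   # 16 MiB cap (assumed binary interpretation)
ARTIFACT_BYTES = 14_603_588    # reported artifact size

headroom = CAP_BYTES - ARTIFACT_BYTES
assert ARTIFACT_BYTES < CAP_BYTES
print(f"headroom: {headroom} bytes ({headroom / 2**20:.2f} MiB)")
```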

Architecture

  • 9-layer GPT, 512-dim, GQA (8 heads, 4 KV heads), U-Net skip connections
  • MLP 3x expansion (hidden=1536), relu² activation
  • 1024 vocab SentencePiece BPE, tied embeddings
  • Overtone SVD init, phase-transition resid_mix, NTK-aware RoPE
  • Logit soft-capping (tanh, cap=30)
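The logit soft-cap above can be sketched as follows (a minimal scalar version; the model would apply it elementwise to the logit tensor, and the function name here is illustrative):

```python
import math

def soft_cap(logit, cap=30.0):
    """Squash a logit smoothly into (-cap, cap) via tanh.
    Near zero, tanh(x) ~ x, so small logits pass through almost unchanged;
    extreme logits saturate at +/- cap, bounding the loss gradient."""
    return cap * math.tanh(logit / cap)
```

Small logits are nearly identity-mapped, while outliers are clamped smoothly rather than with a hard cutoff.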

Key Techniques (v2)

  1. STE fake-int6 QAT: CastedLinear weights fake-quantized to [-31,31] during forward via Straight-Through Estimator. Model learns distributions that survive 6-bit post-training quantization.
  2. MLP 3x expansion: Hidden dim 1536 (up from 1024), enabled by the ~4MB of artifact space that int6 frees up. Layer count drops from 10 to 9 to stay under 16MB.
  3. SWA (Stochastic Weight Averaging): 16 checkpoints collected every 200 steps during warmdown, averaged for export.
  4. zstd-22 compression: Better compression ratio than zlib-9 for the quantized artifact.
  5. Sliding window eval: stride=64, seq_len=4096 for near-full-context scoring.
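A minimal sketch of the per-row int6 fake-quantization from technique 1. Names are illustrative, not the PR's actual code; in the real QAT forward pass the round/clamp would be wrapped in a straight-through estimator (e.g. `w + (q - w).detach()` in PyTorch) so gradients bypass the non-differentiable rounding:

```python
def fake_quant_int6_row(row):
    """Quantize one weight row to the int6 grid [-31, 31] and back.
    Per-row absmax scaling: the largest |weight| maps to +/-31."""
    scale = max(abs(w) for w in row) / 31.0
    if scale == 0.0:               # all-zero row: nothing to quantize
        return list(row)
    q = [max(-31, min(31, round(w / scale))) for w in row]
    return [qi * scale for qi in q]  # dequantized values the forward pass sees
```

Because the model trains against these dequantized values, its weight distributions adapt to the 6-bit grid, which is consistent with the tiny +0.002 BPB post-quantization penalty reported above.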

Optimizer

  • Muon (momentum=0.99, lr=0.02) for matrix params with Newton-Schulz orthogonalization
  • AdamW for embeddings (lr=0.03) and scalars (lr=0.02)
  • Warmdown: 3000 iters, wallclock-aware schedule
  • Momentum warmup: 0.92 → 0.99 over 1500 steps
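The momentum warmup above is presumably a simple linear ramp; a sketch under that assumption (the function name and ramp shape are guesses, not the PR's code):

```python
def muon_momentum(step, warmup_steps=1500, start=0.92, end=0.99):
    """Linearly ramp Muon momentum from `start` to `end` over
    `warmup_steps` optimizer steps, then hold at `end`."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```

Starting with lower momentum keeps early, noisy gradients from dominating the update direction before the loss landscape settles.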

Run command

```
pip install zstandard
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Trained on 8xH100 SXM, 600s wallclock cap.
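The sliding-window evaluation (technique 5) scores a long sequence by advancing a fixed-size context in small steps, so every scored token sees close to a full 4096-token left context. A sketch of just the window layout (positions only, no model; names are illustrative):

```python
def sliding_eval_windows(n_tokens, seq_len=4096, stride=64):
    """Return (window_start, first_scored, end) triples.
    The first window scores all of its tokens; each later window scores
    only the tokens not yet covered (at most `stride` of them), so every
    scored token has at least seq_len - stride tokens of left context."""
    windows = []
    scored_to = 0
    start = 0
    while scored_to < n_tokens:
        end = min(start + seq_len, n_tokens)
        windows.append((start, scored_to, end))  # score tokens [scored_to, end)
        scored_to = end
        start += stride
    return windows
```

With stride=64 this costs roughly seq_len/stride = 64x more forward passes than a single chunked pass, which is why it is an eval-time technique only.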

phaesoo added a commit to phaesoo/parameter-golf that referenced this pull request Mar 19, 2026
openai#77, openai#78)

Analyzed techniques, ablations, and individual BPB contributions.
Key finding: sliding window eval (~0.034) and int6+wider MLP (~0.029)
are the dominant validated techniques. Several promising combinations
remain untested across submissions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@TevBenji TevBenji changed the title [WIP] SubSixteen: Ternary QAT + Depth Recurrence + TTT (val_bpb pending) SubSixteen: Sliding Window + FP16 Embed + 10L Flat + Muon WD + Overtone Init (val_bpb 1.1764) Mar 20, 2026
@TevBenji TevBenji marked this pull request as ready for review March 20, 2026 02:47

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7977ed06eb


@TevBenji TevBenji changed the title SubSixteen: Sliding Window + FP16 Embed + 10L Flat + Muon WD + Overtone Init (val_bpb 1.1764) SubSixteen v2: Int6 QAT + MLP 3x + SWA + Sliding Window (val_bpb 1.1708) Mar 20, 2026
9-layer GPT, 512-dim, GQA (8h/4kv), MLP 3x (hidden=1536)
Int6 per-row QAT with STE fake-quantization during training
SWA: 16 checkpoints averaged during warmdown
zstd-22 compression, sliding window eval (stride=64, seq_len=4096)
Artifact: 14,603,588 bytes | 9,722 steps at 61.72ms/step on 8xH100
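The SWA export described in this commit message (16 checkpoints averaged during warmdown) amounts to a uniform average of parameter snapshots. A minimal sketch with plain dicts of lists standing in for state dicts (illustrative only; the real code would average tensors):

```python
def swa_average(checkpoints):
    """Uniformly average a list of parameter snapshots (name -> list of floats)."""
    n = len(checkpoints)
    return {
        name: [sum(ckpt[name][i] for ckpt in checkpoints) / n
               for i in range(len(checkpoints[0][name]))]
        for name in checkpoints[0]
    }
```

Averaging late-training snapshots tends to land in a flatter region of the loss surface than any single checkpoint, which is the usual motivation for SWA before export.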
@TevBenji TevBenji force-pushed the subsixteen-submission branch from d01ef1e to 5e64abd Compare March 20, 2026 03:43
- Add .hypothesis/ directory to ignored files
- Ignore hypothesis test framework cache artifacts
- Maintain consistency with existing cache exclusions
…al support

- Add modal_train_retrocache.py for distributed training on Modal cloud platform
- Add run_train.sh and runpod_setup.sh for local and RunPod execution
- Add test_retrocache.py with RetroCache validation tests
- Update v38_TightSWA_RetroCache record with RetroCache implementation details
- Configure RetroCache hyperparameters (32 topk, 24 beta, 0.35 lambda_max)
- Support both cached and non-cached evaluation modes via environment flags
- Implement symlink management for persistent volume data mounting
- Add streaming command execution for real-time training output visibility
