Non-record: Negative Results — Architecture, TTT Variants, Quantization, and N-gram Cache Illegality by andrewbaggio1 · Pull Request #1186 · openai/parameter-golf

andrewbaggio1 · 2026-03-31T18:15:05Z

Summary

~15 experiments that didn't work or were marginal on the LeakyReLU(0.5)² stack (PR #518 architecture). Documenting these so others don't repeat them.

Architecture:

Depth recurrence (Huginn-style): +0.20 BPB — compute overhead dominates
TrigramHash: +0.045 BPB — quantization destroys small weights
MLP 3.25x: marginal gain but artifact exceeds 16MB
XSA-all (11 layers): neutral at MLP 3.0

TTT variants:

SGD+momentum: +0.065 BPB worse than AdamW — adaptive LR matters
MLP-only TTT: +0.237 BPB — needs meta-learning to work
TTT LR floor 0.05: +0.015 BPB — cosine should decay to 0
20 vs 30 epochs: 10 extra epochs worth ~0.03 BPB
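
To make the LR-floor result concrete, here is a minimal sketch of a cosine schedule with an optional floor. The function name `cosine_lr` and the `floor_frac` parameter are illustrative and not taken from the PR's code, and treating the floor of 0.05 as a fraction of the peak LR (rather than an absolute value) is an assumption.

```python
import math


def cosine_lr(step, total_steps, peak_lr, floor_frac=0.0):
    """Cosine decay from peak_lr down to floor_frac * peak_lr.

    floor_frac=0.0 decays all the way to zero; floor_frac=0.05 stops at
    5% of the peak (an assumed reading of the "LR floor 0.05" variant).
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so out-of-range steps stay well defined.
    progress = min(max(step / total_steps, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    floor = floor_frac * peak_lr
    return floor + (peak_lr - floor) * cosine


# Compare the final-step LR for the two variants (hypothetical peak LR).
final_to_zero = cosine_lr(100, 100, peak_lr=1e-3)
final_with_floor = cosine_lr(100, 100, peak_lr=1e-3, floor_frac=0.05)
print(final_to_zero, final_with_floor)
```

With a floor, the last TTT steps still take non-trivial updates instead of annealing to a stop, which is one plausible reading of why the floored variant scored worse.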

Quantization:

int5 post-training swap: catastrophic — must train with int5 QAT
1xH100 training: not a viable proxy for 8xH100 (8x fewer optimizer steps)
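
The int5 swap failure, and the earlier note that quantization destroys small weights, can be illustrated with a toy round-to-nearest quantizer. This is a sketch under stated assumptions: per-tensor symmetric scaling and the helper name `quantize_symmetric` are hypothetical, and the PR's actual quantizer may differ.

```python
def quantize_symmetric(weights, bits):
    """Per-tensor symmetric round-to-nearest quantization.

    Returns the dequantized values so they can be compared directly
    with the original floats. Assumed scheme, not the PR's quantizer.
    """
    qmax = 2 ** (bits - 1) - 1  # 15 for int5, 127 for int8
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / qmax
    dequantized = []
    for w in weights:
        # Round to the nearest integer level, then clamp to the signed range.
        q = max(-qmax - 1, min(qmax, round(w / scale)))
        dequantized.append(q * scale)
    return dequantized


# Toy tensor: one large weight sets the scale, two small ones sit below
# half a quantization step at 5 bits.
toy_weights = [1.0, -0.5, 0.02, -0.01, 0.3]
print(quantize_symmetric(toy_weights, bits=5))
print(quantize_symmetric(toy_weights, bits=8))
```

At 5 bits the step size is 1/15 of the largest weight, so anything smaller than half a step rounds to zero. A model trained without int5 QAT has no incentive to keep its weights away from that dead zone.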

N-gram cache:

Hashed n-gram caches are fundamentally broken, not just illegal. Hash collisions inflate all token probabilities, so the "improvement" tracks collision density, not prediction quality. Confirmed by abaybektursun's analysis in PR #886 (RFC: How to Clean Up All the Parameter Golf Submissions).
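
The collision effect can be reproduced with a toy hashed bigram cache. This is a sketch, not the cache from any submission: the class `HashedBigramCache`, its modular hash, and the choice to keep context counts exact (so that only numerator collisions are visible) are all assumptions made for illustration.

```python
from collections import Counter


class HashedBigramCache:
    """Toy bigram cache whose (prev, tok) counts live in a hashed table."""

    def __init__(self, vocab_size, num_buckets):
        self.vocab_size = vocab_size
        self.num_buckets = num_buckets
        self.pair_counts = [0] * num_buckets  # hashed, so pairs can collide
        self.ctx_counts = Counter()           # exact, to isolate the effect

    def _bucket(self, prev, tok):
        # Injective when num_buckets >= vocab_size ** 2, lossy otherwise.
        return (prev * self.vocab_size + tok) % self.num_buckets

    def update(self, tokens):
        for prev, tok in zip(tokens, tokens[1:]):
            self.pair_counts[self._bucket(prev, tok)] += 1
            self.ctx_counts[prev] += 1

    def prob(self, prev, tok):
        seen = self.ctx_counts[prev]
        if seen == 0:
            return 0.0
        return self.pair_counts[self._bucket(prev, tok)] / seen

    def probability_mass(self, prev):
        """Sum of prob(prev, tok) over the vocabulary; 1.0 if normalized."""
        seen = self.ctx_counts[prev]
        if seen == 0:
            return 0.0
        total = sum(self.pair_counts[self._bucket(prev, tok)]
                    for tok in range(self.vocab_size))
        return total / seen


tokens = [0, 1, 2, 3, 0, 1, 2, 3, 4, 5, 6, 7]
collision_free = HashedBigramCache(vocab_size=8, num_buckets=64)
collision_free.update(tokens)
crowded = HashedBigramCache(vocab_size=8, num_buckets=4)
crowded.update(tokens)
print(collision_free.probability_mass(0), crowded.probability_mass(0))
```

In the crowded table every bucket absorbs counts from unrelated pairs, so each token's estimate can only go up and the total mass for a context exceeds 1. A score computed from such estimates is not a proper log-probability, which is consistent with the claim that the apparent gain tracks collision density.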

Key takeaway

At 16MB, architecture is converged. Eval-time AdamW TTT with cosine LR is the remaining legal lever. Everything else is noise.

Test plan

All experiments ran to completion
BPB numbers verified from logs
Illegal approaches clearly labeled
Documentation only — no artifacts or code

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ments Documents ~15 experiments that didn't work or were marginal on the LeakyReLU² stack. Covers depth recurrence, TrigramHash, MLP expansion, SGD vs AdamW TTT, int5 post-training swap, n-gram cache illegality, and 1xH100 vs 8xH100 viability. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

andrewbaggio1 and others added 2 commits March 25, 2026 11:34

Record: Cosine TTT + N-gram Cache (3-seed mean val_bpb=0.9850)

30ba8e4

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

notapplica mentioned this pull request Mar 31, 2026

⛳ Parameter Golf Live AI Commentary ⛳ + Analysis / Ideas | every 10 minutes #140

Open

oneKn8 mentioned this pull request Apr 1, 2026

Non-record: Universal Transformer (4h unlimited compute track) #1206

Open


andrewbaggio1 wants to merge 2 commits into openai:main from andrewbaggio1:negative-results-mar31
