Non-record: 11L int5/int6 + XSA + online TTT w/ decay prior (single-run val_bpb=1.1520) by JackYoung27 · Pull Request #302 · openai/parameter-golf

JackYoung27 · 2026-03-21T02:53:47Z

Summary

Three things are new:

Pre-Q/K RMSNorm - extra rms_norm on attention input before Q and K projections only (V gets raw input). Stabilizes the RoPE-facing path under int5/int6.
Online causal TTT with decay prior - full-weight SGD adaptation during eval with Krause-style decay (p += λ(p₀ - p)) to prevent drift. Adapts MLP weights in last 3 blocks only, per TTT-E2E.
Reptile meta-learning (last 10%) - K=1 inner step + Reptile interpolation to improve eval-time TTT adaptation.
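
For readers skimming the list above, here is a minimal sketch of the two update rules it names: the decay prior p += λ(p₀ - p) and the Reptile interpolation. It is not the PR's code. It uses plain Python lists in place of tensors, and the function names, the `lr`, `decay`, and `eps` values, and the choice to apply the decay after the SGD step are all assumptions made for illustration.

```python
# Illustrative sketch only; not taken from the PR's training script.
# Parameters are plain lists of floats standing in for weight tensors.

def ttt_step_with_decay(params, anchor, grads, lr=0.01, decay=0.1):
    """One online TTT update: an SGD step, then a pull back toward the anchor.

    params: current (adapted) weights
    anchor: the pre-adaptation weights p0 that the decay prior pulls toward
    grads:  gradient of the eval-time loss w.r.t. params
    Assumption: decay is applied after the gradient step.
    """
    updated = []
    for p, p0, g in zip(params, anchor, grads):
        p = p - lr * g            # plain SGD step
        p = p + decay * (p0 - p)  # decay prior: p += lambda * (p0 - p)
        updated.append(p)
    return updated


def reptile_interpolate(params, adapted, eps=0.5):
    """Reptile outer update: move params a fraction eps toward the weights
    obtained after the inner step(s)."""
    return [p + eps * (q - p) for p, q in zip(params, adapted)]


if __name__ == "__main__":
    anchor = [1.0, -2.0]
    params = [3.0, 0.0]           # weights that have drifted from the anchor
    zero_grads = [0.0, 0.0]
    # With zero gradient, each step only shrinks the gap to the anchor.
    for _ in range(3):
        params = ttt_step_with_decay(params, anchor, zero_grads, decay=0.5)
    print(params)
```

With a zero gradient the gap to the anchor shrinks by a factor of (1 - λ) per step, which is the sense in which the decay term limits drift during a long evaluation stream.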

Also uses XSA in last 3 layers (#265), int5-MLP/int6-attn (#180), BigramHash(10240) (#180), and the standard SOTA stack.

Single seed, posting as non-record to share the TTT+decay approach.

Seed	val_bpb (TTT+sliding)	Artifact
1337	1.1520	15.1 MB

Non-record: 11L int5/int6 + XSA + online TTT w/ decay prior (single-r…

8e1895e

notapplica mentioned this pull request Mar 21, 2026

⛳ Parameter Golf Live AI Commentary ⛳ + Analysis / Ideas | every 10 minutes #140

Open