
Non-record: 11L int5/int6 + XSA + online TTT w/ decay prior (single-run val_bpb=1.1520) #302

Open
JackYoung27 wants to merge 1 commit into openai:main from JackYoung27:ttt-decay-prior


@JackYoung27

Summary

Three things are new:

  1. Pre-Q/K RMSNorm - extra rms_norm on the attention input before the Q and K projections only (V gets the raw input). Stabilizes the RoPE-facing path under int5/int6 quantization.

  2. Online causal TTT with decay prior - full-weight SGD adaptation during eval with Krause-style decay (p += λ(p₀ - p)) to prevent drift. Adapts MLP weights in last 3 blocks only, per TTT-E2E.

  3. Reptile meta-learning (last 10%) - K=1 inner step + Reptile interpolation to improve eval-time TTT adaptation.
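As a rough illustration of item 1, the sketch below normalizes the attention input for the Q and K projections while leaving the V projection on the raw input. It is a minimal pure-Python sketch, not the PR's code: the names `rms_norm`, `matvec`, and `qkv_with_pre_qk_norm` are hypothetical, and the absence of a learned gain and the `eps` value are assumptions.

```python
import math


def rms_norm(x, eps=1e-6):
    """Scale a vector to roughly unit root-mean-square (assumed: no learned gain)."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [v * scale for v in x]


def matvec(w, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def qkv_with_pre_qk_norm(x, w_q, w_k, w_v):
    """Project one token: Q and K see the normalized input, V sees the raw input."""
    x_normed = rms_norm(x)
    q = matvec(w_q, x_normed)
    k = matvec(w_k, x_normed)
    v = matvec(w_v, x)  # V path is deliberately left un-normalized
    return q, k, v
```

With this arrangement, rescaling the input leaves Q and K essentially unchanged, while V still scales with the input.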
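The decay-prior update in item 2 can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the PR's implementation: `ttt_step`, `online_ttt`, and `loss_and_grads` are hypothetical names, the order (SGD step first, then the pull toward the pre-adaptation weights p₀) is assumed, and "causal" is taken to mean that each chunk is scored before the weights adapt on it.

```python
def ttt_step(params, prior, grads, lr=0.01, decay=0.001):
    """One eval-time update: SGD step, then pull back toward the prior weights."""
    updated = []
    for p, p0, g in zip(params, prior, grads):
        p = p - lr * g            # plain SGD on the adapted weight
        p = p + decay * (p0 - p)  # decay prior: p += λ(p₀ - p)
        updated.append(p)
    return updated


def online_ttt(chunks, params, loss_and_grads, lr=0.01, decay=0.001):
    """Score each chunk with the current weights, then adapt on that chunk.

    `loss_and_grads(params, chunk)` is a caller-supplied (hypothetical)
    function returning (loss, grads) for the adapted parameters only.
    """
    prior = list(params)  # snapshot of the pre-adaptation weights p₀
    total_loss = 0.0
    for chunk in chunks:
        loss, grads = loss_and_grads(params, chunk)  # evaluate before adapting
        total_loss += loss
        params = ttt_step(params, prior, grads, lr, decay)
    return total_loss, params
```

With zero gradients, each step shrinks the gap to p₀ by a factor of (1 - λ), which is what keeps the adapted weights from drifting far from the trained model over a long evaluation stream.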

Also uses XSA in last 3 layers (#265), int5-MLP/int6-attn (#180), BigramHash(10240) (#180), and the standard SOTA stack.

Single seed, posting as non-record to share the TTT+decay approach.

| Seed | val_bpb (TTT+sliding) | Artifact |
|------|-----------------------|----------|
| 1337 | 1.1520                | 15.1 MB  |
