Record: Unified Attention + FA3 + Legal TTT (val_bpb=1.1412, 3-seed) #1202

Open
VirajDeshwal wants to merge 1 commit into openai:main from VirajDeshwal:submission/2026-03-31_UnifiedAttention_FA3_LegalTTT

VirajDeshwal commented Mar 31, 2026

Unified Attention + FA3 Head-Dim Padding + Legal Score-First TTT

val_bpb: 1.1412 (3-seed mean, std 0.0008) | ~15.97 MB | 8×H100 SXM

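| Seed | Step avg | Steps | Post-TTT val_bpb | Artifact size (bytes) |
|------|----------|-------|------------------|-----------------------|
| 1337 | 49.6 ms | 12,088 | 1.1416 | 15,991,687 |
| 42 | 49.6 ms | 12,109 | 1.1416 | 15,988,916 |
| 2025 | 49.6 ms | 12,103 | 1.1403 | 15,962,515 |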
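
The headline mean and standard deviation can be re-derived from the three per-seed post-TTT values in the table. The sketch below is only a sanity check on the reported numbers. It assumes the quoted std is the sample standard deviation (n-1 denominator), which is what reproduces 0.0008; the variable names are illustrative and not taken from the submission.

```python
import statistics

# Post-TTT val_bpb per seed, copied from the results table.
post_ttt_bpb = {1337: 1.1416, 42: 1.1416, 2025: 1.1403}

values = list(post_ttt_bpb.values())
mean_bpb = statistics.mean(values)
# Assumption: the reported std is the sample std (n - 1 denominator).
std_bpb = statistics.stdev(values)

print(f"mean val_bpb = {mean_bpb:.4f}")
print(f"sample std   = {std_bpb:.4f}")
```

With the population formula (`statistics.pstdev`) the spread comes out smaller, so the sample formula is the reading that agrees with the summary line.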

What's new

  • Unified Attention (Deshwal, 2026): A single W_unified projection replaces the separate Q/K/V projections. This gives 67% fewer attention projection parameters, which are reallocated to the MLP. Bands form naturally during training.
  • FA3 Head-Dim Padding: head_dim is zero-padded from 44 to 48 for Hopper FA3 compatibility. The padding is mathematically lossless and runs 1.57× faster than FA2, which allows about 12,100 steps in the 10-minute training budget.
  • Legal Score-First TTT: SGD on all parameters, 3 epochs per 32K chunk, stride=64.
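
The "mathematically lossless" claim for head-dim padding can be checked with a small NumPy sketch. This is not the submission's code and does not call FlashAttention; it is a plain, non-causal, single-head softmax attention used only to illustrate the idea. The helper names (`attention`, `pad_head_dim`) are hypothetical, and the sketch assumes the softmax scale is kept at 1/sqrt(44), the original head_dim, rather than being recomputed from the padded size.

```python
import numpy as np

def attention(q, k, v, scale):
    """Plain softmax attention for one head; q, k, v have shape (seq, head_dim)."""
    scores = (q @ k.T) * scale
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def pad_head_dim(x, target):
    """Zero-pad the last axis of x up to `target` columns."""
    pad = target - x.shape[-1]
    return np.pad(x, [(0, 0)] * (x.ndim - 1) + [(0, pad)])

rng = np.random.default_rng(0)
seq, head_dim, padded_dim = 16, 44, 48
q, k, v = (rng.standard_normal((seq, head_dim)) for _ in range(3))

# Assumption: the scale stays tied to the original head_dim (44), not 48.
scale = head_dim ** -0.5

ref = attention(q, k, v, scale)
padded_out = attention(
    pad_head_dim(q, padded_dim),
    pad_head_dim(k, padded_dim),
    pad_head_dim(v, padded_dim),
    scale,
)
out = padded_out[:, :head_dim]  # drop the padded output columns

print("max abs difference:", np.abs(out - ref).max())
```

The zero columns in Q and K add nothing to the dot products, so the attention weights are unchanged, and the zero columns in V only produce zero output columns that are sliced away. If a kernel derived its default scale from the padded size instead, the scores would change, which is why the sketch passes the scale explicitly.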

Timing

| Phase | Time |
|-------|------|
| Training (K=11, d=528, 4H) | 600s |
| Quantization + roundtrip | ~70s |
| Legal TTT | ~408s |
| Total | ~18 min (10+8) |
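
As a quick consistency check, the timing rows and the per-step average from the results table can be combined as follows. This is arithmetic on the numbers quoted in this PR, nothing more; the variable names are illustrative.

```python
# Phase durations in seconds, copied from the timing table.
train_s, quant_s, ttt_s = 600, 70, 408
# Average step time from the results table (49.6 ms).
step_avg_s = 0.0496

total_min = (train_s + quant_s + ttt_s) / 60   # full run
post_train_min = (quant_s + ttt_s) / 60        # quantization + TTT
est_steps = train_s / step_avg_s               # steps that fit in the training budget

print(f"total      ~ {total_min:.1f} min")
print(f"post-train ~ {post_train_min:.1f} min")
print(f"est. steps ~ {est_steps:.0f}")
```

The estimate lands within the range of the per-seed step counts (12,088 to 12,109), and the two phases round to the 10 + 8 minute split given in the table.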

VirajDeshwal force-pushed the submission/2026-03-31_UnifiedAttention_FA3_LegalTTT branch 4 times, most recently from 168cecf to 2882f3a (March 31, 2026 22:14)
VirajDeshwal force-pushed the submission/2026-03-31_UnifiedAttention_FA3_LegalTTT branch from 2882f3a to e11ed4b (March 31, 2026 22:59)