
Non-record: QAT & EMA negative results on SOTA stack (val_bpb=1.1426) #360

Open
MultiFe22 wants to merge 1 commit into openai:main from MultiFe22:first-try

Conversation

@MultiFe22

Summary

Results (8xH100 SXM, 600s)

| Config | Steps | val_bpb | Artifact | Delta |
|---|---|---|---|---|
| Baseline (PR #180 repro) | 6,684 | 1.1426 | 15.99 MB | |
| + QAT (warmup=500) | 6,143 | 1.1473 | 15.69 MB | +0.005 (worse) |
| + QAT + EMA | 4,546 | 1.1606 | 16.89 MB | +0.018 (worse) |
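The PR does not include its QAT code; a minimal sketch of the kind of fake int8 quantization that "QAT (warmup=500)" presumably enables after warmup is below. The symmetric per-tensor scaling and function name are assumptions for illustration, not the actual implementation:

```python
import numpy as np

def fake_quant_int8(w: np.ndarray) -> np.ndarray:
    """QAT-style fake quantization: round weights onto a symmetric
    int8 grid, then dequantize. Training sees quantization error in
    the forward pass; in a real framework gradients pass straight
    through. Scaling scheme is illustrative, not the PR's code.
    """
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale

w = np.array([0.5, -1.27, 0.003])
wq = fake_quant_int8(w)  # small values snap to zero; large ones survive
```

The extra round/clip/dequant work per step is consistent with the step-count drop (6,143 vs 6,684) under a fixed wall clock.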

Key findings

  • QAT: better compression (15.69 MB vs 15.99 MB), but 8% fewer steps in the fixed wall clock; net negative on val_bpb
  • EMA: `.cpu().clone()` on every step forces a device-to-host copy, costing 32% of throughput; catastrophic under a wall-clock budget
  • Implication: techniques that trade training steps for inference quality are counterproductive under the 10-minute budget
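The `.cpu().clone()` finding concerns where the shadow weights live, not the EMA math itself. A minimal numpy sketch of the update follows (the decay value is hypothetical); in a torch loop the fix would be updating a preallocated device-side buffer in place rather than cloning every parameter to host each step:

```python
import numpy as np

def ema_update(shadow: np.ndarray, param: np.ndarray, decay: float = 0.999) -> None:
    """In-place EMA update: shadow <- decay*shadow + (1-decay)*param.

    The math is identical whether the shadow lives on GPU or CPU.
    The reported 32% throughput loss comes from materializing a fresh
    host copy per step (torch's `.cpu().clone()`), which serializes a
    device-to-host transfer into the training loop; an in-place update
    of a preallocated buffer avoids that.
    """
    shadow *= decay
    shadow += (1.0 - decay) * param

shadow = np.zeros(4)
param = np.ones(4)
for _ in range(3):
    ema_update(shadow, param, decay=0.5)
# shadow moves geometrically toward param: 0.5, 0.75, 0.875
```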

Test plan

  • Baseline reproduction matches published SOTA (1.1426 vs 1.1428)
  • QAT ablation on 8xH100
  • QAT+EMA ablation on 8xH100
  • All logs included

sahiee-dev added a commit to sahiee-dev/parameter-golf that referenced this pull request Mar 22, 2026
Dropped QAT: 8% throughput penalty kills 600s budget (per PR openai#360).

Three novel additions on thwu1 SOTA base (1.1428):
- TrigramHash(20480, dim=32): trigram embedding signal, bigram 10240->4096
- XSA: orthogonal self-value removal, last 4 layers, from PR openai#287
- TTT: 3-epoch SGD on val tokens before eval, all ranks, ~47s budget
  Fixed rank bug: TTT runs on all 8 ranks independently (not rank 0 only)

Artifact: ~15.64MB. Smoke tests passing. H100 validation pending.
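The commit above gives only the `TrigramHash(20480, dim=32)` signature; a hashed-trigram embedding of that shape might look like the following sketch. The hash constants and the per-position lookup are assumptions for illustration, not sahiee-dev's code:

```python
import numpy as np

def trigram_embed(token_ids, table, n_buckets=20480):
    """Hashed trigram embedding: hash each consecutive 3-token window
    into one of n_buckets rows of a learned table and emit that row at
    the window's final position. Bucket count and dim follow the commit
    (20480, 32); the hash mixing constants are illustrative.
    """
    out = np.zeros((len(token_ids), table.shape[1]))
    for i in range(2, len(token_ids)):
        a, b, c = token_ids[i - 2], token_ids[i - 1], token_ids[i]
        h = (a * 1000003 + b * 8191 + c) % n_buckets  # deterministic bucket
        out[i] = table[h]
    return out

rng = np.random.default_rng(0)
table = rng.standard_normal((20480, 32))
emb = trigram_embed([5, 9, 12, 5, 9, 12], table)
# identical trigrams map to the same bucket, so emb[2] == emb[5]
```

Hash collisions are the usual trade-off here: 20480 buckets keeps the table small (consistent with the ~15.64 MB artifact budget) at the cost of occasional trigram aliasing.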