Non-record: val_bpb=1.1374, FA2+SWA adaptation of Farnsworth #281
Closed
charmquark1984 wants to merge 2 commits into openai:main from
Conversation
the wd artifact size tradeoff is super useful data ngl, 0.042 for 15.5mb is a nice sweet spot. were you able to try ema instead of swa or nah
romainsantoli-web pushed a commit to romainsantoli-web/parameter-golf that referenced this pull request on Mar 21, 2026
…its) Combines techniques from PR openai#162, openai#180, openai#267, openai#281:
- 11-layer GPT with U-Net skip connections, GQA
- SmearGate + BigramHash(10240)
- Mixed int5/int6 quantization + 3% magnitude pruning
- Causal TTT at eval time
- SWA(frac=0.4), WD=0.042, Z-loss
- Target: sub-1.135 val_bpb

Awaiting RunPod 8xH100 credits for 3-seed validation.
Author
Closed because this uses non-causal TTT. New PR #375 describes learnings and negative results from keeping up with, and attempting to advance, the SOTA.
Farnsworth-Adapted: 11L MLP3x + INT6 + SmearGate + BigramHash + TTT + FA2 + WD Tuning
Score: val_bpb = 1.1381 (3-seed mean, sliding window stride=64, post-TTT)
Summary
Adapts the FarnsworthEngine architecture (PR #254) with FlashAttention 2 (instead of FA3 Hopper) and weight decay optimization for artifact size control. Key finding: cuDNN SDP is 40% faster per attention op than Flash SDP on H100 but produces worse model quality (1.1455 vs 1.1418 BPB). Weight decay directly controls compressed artifact size: WD=0.042 targets the optimal ~15.5MB.
Architecture
Results
Attribution
[SOTA-ADOPT] From FarnsworthEngine (PR #254)
[SOTA-ADOPT] From PR #236 (saml212)
[ORIGINAL] Findings
cuDNN SDP vs Flash SDP benchmark on H100: cuDNN is 40% faster per attention op (0.134 ms vs 0.221 ms) but produces worse BPB (1.1455 vs 1.1418). We verified this is a quality issue, not a speed tradeoff: within the same wall-clock budget, cuDNN completes more training steps yet still underperforms. This suggests cuDNN's internal accumulation precision differs from Flash SDP's in a way that hurts final model quality.
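The accumulation-precision hypothesis can be illustrated without the actual kernels. A minimal numpy sketch (not the cuDNN or Flash implementations, just naive attention) showing that the accumulation dtype alone measurably shifts attention outputs:

```python
import numpy as np

def sdpa(q, k, v, acc_dtype):
    """Naive scaled dot-product attention with a chosen accumulation dtype."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = (q.astype(acc_dtype) @ k.astype(acc_dtype).T) * scale
    scores = scores - scores.max(axis=-1, keepdims=True)  # stabilize softmax
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return (probs @ v.astype(acc_dtype)).astype(np.float32)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((128, 64)).astype(np.float32) for _ in range(3))
out_fp32 = sdpa(q, k, v, np.float32)
out_fp16 = sdpa(q, k, v, np.float16)
err = float(np.abs(out_fp32 - out_fp16).max())
print(f"max attention output delta from accumulation dtype: {err:.2e}")
```

Per-op deltas this small compound over depth and training steps, which is one plausible route from a kernel's accumulation choice to a final-BPB gap.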
Weight decay sweep for artifact size targeting: Systematic sweep from WD=0.040 to WD=0.050 revealed that WD=0.042 optimally targets 15.5MB (within the 16MB budget) while minimizing BPB:
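The mechanism behind this is entropy coding: heavier decay shrinks weight magnitudes, which concentrates the quantized values into fewer bins, so the compressor emits fewer bits. A hedged sketch of that relationship (a fixed int6 grid and zlib as a stand-in compressor, not the repo's actual packaging pipeline; the shrink factors are hypothetical stand-ins for increasing WD):

```python
import zlib
import numpy as np

def compressed_size_mb(weights: np.ndarray, scale: float) -> float:
    """zlib size of weights snapped to a fixed int6 grid [-31, 31] * scale."""
    q = np.clip(np.round(weights / scale), -31, 31).astype(np.int8)
    return len(zlib.compress(q.tobytes(), level=9)) / 1e6

rng = np.random.default_rng(0)
base = rng.standard_normal(1_000_000).astype(np.float32)
scale = float(np.abs(base).max()) / 31.0  # quant grid held fixed across the sweep

# Smaller weights on a fixed grid -> fewer occupied bins -> lower entropy.
sizes = [compressed_size_mb(base * shrink, scale) for shrink in (1.0, 0.5, 0.25)]
print([round(s, 3) for s in sizes])
```

The compressed size decreases monotonically with weight magnitude, which is why WD acts as a direct knob on artifact size independent of parameter count.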
QAT hurts at this scale: Enabling INT6 quantization-aware training (STE) during forward pass reduces the quant gap (0.005 vs 0.009 BPB) but increases training loss enough to negate the benefit (1.1466 vs 1.1374 overall).
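For readers unfamiliar with STE-based QAT, a minimal sketch of the idea (illustrative only, not this PR's training code): the forward pass sees weights snapped to the int6 grid, while gradients are treated as if quantization were the identity and applied to the full-precision shadow weights.

```python
import numpy as np

def fake_quant_int6(w, scale):
    """Forward: snap weights to the int6 grid [-31, 31] * scale.
    Backward (STE): gradient of this op is treated as identity."""
    return np.clip(np.round(w / scale), -31, 31) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(8).astype(np.float32)   # full-precision shadow weights
scale = float(np.abs(w).max()) / 31.0

# One illustrative SGD step: the loss gradient is evaluated at the quantized
# weights, then applied straight through to the float weights.
w_q = fake_quant_int6(w, scale)
grad_at_wq = 2 * w_q          # e.g. gradient of loss = sum(w_q ** 2)
w -= 0.01 * grad_at_wq        # STE: update the shadow weights directly
q = fake_quant_int6(w, scale)
print(q)
```

The finding above says this closes the quant gap but degrades the pre-quant optimum enough to lose overall at this scale.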
INT4 quantization is a dead end for this architecture: All-INT4 (clip=7) achieves excellent pre-quant BPB (1.1521) by fitting 33.5M params instead of 26.8M, but the 0.06 BPB quantization gap makes it strictly worse than INT6 with fewer params.
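The size of the gap follows from step size: halving the bit width roughly quadruples the quantization step, and round-to-nearest error scales with the step. A quick sketch comparing symmetric round-to-nearest at clip=7 (int4-style, as in this PR) vs clip=31 (int6-style) on a Gaussian weight proxy:

```python
import numpy as np

def quant_rmse(w: np.ndarray, clip: int) -> float:
    """RMSE of symmetric round-to-nearest quantization to [-clip, clip]."""
    scale = float(np.abs(w).max()) / clip
    q = np.clip(np.round(w / scale), -clip, clip)
    return float(np.sqrt(np.mean((q * scale - w) ** 2)))

rng = np.random.default_rng(0)
w = rng.standard_normal(1_000_000).astype(np.float32)
rmse4 = quant_rmse(w, 7)    # int4-style grid
rmse6 = quant_rmse(w, 31)   # int6-style grid
print(f"int4 RMSE {rmse4:.4f}  int6 RMSE {rmse6:.4f}  ratio {rmse4 / rmse6:.1f}x")
```

The roughly 31/7 ≈ 4.4x error ratio is why the extra 6.7M params at INT4 can't buy back the 0.06 BPB quantization gap.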
FA2 on H100 is competitive: Without the FA3 Hopper-native kernels, FA2.8.3 achieves ~66ms/step (vs Farnsworth's reported 81ms with FA3). The speed advantage doesn't fully translate to BPB (1.1374 vs 1.1303), suggesting FA3 may have different numerical properties that help model quality.
Reproduction
Timing Budget
What We'd Try Next