Fractal Transformer + Gravity + AttnRes — Pending Cloud Validation #1
Open
Conversation
added 3 commits on March 18, 2026 at 18:06
… gravity needs more steps
…alidation

- Architecture: 3 shared layers × 3 loops, 864d (vs. baseline: 9 unique layers, 512d)
- Learned gravity auxiliary losses at loop boundaries
- AttnRes replacing U-Net skips + resid_mix
- Local results: 7% BPB improvement over baseline (300 steps, directional only)
- Needs 8×H100 run for official numbers
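The shared-layer loop described above can be sketched as follows. This is a minimal illustration only: the class name is hypothetical, `nn.TransformerEncoderLayer` stands in for the real block, and the gravity auxiliary losses and AttnRes internals are not shown — only where the loop-boundary states they would consume come from.

```python
import torch
import torch.nn as nn

class FractalStack(nn.Module):
    """Sketch: 3 shared transformer blocks applied 3 times (9 effective layers).

    Only n_shared unique parameter sets exist; they are reused every loop,
    which is what frees budget for the wider 864d dimension.
    """
    def __init__(self, d_model=864, n_shared=3, n_loops=3):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(n_shared)
        )
        self.n_loops = n_loops

    def forward(self, x):
        aux_states = []  # states at loop boundaries, for auxiliary losses
        for _ in range(self.n_loops):
            for block in self.blocks:
                x = block(x)
            aux_states.append(x)  # a gravity-style aux loss would read these
        return x, aux_states
```

Each loop boundary emits a state, so a per-boundary auxiliary loss can be attached without adding parameters to the blocks themselves.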
newjordan pushed a commit that referenced this pull request on Mar 20, 2026
His 0.9695 BPB comes from val-only training (a separate track). His standard-track score is 1.1629, close to Larson's 1.1574. There is no novel architecture; he wins with tuning: stride-64 sliding window, seq_len=4096, mixed int6/int8 quantization, Muon momentum=0.99. Crucially, he uses NO weight sharing, meaning our fractal approach is an orthogonal improvement on top of his full stack. https://claude.ai/code/session_01RtoPPgJGUFS7XfcFCPwYtq
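A stride-64 sliding window restricts each token to attending over the previous 64 positions. A minimal mask sketch, assuming a standard causal sliding-window formulation (the window size comes from the comment above; everything else is illustrative):

```python
import torch

def sliding_window_mask(seq_len: int, window: int = 64) -> torch.Tensor:
    """Boolean mask, True where attention is allowed.

    Token i may attend to tokens j with i - window < j <= i,
    i.e. causal attention limited to the last `window` positions.
    """
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)
```

A mask like this can be passed (suitably converted) to an attention implementation's `attn_mask` argument; the point is that cost per token becomes O(window) instead of O(seq_len).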
newjordan pushed a commit that referenced this pull request on Mar 20, 2026
New features:
- MUON_WD: decoupled weight decay for the Muon optimizer (from the #1 entry)
- FP16_EMBED: keep tok_emb in fp16 instead of int6 (from the #1 entry)

Both are controlled via env vars and off by default. The leaderboard chase script tests stride=64, fp16 embed, Muon WD, and the full stack on MLP 3× base. All runs use early QAT (25%).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
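Decoupled weight decay (AdamW-style) applies the decay directly to the weights instead of folding it into the gradient, so it never passes through the optimizer's momentum or normalization. A generic sketch of what a MUON_WD-style toggle might do — the env-var name is from the commit, but the update below is illustrative, not the actual Muon step:

```python
import os
import torch

# Off by default, as in the commit; enabled by setting the env var.
MUON_WD = float(os.environ.get("MUON_WD", "0.0"))

def decoupled_update(param: torch.Tensor, update: torch.Tensor, lr: float) -> None:
    """Apply an optimizer update with decoupled weight decay.

    The decay term lr * wd * p shrinks the weight directly, separately
    from whatever `update` the optimizer produced.
    """
    with torch.no_grad():
        if MUON_WD > 0:
            param.mul_(1.0 - lr * MUON_WD)
        param.add_(update, alpha=-lr)
```

With the var unset the decay multiply is skipped entirely, which keeps the A/B comparison clean.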
newjordan pushed a commit that referenced this pull request on Mar 20, 2026
SWA: average model weights every 50 steps over the last 50% of training; smoother weight distributions quantize better. Controlled via SWA_EVERY and SWA_START_FRAC env vars. MuonWD bumped from 0.01 to 0.04 (PR 162 swept 0.01-0.05; 0.04 was optimal). MLP_HIDDEN=1408 to fit under 16MB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
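The SWA scheme described here — snapshot every SWA_EVERY steps, starting at SWA_START_FRAC of training — amounts to a running mean over parameter snapshots. A minimal sketch; the class and method names are hypothetical, only the cadence parameters mirror the commit:

```python
import torch

class SWATracker:
    """Running average of model weights, updated every `every` steps
    once training passes `start_frac` of `total_steps`."""
    def __init__(self, every=50, start_frac=0.5, total_steps=1000):
        self.every = every
        self.start = int(start_frac * total_steps)
        self.avg = None   # dict: param name -> running mean tensor
        self.n = 0

    def maybe_update(self, step, state_dict):
        if step < self.start or step % self.every != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = {k: v.detach().clone().float()
                        for k, v in state_dict.items()}
        else:
            for k, v in state_dict.items():
                # incremental mean: avg += (v - avg) / n
                self.avg[k] += (v.float() - self.avg[k]) / self.n
```

PyTorch also ships `torch.optim.swa_utils.AveragedModel` for this; the hand-rolled version above just makes the update rule explicit.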
newjordan pushed a commit that referenced this pull request on Mar 21, 2026
#1 untried combination from competition commentary: TTT (from #254) + XSA (from #265), estimated 1.117-1.121 BPB. XSA_LAST_N=3 excludes self-attention in the final 3 layers: zero extra params, and it frees attention capacity for cross-token focus.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
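The commit is terse, so the exact XSA mechanism is an assumption here: one plausible reading, given "zero extra params" and "frees attention capacity for cross-token focus", is that XSA masks out each token's own (diagonal) position in the last XSA_LAST_N layers, forcing all attention mass onto other tokens. A minimal mask sketch under that assumption:

```python
import torch

def xsa_mask(seq_len: int) -> torch.Tensor:
    """Causal mask that also excludes each token's own position
    (the diagonal), forcing attention onto cross-token positions.

    Note: position 0 ends up with no allowed positions; a real
    implementation would need to special-case the first token.
    """
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    return causal & ~torch.eye(seq_len, dtype=torch.bool)
```

Because this is purely a masking change, it adds no parameters, consistent with the commit's claim.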
newjordan pushed a commit that referenced this pull request on Mar 26, 2026
Three variants targeting the 0.187 BPB gap to #1:
- bwing_alpha: clip 0.95, alpha 0.05-0.60 (isolate the alpha curve)
- bwing_entropy_shift: per-order entropy center shift (isolate)
- bwing_full_port: all #809 techniques + fixed order mults (fire first)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
newjordan pushed a commit that referenced this pull request on Mar 27, 2026
Three improvements over Green v1 (baseline: sliding=1.1129, ngram9=0.4489):

PR #931 (packed training oracle): after training, reads 2 train shards (~200M tokens) and seeds the eval n-gram tables before val token #1. Eliminates the cold-start penalty where early val chunks score against an empty cache. Legal: the oracle is training-data-only; eval remains single-pass causal.

PR #900 (Dirichlet smoothing): replaces linear alpha mixing with

    p = (ng_count + c * neural_p) / (ctx_count + c)

Count-sensitive weighting: high-count matches trust the n-gram, low-count matches stay close to the neural prior. No hand-tuned per-order alpha needed. NGRAM_EVAL_MIN_COUNT=1 (the formula handles low counts naturally).

PR #859 (matrix_lr): MATRIX_LR=0.03 vs. 0.025 in Green; the higher LR was found across a 79-experiment sweep and trains a stronger base model.

Both new features are independent toggles (ARTIFACT_NGRAM, NGRAM_DIRICHLET) for A/B isolation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
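The PR #900 mixing rule transcribes directly into code. Variable names follow the commit message; this is a sketch of the formula only, not of the surrounding n-gram machinery:

```python
def dirichlet_mix(ng_count: float, ctx_count: float,
                  neural_p: float, c: float) -> float:
    """Dirichlet-smoothed mix of n-gram counts and a neural prior:

        p = (ng_count + c * neural_p) / (ctx_count + c)

    With ctx_count = 0 this reduces to the neural prior; as counts grow,
    the n-gram relative frequency ng_count / ctx_count dominates,
    which is exactly the count-sensitive behavior described above.
    """
    return (ng_count + c * neural_p) / (ctx_count + c)
```

The pseudo-count c plays the role that a hand-tuned per-order alpha played under linear mixing: one scalar controls how many real observations it takes to outweigh the prior.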
Summary
Experimental submission: replace the baseline's 9 unique transformer layers (512d) with 3 shared layers looped 3 times at 864d, with two novel mechanisms:
- Learned gravity auxiliary losses at loop boundaries
- AttnRes replacing U-Net skip connections (plus resid_mix)
Local Results (DGX Spark GB10, 300 steps, 1 shard)
7% BPB improvement over baseline.
Note: local runs use AdamW (not Muon) and no torch.compile, at only 300 steps. These numbers are directional only; a cloud 8×H100 run is needed for official BPB.
Status
Pending cloud validation on 8×H100.
Key Insight
Weight sharing frees parameter budget for wider layers. 864d shared layers significantly outperform 512d unique layers even with fewer total parameters. Depth recurrence is explicitly listed as a promising direction in the challenge README.
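The "fewer total parameters" claim can be checked with rough per-layer counts. A back-of-envelope sketch, assuming a standard block of roughly 12·d² weights (4·d² for the Q/K/V/O projections plus 8·d² for a 4×-hidden MLP, ignoring biases, norms, and embeddings):

```python
def block_params(d: int) -> int:
    # ~4*d^2 for Q,K,V,O projections + ~8*d^2 for a 4x-hidden MLP
    return 12 * d * d

baseline = 9 * block_params(512)   # 9 unique 512d layers
fractal  = 3 * block_params(864)   # 3 shared 864d layers, looped 3x
```

Under these assumptions the baseline holds about 28.3M block weights and the fractal stack about 26.9M, so the shared 864d configuration is indeed slightly smaller while being much wider per layer.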
References