Fractal Transformer + Gravity + AttnRes — Pending Cloud Validation by newjordan · Pull Request #1 · newjordan/parameter-golf

newjordan · 2026-03-19T00:48:16Z

Summary

Experimental submission: replace 9 unique transformer layers with 3 shared layers × 3 loops at 864d (vs 512d baseline), with two novel mechanisms:

Gravity — learned auxiliary losses at each loop boundary for direct supervision
AttnRes — attention over previous loop outputs replacing U-Net skips (from Moonshot's paper)

Local Results (DGX Spark GB10, 300 steps, 1 shard)

Config	val_bpb	Δ vs baseline
Baseline (9 unique, 512d)	2.7927	—
Fractal only (3×3, 864d)	2.5953	-0.1975
Fractal + Gravity	2.6149	-0.1779
Fractal + Gravity + AttnRes	2.6084	-0.1843

Note: Local runs use AdamW (not Muon), no torch.compile, 300 steps. These are directional only — cloud 8×H100 run needed for official BPB.

Status

Local prototype working
4-experiment ladder run
Port to official train_gpt.py with Muon + torch.compile
8×H100 10-minute official run
Verify artifact ≤16MB

Key Insight

Weight sharing frees parameter budget for wider layers. 864d shared layers significantly outperform 512d unique layers even with fewer total parameters. Depth recurrence is explicitly listed as a promising direction in the challenge README.

References

…AttnRes

… gravity needs more steps

…alidation Architecture: 3 shared layers x 3 loops, 864d (vs baseline 9 unique, 512d) - Learned gravity auxiliary losses at loop boundaries - AttnRes replacing U-Net skips + resid_mix - Local results: 7% BPB improvement over baseline (300 steps, directional only) - Needs 8xH100 run for official numbers

His 0.9695 BPB is val-only training (separate track). Standard score is 1.1629, close to Larson's 1.1574. No novel architecture — wins with tuning: stride-64 sliding window, seq_len=4096, mixed int6/int8 quant, Muon momentum=0.99. Crucially, he uses NO weight sharing, meaning our fractal approach is an orthogonal improvement on top of his full stack. https://claude.ai/code/session_01RtoPPgJGUFS7XfcFCPwYtq

New features: - MUON_WD: decoupled weight decay for Muon optimizer (from #1 entry) - FP16_EMBED: keep tok_emb in fp16 instead of int6 (from #1 entry) - Both controlled via env vars, off by default Leaderboard chase script tests stride=64, fp16 embed, muon WD, and full stack on MLP 3× base. All runs use early QAT (25%). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

SWA: average model weights every 50 steps over last 50% of training. Smoother weight distributions quantize better. Controlled via SWA_EVERY and SWA_START_FRAC env vars. MuonWD bumped from 0.01 to 0.04 (PR 162 swept 0.01-0.05, optimal at 0.04). MLP_HIDDEN=1408 to fit under 16MB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

#1 untried combination from competition commentary: TTT (from #254) + XSA (from #265) = estimated 1.117-1.121 BPB XSA_LAST_N=3 excludes self-attention in final 3 layers. Zero extra params, frees attention capacity for cross-token focus. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copy of pr374_safe — EMA(0.997) + Tight SWA + QAT(0.15) + warmdown(3500). 3-seed mean 1.1248, best seed 1.1243. #2 overall, #1 non-TTT on the leaderboard. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

#1 untried combination from competition commentary: TTT (from #254) + XSA (from #265) = estimated 1.117-1.121 BPB XSA_LAST_N=3 excludes self-attention in final 3 layers. Zero extra params, frees attention capacity for cross-token focus. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copy of pr374_safe — EMA(0.997) + Tight SWA + QAT(0.15) + warmdown(3500). 3-seed mean 1.1248, best seed 1.1243. #2 overall, #1 non-TTT on the leaderboard. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

#1 untried combination from competition commentary: TTT (from #254) + XSA (from #265) = estimated 1.117-1.121 BPB XSA_LAST_N=3 excludes self-attention in final 3 layers. Zero extra params, frees attention capacity for cross-token focus. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copy of pr374_safe — EMA(0.997) + Tight SWA + QAT(0.15) + warmdown(3500). 3-seed mean 1.1248, best seed 1.1243. #2 overall, #1 non-TTT on the leaderboard. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Three variants targeting the 0.187 BPB gap to #1: - bwing_alpha: clip 0.95, alpha 0.05-0.60 (isolate alpha curve) - bwing_entropy_shift: per-order entropy center shift (isolate) - bwing_full_port: all #809 techniques + fixed order mults (fire first) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Three improvements over Green v1 (baseline: sliding=1.1129, ngram9=0.4489): PR #931 (packed training oracle): after training, reads 2 train shards (~200M tokens) and seeds eval n-gram tables before val token #1. Eliminates cold-start penalty where early val chunks score with empty cache. Legal: oracle is training-data-only, eval remains single-pass causal. PR #900 (Dirichlet smoothing): replaces linear alpha mixing with p = (ng_count + c * neural_p) / (ctx_count + c) Count-sensitive weighting: high-count matches trust n-gram, low-count matches stay close to neural prior. No hand-tuned alpha per-order needed. NGRAM_EVAL_MIN_COUNT=1 (formula handles low counts naturally). PR #859 (matrix_lr): MATRIX_LR=0.03 vs 0.025 in Green — higher LR found across 79-experiment sweep to train stronger base model. Both new features are independent toggles (ARTIFACT_NGRAM, NGRAM_DIRICHLET) for A/B isolation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Octavian added 3 commits March 18, 2026 18:06

docs: fractal transformer research plan — weight sharing + gravity + …

6e503d9

…AttnRes

results: first local ladder — fractal 3x3 beats baseline by 7.1% BPB,…

73271f3

… gravity needs more steps

newjordan force-pushed the main branch from ecdcedc to ebda3af Compare March 23, 2026 21:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fractal Transformer + Gravity + AttnRes — Pending Cloud Validation#1

Fractal Transformer + Gravity + AttnRes — Pending Cloud Validation#1
newjordan wants to merge 3 commits intomainfrom
experiments/fractal-attnres-gravity

newjordan commented Mar 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

newjordan commented Mar 19, 2026

Summary

Local Results (DGX Spark GB10, 300 steps, 1 shard)

Status

Key Insight

References

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant