Fractal Transformer + Gravity + AttnRes — Pending Cloud Validation #1

Open
newjordan wants to merge 3 commits into main from experiments/fractal-attnres-gravity

Conversation

@newjordan
Owner

Summary

Experimental submission: replace 9 unique transformer layers with 3 shared layers × 3 loops at 864d (vs 512d baseline), with two novel mechanisms:

  1. Gravity — learned auxiliary losses at each loop boundary for direct supervision
  2. AttnRes — attention over previous loop outputs replacing U-Net skips (from Moonshot's paper)
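
A minimal numpy sketch of how the two mechanisms could fit around the shared-layer loop. The attention form, aux-loss placement, and all shapes here are assumptions for illustration; the PR's actual implementation is not shown in this thread:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fractal_forward(x, shared_layers, n_loops=3):
    """Apply the same small stack of shared layers n_loops times,
    recording the hidden state at each loop boundary."""
    loop_outs = []
    h = x
    for _ in range(n_loops):
        for layer in shared_layers:
            h = layer(h)
        loop_outs.append(h)
    return h, loop_outs

def attn_res(h, loop_outs):
    """AttnRes sketch: each position attends over its own states from
    earlier loops (replacing a fixed U-Net skip connection)."""
    stack = np.stack(loop_outs, axis=0)                         # (n_loops, T, d)
    scores = np.einsum('td,ltd->lt', h, stack) / np.sqrt(h.shape[-1])
    w = softmax(scores, axis=0)                                 # weights over loops
    return h + np.einsum('lt,ltd->td', w, stack)

# Toy run: 3 shared "layers" (simple tanh affine maps), 3 loops, d=8.
rng = np.random.default_rng(0)
layers = [(lambda W: (lambda h: np.tanh(h @ W)))(0.1 * rng.standard_normal((8, 8)))
          for _ in range(3)]
x = rng.standard_normal((4, 8))                                 # (seq_len, d)
h, outs = fractal_forward(x, layers)
y = attn_res(h, outs)
# "Gravity": an auxiliary loss read off at each loop boundary
# (here a stand-in L2 penalty; the real aux heads are learned).
aux_losses = [float((o ** 2).mean()) for o in outs]
```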

Local Results (DGX Spark GB10, 300 steps, 1 shard)

| Config | val_bpb | Δ vs baseline |
| --- | --- | --- |
| Baseline (9 unique, 512d) | 2.7927 | |
| Fractal only (3×3, 864d) | 2.5953 | -0.1975 |
| Fractal + Gravity | 2.6149 | -0.1779 |
| Fractal + Gravity + AttnRes | 2.6084 | -0.1843 |

Note: Local runs use AdamW (not Muon), no torch.compile, 300 steps. These are directional only — cloud 8×H100 run needed for official BPB.

Status

  • Local prototype working
  • 4-experiment ladder run
  • Port to official train_gpt.py with Muon + torch.compile
  • 8×H100 10-minute official run
  • Verify artifact ≤16MB

Key Insight

Weight sharing frees parameter budget for wider layers. 864d shared layers significantly outperform 512d unique layers even with fewer total parameters. Depth recurrence is explicitly listed as a promising direction in the challenge README.
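
The parameter arithmetic can be checked with the standard rough estimate of ~12·d² weights per transformer block (4·d² for Q/K/V/O attention projections plus 8·d² for a 4× MLP, ignoring norms, biases, and embeddings):

```python
def layer_params(d):
    """Approximate trainable parameters in one transformer block:
    attention projections (4*d^2) + 4x-expansion MLP (8*d^2)."""
    return 12 * d * d

baseline = 9 * layer_params(512)   # 9 unique layers at 512d
fractal  = 3 * layer_params(864)   # 3 shared layers at 864d, looped 3x

print(baseline, fractal)  # 28311552 26873856
```

Under this estimate the 3×864d shared stack really does use fewer parameters than 9×512d unique layers, while running 9 layers of effective depth.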

Octavian added 3 commits March 18, 2026 18:06
…alidation

Architecture: 3 shared layers x 3 loops, 864d (vs baseline 9 unique, 512d)
- Learned gravity auxiliary losses at loop boundaries
- AttnRes replacing U-Net skips + resid_mix
- Local results: 7% BPB improvement over baseline (300 steps, directional only)
- Needs 8xH100 run for official numbers
newjordan pushed a commit that referenced this pull request Mar 20, 2026
His 0.9695 BPB is val-only training (separate track). Standard score is
1.1629, close to Larson's 1.1574. No novel architecture — wins with
tuning: stride-64 sliding window, seq_len=4096, mixed int6/int8 quant,
Muon momentum=0.99. Crucially, he uses NO weight sharing, meaning our
fractal approach is an orthogonal improvement on top of his full stack.

https://claude.ai/code/session_01RtoPPgJGUFS7XfcFCPwYtq
newjordan pushed a commit that referenced this pull request Mar 20, 2026
New features:
- MUON_WD: decoupled weight decay for Muon optimizer (from #1 entry)
- FP16_EMBED: keep tok_emb in fp16 instead of int6 (from #1 entry)
- Both controlled via env vars, off by default

Leaderboard chase script tests stride=64, fp16 embed, muon WD, and
full stack on MLP 3× base. All runs use early QAT (25%).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan pushed a commit that referenced this pull request Mar 20, 2026
SWA: average model weights every 50 steps over last 50% of training.
Smoother weight distributions quantize better. Controlled via
SWA_EVERY and SWA_START_FRAC env vars.
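
A scalar sketch of the SWA schedule described above (a single float stands in for the full parameter tensors; real code averages every weight the same way):

```python
def swa_checkpoints(num_steps, every=50, start_frac=0.5):
    """Steps whose weights enter the SWA average: every `every` steps
    over the last `start_frac` fraction of training."""
    start = int(num_steps * start_frac)
    return [s for s in range(1, num_steps + 1) if s >= start and s % every == 0]

def swa_average(weights_at_step, steps):
    """Uniform running average of the selected checkpoints."""
    avg, n = 0.0, 0
    for s in steps:
        n += 1
        avg += (weights_at_step(s) - avg) / n
    return avg

steps = swa_checkpoints(1000)                    # 500, 550, ..., 1000
w_avg = swa_average(lambda s: float(s), steps)   # stand-in "weight" = step number
```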

MuonWD bumped from 0.01 to 0.04 (PR 162 swept 0.01-0.05, optimal at 0.04).
MLP_HIDDEN=1408 to fit under 16MB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan pushed a commit that referenced this pull request Mar 21, 2026
#1 untried combination from competition commentary:
TTT (from #254) + XSA (from #265) = estimated 1.117-1.121 BPB
XSA_LAST_N=3 excludes self-attention in final 3 layers.
Zero extra params, frees attention capacity for cross-token focus.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan pushed a commit that referenced this pull request Mar 22, 2026
Copy of pr374_safe — EMA(0.997) + Tight SWA + QAT(0.15) + warmdown(3500).
3-seed mean 1.1248, best seed 1.1243. #2 overall, #1 non-TTT on the leaderboard.
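
For reference, the EMA(0.997) component is the standard exponential moving average of weights, sketched here with a scalar stand-in (the real run applies it per-parameter and evaluates/quantizes the shadow copy):

```python
def ema_update(ema, w, decay=0.997):
    """Shadow-weight update: ema <- decay*ema + (1-decay)*w."""
    return decay * ema + (1.0 - decay) * w

ema = 0.0
for step in range(5000):
    w = 1.0                      # stand-in for the current weight value
    ema = ema_update(ema, w)     # converges slowly toward w
```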

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan pushed a commit that referenced this pull request Mar 23, 2026
#1 untried combination from competition commentary:
TTT (from #254) + XSA (from #265) = estimated 1.117-1.121 BPB
XSA_LAST_N=3 excludes self-attention in final 3 layers.
Zero extra params, frees attention capacity for cross-token focus.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan pushed a commit that referenced this pull request Mar 23, 2026
Copy of pr374_safe — EMA(0.997) + Tight SWA + QAT(0.15) + warmdown(3500).
3-seed mean 1.1248, best seed 1.1243. #2 overall, #1 non-TTT on the leaderboard.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan pushed a commit that referenced this pull request Mar 23, 2026
#1 untried combination from competition commentary:
TTT (from #254) + XSA (from #265) = estimated 1.117-1.121 BPB
XSA_LAST_N=3 excludes self-attention in final 3 layers.
Zero extra params, frees attention capacity for cross-token focus.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan pushed a commit that referenced this pull request Mar 23, 2026
Copy of pr374_safe — EMA(0.997) + Tight SWA + QAT(0.15) + warmdown(3500).
3-seed mean 1.1248, best seed 1.1243. #2 overall, #1 non-TTT on the leaderboard.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan pushed a commit that referenced this pull request Mar 26, 2026
Three variants targeting the 0.187 BPB gap to #1:
- bwing_alpha: clip 0.95, alpha 0.05-0.60 (isolate alpha curve)
- bwing_entropy_shift: per-order entropy center shift (isolate)
- bwing_full_port: all #809 techniques + fixed order mults (fire first)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
newjordan pushed a commit that referenced this pull request Mar 27, 2026
Three improvements over Green v1 (baseline: sliding=1.1129, ngram9=0.4489):

PR #931 (packed training oracle): after training, reads 2 train shards
(~200M tokens) and seeds eval n-gram tables before val token #1.
Eliminates cold-start penalty where early val chunks score with empty cache.
Legal: oracle is training-data-only, eval remains single-pass causal.

PR #900 (Dirichlet smoothing): replaces linear alpha mixing with
  p = (ng_count + c * neural_p) / (ctx_count + c)
Count-sensitive weighting: high-count matches trust n-gram, low-count
matches stay close to neural prior. No hand-tuned alpha per-order needed.
NGRAM_EVAL_MIN_COUNT=1 (formula handles low counts naturally).
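
The mixing rule follows directly from the formula above; a minimal sketch (the pseudocount `c` is a hypothetical value here, the commit does not state it):

```python
def dirichlet_mix(ng_count, ctx_count, neural_p, c=8.0):
    """Count-sensitive blend of n-gram and neural probabilities:
    p = (ng_count + c * neural_p) / (ctx_count + c).
    Zero context counts fall back exactly to the neural prior;
    large counts approach the empirical n-gram estimate."""
    return (ng_count + c * neural_p) / (ctx_count + c)

print(dirichlet_mix(0, 0, 0.25))        # cold start -> 0.25 (neural prior)
print(dirichlet_mix(900, 1000, 0.25))   # high counts -> ~0.8948 (near 900/1000)
```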

PR #859 (matrix_lr): MATRIX_LR=0.03 vs 0.025 in Green — higher LR
found across 79-experiment sweep to train stronger base model.

Both new features are independent toggles (ARTIFACT_NGRAM, NGRAM_DIRICHLET)
for A/B isolation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>