
The Frugendorff: Recursive Weight Sharing + MLP 4x (1.1478 BPB, 15.19MB)#499

Closed
newjordan wants to merge 5 commits into openai:main from newjordan:submission/frugendorff-clean



@newjordan newjordan commented Mar 23, 2026


The Frugendorff

6 unique transformer blocks, each applied twice in sequence, yielding 12 effective layers from 6 stored parameter sets. The parameter savings are redirected to a 4x MLP expansion.
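The looping scheme can be sketched as follows. This is a toy stand-in, not the submission's code: `block` here is a single residual map with the relu-squared activation, and all shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1337)

def block(x, w):
    # Toy stand-in for one transformer block: residual + relu-squared nonlinearity.
    h = x @ w
    return x + np.maximum(h, 0.0) ** 2

def forward(x, weights, loops_per_block=2):
    # 6 unique parameter sets, each applied `loops_per_block` times in
    # sequence -> 6 * 2 = 12 effective layers from 6 stored blocks.
    for w in weights:
        for _ in range(loops_per_block):
            x = block(x, w)
    return x

dim = 8
weights = [rng.standard_normal((dim, dim)) * 0.01 for _ in range(6)]
x = rng.standard_normal((1, dim))
y = forward(x, weights)  # 12 effective block applications
```

The result is identical to running a 12-layer stack whose weight list is each of the 6 blocks repeated twice, which is where the parameter savings come from.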

val_bpb: 1.1478 | 15.19 MB | 8xH100 SXM, 600s, seed 1337

Architecture

| Component | Value |
| --- | --- |
| Unique blocks | 6 |
| Loops per block | 2 |
| Effective depth | 12 |
| Dimension | 640 |
| Heads / KV heads | 10 / 5 (GQA) |
| MLP expansion | 4x (hidden 2560) |
| Activation | relu-squared |
| Parameters | 28.2M |

Orthogonal loop position embeddings (QR-initialized). U-Net skip connections within each loop iteration. SmearGate, BigramHash, shared value embeddings, XSA on last 2 blocks. Tied embeddings, logit softcap 30.
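One way to realize QR-initialized orthogonal loop position embeddings: draw a random Gaussian matrix and orthonormalize it with a QR decomposition, giving one unit-norm, mutually orthogonal vector per loop iteration so the model can tell the first pass through a block from the second. This is a hedged sketch of the initialization only; the submission's exact usage may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_loops = 640, 2  # model dimension, loops per block (from the table above)

# QR of a random Gaussian matrix yields orthonormal columns.
a = rng.standard_normal((dim, n_loops))
q, _ = np.linalg.qr(a)

# One embedding per loop iteration; rows are orthonormal by construction.
loop_emb = q.T  # shape (n_loops, dim)

gram = loop_emb @ loop_emb.T  # identity matrix if the rows are orthonormal
```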

Training Pipeline

Muon (matrices) + AdamW (embeddings, scalars) · SWA · Late QAT (int6, scale < 0.15) · Training data replay (2 epochs, last 100 batches) · Self-distillation (EMA teacher, 50 steps) · EMA application · int6 + zstd export
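The "int6, scale < 0.15" quantization step plausibly means symmetric round-trip quantization to the int6 range [-32, 31], applied only to tensors whose scale stays below the threshold. A minimal sketch under that assumption (function name and fallback behavior are hypothetical):

```python
import numpy as np

def quantize_int6(w, max_scale=0.15):
    # Symmetric int6 quantization sketch: int6 spans [-32, 31].
    # Tensors whose scale would exceed `max_scale` are kept in full
    # precision (hypothetical reading of the "scale < 0.15" gate).
    scale = np.abs(w).max() / 31.0
    if scale == 0.0 or scale >= max_scale:
        return w, None
    q = np.clip(np.round(w / scale), -32, 31)
    return q * scale, scale  # dequantized round-trip and the scale used

rng = np.random.default_rng(1)
w = rng.standard_normal((4, 4)) * 0.05
w_hat, scale = quantize_int6(w)
err = np.abs(w - w_hat).max()  # bounded by scale / 2 (rounding error)
```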

Results

| Metric | Value |
| --- | --- |
| Sliding window BPB (stride 64) | 1.1478 |
| Pre-quant BPB (post-EMA) | 1.1572 |
| Post-quant roundtrip BPB | 1.1716 |
| Artifact size | 15,192,793 bytes |
| Training steps | 4,396 |
| Step time | 136.5 ms |
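Stride-64 sliding-window BPB typically means scoring each stride of tokens with long left context and converting the summed NLL to bits per byte. A toy sketch with a hypothetical `score_fn` standing in for the model (here a uniform 256-symbol scorer, so the answer is exactly 8 bits per byte):

```python
import math

def sliding_window_nll(score_fn, tokens, window, stride=64):
    # Slide a fixed-size window over the sequence; each step scores only the
    # newest `stride` tokens, so every token gets near-full left context.
    total = 0.0
    for start in range(0, len(tokens), stride):
        lo = max(0, start + stride - window)
        ctx = tokens[lo:start + stride]          # context fed to the model
        n_new = min(stride, len(tokens) - start)  # tokens scored this step
        total += score_fn(ctx, n_new)
    return total

# Toy scorer: uniform distribution over 256 byte values -> log(256) nats/token.
uniform = lambda ctx, n: n * math.log(256)

tokens = list(range(200))
nll_nats = sliding_window_nll(uniform, tokens, window=128)
bpb = nll_nats / (math.log(2) * len(tokens))  # nats -> bits, per byte
```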

Observations

A few properties of this approach that may warrant further investigation:

  • The shared weights can be evaluated with fewer loops at inference time, trading quality for speed without retraining
  • The parameter efficiency leaves ~0.8 MB of headroom in the artifact budget, which could accommodate additional capacity
  • Alternating between full-depth and reduced-depth forward passes during training appeared to act as a form of regularization, though we have not yet isolated the effect
  • The recursive structure may compose well as a component within conventional architectures, applied selectively to a subset of layers
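The first observation above, inference with fewer loops, falls out of the weight sharing for free: the loop count is an inference-time knob, not a stored parameter. A toy sketch (same illustrative block as before, not the submission's code):

```python
import numpy as np

rng = np.random.default_rng(1337)
dim = 8

def block(x, w):
    # Toy residual block with relu-squared nonlinearity.
    h = x @ w
    return x + np.maximum(h, 0.0) ** 2

def forward(x, weights, loops_per_block):
    # Because weights are shared across loops, `loops_per_block` can be
    # lowered at inference time without touching the stored parameters.
    for w in weights:
        for _ in range(loops_per_block):
            x = block(x, w)
    return x

weights = [rng.standard_normal((dim, dim)) * 0.01 for _ in range(6)]
x = rng.standard_normal((1, dim))

y_full = forward(x, weights, loops_per_block=2)  # 12 effective layers (training config)
y_fast = forward(x, weights, loops_per_block=1)  # 6 effective layers, roughly half the compute
```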

Compliance

No test-time training on validation data. Training replay and self-distillation operate on training data only.
