
The Frugendorff: Recursive Weight Sharing + MLP 4x (1.1478 BPB, 15.19MB)#499

Closed
newjordan wants to merge 5 commits into openai:main from newjordan:submission/frugendorff-clean



@newjordan newjordan commented Mar 23, 2026


The Frugendorff

6 unique transformer blocks, each applied twice in sequence, yielding 12 effective layers from 6 stored parameter sets. The parameter savings are redirected to a 4x MLP expansion.
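The looping scheme can be sketched as follows. This is a toy stand-in, not the submission's code: `block` here is a single residual map with the relu-squared activation, and all shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1337)

def block(x, w):
    # Toy stand-in for one transformer block: residual + relu-squared nonlinearity.
    h = x @ w
    return x + np.maximum(h, 0.0) ** 2

def forward(x, weights, loops_per_block=2):
    # 6 unique parameter sets, each applied `loops_per_block` times in
    # sequence -> 6 * 2 = 12 effective layers from 6 stored blocks.
    for w in weights:
        for _ in range(loops_per_block):
            x = block(x, w)
    return x

dim = 8
weights = [rng.standard_normal((dim, dim)) * 0.01 for _ in range(6)]
x = rng.standard_normal((1, dim))
y = forward(x, weights)  # 12 effective block applications
```

The result is identical to running a 12-layer stack whose weight list is each of the 6 blocks repeated twice, which is where the parameter savings come from.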

val_bpb: 1.1478 | 15.19 MB | 8xH100 SXM, 600s, seed 1337

Architecture

| Component | Value |
| --- | --- |
| Unique blocks | 6 |
| Loops per block | 2 |
| Effective depth | 12 |
| Dimension | 640 |
| Heads / KV heads | 10 / 5 (GQA) |
| MLP expansion | 4x (hidden 2560) |
| Activation | relu-squared |
| Parameters | 28.2M |

Orthogonal loop position embeddings (QR-initialized). U-Net skip connections within each loop iteration. SmearGate, BigramHash, shared value embeddings, XSA on last 2 blocks. Tied embeddings, logit softcap 30.
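One way to realize QR-initialized orthogonal loop position embeddings: draw a random Gaussian matrix and orthonormalize it with a QR decomposition, giving one unit-norm, mutually orthogonal vector per loop iteration so the model can tell the first pass through a block from the second. This is a hedged sketch of the initialization only; the submission's exact usage may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_loops = 640, 2  # model dimension, loops per block (from the table above)

# QR of a random Gaussian matrix yields orthonormal columns.
a = rng.standard_normal((dim, n_loops))
q, _ = np.linalg.qr(a)

# One embedding per loop iteration; rows are orthonormal by construction.
loop_emb = q.T  # shape (n_loops, dim)

gram = loop_emb @ loop_emb.T  # identity matrix if the rows are orthonormal
```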

Training Pipeline

Muon (matrices) + AdamW (embeddings, scalars) · SWA · Late QAT (int6, scale < 0.15) · Training data replay (2 epochs, last 100 batches) · Self-distillation (EMA teacher, 50 steps) · EMA application · int6 + zstd export
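The "int6, scale < 0.15" quantization step plausibly means symmetric round-trip quantization to the int6 range [-32, 31], applied only to tensors whose scale stays below the threshold. A minimal sketch under that assumption (function name and fallback behavior are hypothetical):

```python
import numpy as np

def quantize_int6(w, max_scale=0.15):
    # Symmetric int6 quantization sketch: int6 spans [-32, 31].
    # Tensors whose scale would exceed `max_scale` are kept in full
    # precision (hypothetical reading of the "scale < 0.15" gate).
    scale = np.abs(w).max() / 31.0
    if scale == 0.0 or scale >= max_scale:
        return w, None
    q = np.clip(np.round(w / scale), -32, 31)
    return q * scale, scale  # dequantized round-trip and the scale used

rng = np.random.default_rng(1)
w = rng.standard_normal((4, 4)) * 0.05
w_hat, scale = quantize_int6(w)
err = np.abs(w - w_hat).max()  # bounded by scale / 2 (rounding error)
```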

Results

| Metric | Value |
| --- | --- |
| Sliding window BPB (stride 64) | 1.1478 |
| Pre-quant BPB (post-EMA) | 1.1572 |
| Post-quant roundtrip BPB | 1.1716 |
| Artifact size | 15,192,793 bytes |
| Training steps | 4,396 |
| Step time | 136.5 ms |
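Stride-64 sliding-window BPB typically means scoring each stride of tokens with long left context and converting the summed NLL to bits per byte. A toy sketch with a hypothetical `score_fn` standing in for the model (here a uniform 256-symbol scorer, so the answer is exactly 8 bits per byte):

```python
import math

def sliding_window_nll(score_fn, tokens, window, stride=64):
    # Slide a fixed-size window over the sequence; each step scores only the
    # newest `stride` tokens, so every token gets near-full left context.
    total = 0.0
    for start in range(0, len(tokens), stride):
        lo = max(0, start + stride - window)
        ctx = tokens[lo:start + stride]          # context fed to the model
        n_new = min(stride, len(tokens) - start)  # tokens scored this step
        total += score_fn(ctx, n_new)
    return total

# Toy scorer: uniform distribution over 256 byte values -> log(256) nats/token.
uniform = lambda ctx, n: n * math.log(256)

tokens = list(range(200))
nll_nats = sliding_window_nll(uniform, tokens, window=128)
bpb = nll_nats / (math.log(2) * len(tokens))  # nats -> bits, per byte
```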

Observations

A few properties of this approach that may warrant further investigation:

  • The shared weights can be evaluated with fewer loops at inference time, trading quality for speed without retraining
  • The parameter efficiency leaves ~0.8 MB of headroom in the artifact budget, which could accommodate additional capacity
  • Alternating between full-depth and reduced-depth forward passes during training appeared to act as a form of regularization, though we have not yet isolated the effect
  • The recursive structure may compose well as a component within conventional architectures, applied selectively to a subset of layers
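The first observation above, inference with fewer loops, falls out of the weight sharing for free: the loop count is an inference-time knob, not a stored parameter. A toy sketch (same illustrative block as before, not the submission's code):

```python
import numpy as np

rng = np.random.default_rng(1337)
dim = 8

def block(x, w):
    # Toy residual block with relu-squared nonlinearity.
    h = x @ w
    return x + np.maximum(h, 0.0) ** 2

def forward(x, weights, loops_per_block):
    # Because weights are shared across loops, `loops_per_block` can be
    # lowered at inference time without touching the stored parameters.
    for w in weights:
        for _ in range(loops_per_block):
            x = block(x, w)
    return x

weights = [rng.standard_normal((dim, dim)) * 0.01 for _ in range(6)]
x = rng.standard_normal((1, dim))

y_full = forward(x, weights, loops_per_block=2)  # 12 effective layers (training config)
y_fast = forward(x, weights, loops_per_block=1)  # 6 effective layers, roughly half the compute
```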

Compliance

No test-time training on validation data. Training replay and self-distillation operate on training data only.
