
The Frugendorff: Recursive Weight Sharing Research — Cadence Laws + Challenges (1.1325 BPB)#579

Open
newjordan wants to merge 6 commits into openai:main from newjordan:submission/frugendorff-research

Conversation


@newjordan newjordan commented Mar 23, 2026


The Frugendorff: Recursive Weight Sharing Under Extreme Compression

Research submission documenting the full arc of fractal weight sharing — from the original Frugendorff (6x2 symmetric, 1.1478) through the Micro Crawler (4f+2cx2, 1.1325), including systematic ablations that map the tradeoffs of recursion under compression.

This is not a SOTA submission. It is a research direction exploring whether recursive weight reuse can improve compute-per-byte efficiency in size-constrained language models.

Best Result

  • val_bpb: 1.1325 (sliding window, stride 64) — Micro Crawler, cad0 (no double-firing)
  • val_bpb: 1.1355 — Micro Crawler + bidirectional PD, cad2
  • val_bpb: 1.1478 — Original Frugendorff Squared (6x2 uniform)
  • 8xH100 SXM, 600s, seed 1337

Architecture Evolution

Stage 1: Frugendorff Squared (1.1478 BPB)

6 unique blocks x 2 loops = 12 effective depth. All blocks shared. MLP 4x enabled by parameter savings.

| Component | Detail |
| --- | --- |
| Recursive blocks | 6 unique x 2 loops = 12 effective depth |
| MLP expansion | 4x (hidden 2560) |
| Loop positions | QR-initialized orthogonal vectors |
| Attention | GQA (10H/5KV), XSA last 2 |
| Parameters | 28.2M stored, 15.15MB artifact |
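As a sketch of the recursion (not the production code), the 6x2 layout with QR-initialized loop-position vectors might look like the following. The block bodies are simplified MLP stand-ins, the real model's GQA/XSA attention is omitted, and the class and attribute names are illustrative:

```python
import torch
import torch.nn as nn

class FrugendorffStack(nn.Module):
    """Sketch: n_blocks unique blocks reused over n_loops passes
    (6 x 2 = 12 effective layers). dim=640 gives the 4x hidden
    size of 2560 listed in the table."""
    def __init__(self, dim=640, n_blocks=6, n_loops=2):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                          nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_blocks)
        )
        # One learned vector per loop so shared blocks can tell which
        # pass they are on; QR gives an orthogonal initialization.
        q, _ = torch.linalg.qr(torch.randn(dim, n_loops))
        self.loop_pos = nn.Parameter(q.T.clone())  # (n_loops, dim)
        self.n_loops = n_loops

    def forward(self, x):  # x: (batch, seq, dim)
        for loop in range(self.n_loops):
            x = x + self.loop_pos[loop]
            for block in self.blocks:
                x = x + block(x)  # residual; same weights every loop
        return x
```

Parameters are stored once per unique block, so effective depth doubles while the artifact stores only the 6-block stack plus the loop vectors.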

Stage 2: Micro Crawler (1.1377 -> 1.1325 BPB)

Asymmetric sharing — 4 unique flat blocks + 2 shared crawler blocks. Flat section trains with zero gradient conflict. Only the crawler pair shares weights.

| Component | Detail |
| --- | --- |
| Flat blocks | 4 unique, run once |
| Crawler blocks | 2 shared x 2 loops |
| Effective depth | 8 (4 flat + 2x2 crawler) |
| Quantization | GPTQ Hessian-aware |
| Parameters | 29.8M stored, ~16.5MB artifact |
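A minimal sketch of the asymmetric layout: four unique "flat" blocks that each fire once (so they see no gradient conflict), followed by a two-block crawler that loops. Names and block internals are illustrative stand-ins:

```python
import torch
import torch.nn as nn

def mlp(dim):
    """Simplified stand-in for a transformer block."""
    return nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                         nn.GELU(), nn.Linear(4 * dim, dim))

class MicroCrawler(nn.Module):
    """Sketch of the 4f+2cx2 layout: only the crawler pair shares
    weights across loops; the flat section runs exactly once."""
    def __init__(self, dim=640, n_flat=4, n_crawl=2, n_loops=2):
        super().__init__()
        self.flat = nn.ModuleList(mlp(dim) for _ in range(n_flat))
        self.crawl = nn.ModuleList(mlp(dim) for _ in range(n_crawl))
        self.n_loops = n_loops

    def forward(self, x):
        for block in self.flat:        # each fires exactly once
            x = x + block(x)
        for _ in range(self.n_loops):  # only this pair is reused
            for block in self.crawl:
                x = x + block(x)
        return x                       # effective depth 4 + 2*2 = 8
```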

Stage 3: Persistent Deliberation

The crawler's two orthogonal firings deliberate through a learned gate. Critical discovery: the gate needs bidirectional gradient flow, which means `consensus_ref` must be an `nn.Parameter`, not a detached buffer. Gradients then flow IN (loss -> ref) and OUT (ref -> blocks) on every step.

PD showed mid-training advantages (+0.007 BPB ahead at steps 5000-7000), but the gains were fragile under EMA smoothing.
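A hedged sketch of what such a gate could look like; the gating form, shapes, and the name `DeliberationGate` are assumptions, but the key point from the text is shown directly: `consensus_ref` is a trainable `nn.Parameter`, so gradients reach it rather than stopping at a detached buffer.

```python
import torch
import torch.nn as nn

class DeliberationGate(nn.Module):
    """Sketch of a bidirectional PD gate blending the crawler's two
    firings. consensus_ref is a Parameter (not a registered buffer),
    so the loss updates it AND it shapes gradients into both firings."""
    def __init__(self, dim):
        super().__init__()
        self.consensus_ref = nn.Parameter(torch.zeros(dim))
        self.gate = nn.Linear(3 * dim, 1)

    def forward(self, fire1, fire2):  # both: (batch, seq, dim)
        ref = self.consensus_ref.expand_as(fire1)
        a = torch.sigmoid(self.gate(torch.cat([fire1, fire2, ref], dim=-1)))
        return a * fire1 + (1 - a) * fire2
```

With a detached buffer, `consensus_ref.grad` would stay `None`; as a Parameter it receives a gradient on every backward pass.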

Cadence Ablation (H1 + H2)

Systematic sweep of C/N ratio (double-fire vs single-fire steps) across two architectures at 0.25 scale (150s, 8xH100) and full scale (600s).

H1: 4f+2cx2 Cadence Sweep (0.25 scale)

| Cadence | C-step ratio | Steps | val@500 | Sliding BPB | Quant Gap |
| --- | --- | --- | --- | --- | --- |
| 1 (all C) | 100% | 702 | 1.3842 | 1.5092 | 0.136 |
| 2 (C/N) | 50% | 810 | 1.3841 | 1.4222 | 0.081 |
| 3 (C/N/N) | 33% | 854 | 1.3839 | 1.3941 | 0.061 |
| 4 (C/N/N/N) | 25% | 878 | 1.3838 | 1.3836 | 0.059 |
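The cadence rule can be expressed as a simple per-step predicate. This is a sketch inferred from the C-step ratios in the sweep tables; the function name and the assumption that the C-step leads each pattern are illustrative:

```python
def is_double_fire(step, cadence):
    """Sketch of the cadence rule: cadence k double-fires the crawler
    on one step out of every k (the C in C/N/.../N); cadence 0
    disables double-firing entirely (the cad0 configuration)."""
    if cadence == 0:
        return False
    return step % cadence == 0

# C-step ratios over a window divisible by all swept cadences
ratios = {c: sum(is_double_fire(s, c) for s in range(120)) / 120
          for c in (1, 2, 3, 4)}
```

This reproduces the 100% / 50% / 33% / 25% ratios in the table above.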

H1: Full Scale Confirmation (600s, production script)

| Config | Steps | step_avg | post_ema | Sliding BPB | Quant Gap | Peak Memory |
| --- | --- | --- | --- | --- | --- | --- |
| Run 8 (cad2) | 7,076 | 85ms | 1.1535 | 1.1355 | 0.0075 | 33,182 MiB |
| cad0 (no C) | 7,856 | 76ms | 1.1487 | 1.1325 | 0.0070 | 22,854 MiB |

H2: 3f+3cx2 Cadence Sweep (0.25 scale)

| Cadence | Steps | val@500 | Sliding BPB | Quant Gap |
| --- | --- | --- | --- | --- |
| 1 (all C) | 612 | 1.3876 | 1.6007 | 0.196 |
| 2 (C/N) | 738 | 1.3822 | 1.4587 | 0.099 |
| 3 (C/N/N) | 792 | 1.3828 | 1.4211 | 0.078 |
| 4 (C/N/N/N) | 822 | 1.3815 | 1.4030 | 0.066 |

H4: Crawler Bank at U-Net Bottleneck

Shared block at the encoder/decoder bottleneck of GS v7: it learns better per step (+0.016 BPB at step 1500) but loses on final sliding BPB (1.2371 vs 1.2145 control) because it completes 14% fewer steps.

Key Findings

What Works

1. Asymmetric > uniform sharing. 4f+2cx2 beats 6x2 by 0.010 BPB. Isolate gradient conflict to the minimal set of shared blocks.
  2. GPTQ is essential for shared weights. Quant gap drops from 0.0146 -> 0.0070.
  3. MLP 4x is the primary quality driver — weight sharing is the compression technique that enables it.
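GPTQ's Hessian-aware updates are beyond a short sketch, but the round-to-nearest per-row baseline it is being compared against (presumably akin to the earlier "per-row quant" runs) is simple to state. The function and bit width here are illustrative:

```python
import torch

def quantize_per_row_rtn(w, bits=4):
    """Naive round-to-nearest, per-row symmetric quantization: the
    baseline that Hessian-aware GPTQ improves on for shared weights.
    Returns dequantized weights and the per-row scales."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for int4
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale, scale
```

Per-row RTN bounds each element's error by half a scale step; GPTQ instead orders and compensates rounding decisions using second-order (Hessian) information, which matters more when the same weights fire twice per forward pass.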

Challenges with Current Recursion Implementation

The current double-firing mechanism shows a real per-step learning benefit (crawler bank: +0.016 BPB per step) but struggles under wall-clock constraints:

  • Compute cost: C-steps are ~2x FLOP, reducing total steps by 10-20%
  • EMA instability: Double-firing creates weight oscillation EMA can't track (gap: 0.105 at cad1 vs 0.053 at cad4)
  • Quantization sensitivity: Quant gap scales with reuse frequency (0.030 at cad1 -> 0.006 at cad4)
  • val@500 identical across cadences (1.384 +/- 0.0004) — C-steps are neutral per step
  • Deeper stacks amplify these issues: 3f+3cx2 always worse than 4f+2cx2, with 6x2 cad1 going backwards after step 500
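The EMA point can be seen in isolation with a toy parameter trajectory: a fixed-decay EMA tracks a smooth drift up to a bounded lag, but a double-fire-style oscillation leaves a much larger persistent gap. The numbers below are illustrative, not taken from the runs:

```python
def ema_tracking_gap(values, decay=0.99):
    """Final |EMA - raw| gap for a scalar parameter trajectory.
    A fixed-decay EMA averages through fast oscillation, so the
    tracking gap grows with oscillation amplitude."""
    ema = values[0]
    for v in values[1:]:
        ema = decay * ema + (1 - decay) * v
    return abs(ema - values[-1])

steps = range(2000)
# Steady drift, as a single-fire schedule might produce.
smooth = [s * 1e-3 for s in steps]
# Same drift plus an alternating kick, mimicking double-fire updates.
oscillating = [s * 1e-3 + (0.5 if s % 2 else -0.5) for s in steps]
```

For the smooth drift the gap converges to roughly `decay * delta / (1 - decay)`; the oscillating trajectory adds nearly the full kick amplitude on top, which is the shape of the cad1-vs-cad4 gap reported above.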

These are implementation-specific challenges, not fundamental limits. A cheaper recurrence mechanism (lightweight adapter loops, partial-block refire, amortized recursion) could capture the per-step learning benefit without the wallclock and EMA penalties.

Transferable Findings

  1. EMA instability from parameter reuse — any weight-tied architecture (Universal Transformers, LoRA, MoE) suffers EMA tracking degradation proportional to reuse frequency
  2. Training dynamics -> quantization robustness — how parameters are updated during training directly affects quant quality. 5x quant gap reduction from cad1 to cad4
  3. Asymmetric parameter allocation — more unique + fewer shared is strictly better than balanced sharing

Full Results Table

| Run | Description | Sliding BPB | Post-EMA | Quant Gap | Steps | Artifact |
| --- | --- | --- | --- | --- | --- | --- |
| Frug v2 | 6x2 symmetric | 1.1478 | 1.1570 | 0.0146 | 4,390 | 15.15MB |
| MC Run 1 | 4f+2cx2, per-row quant | 1.1377 | 1.1513 | 0.0097 | 7,694 | 16.86MB |
| MC Run 3 | + self-ref gate (C-only) + GPTQ | 1.1415 | 1.1575 | 0.0072 | 7,150 | 16.33MB |
| MC Run 6 | + PD gate (detached EMA) + GPTQ | 1.1375 | 1.1535 | 0.0075 | 7,076 | 16.65MB |
| MC Run 8 | + bidir PD + fixed cad2 + GPTQ | 1.1355 | 1.1522 | 0.0075 | 6,839 | 17.04MB |
| MC cad0 | No double-fire + GPTQ | 1.1325 | 1.1487 | 0.0070 | 7,856 | ~16.5MB |

Compliance

No test-time training on validation data. Training replay and self-distillation operate on training data only. All evaluation follows score-first protocol per issue #402.

Generated with Claude Code

@newjordan (Author) commented:

bad data...

@newjordan newjordan reopened this Mar 24, 2026
@newjordan newjordan changed the title The Frugendorff: Recursive Weight Sharing for Transformer Compression (1.1478 BPB, 15.19MB) The Frugendorff: Recursive Weight Sharing Research — Cadence Laws + Challenges (1.1325 BPB) Mar 24, 2026