
The Frugendorff: Recursive Weight Sharing Research — Cadence Laws + Challenges (1.1325 BPB)#579

Open
newjordan wants to merge 6 commits into openai:main from newjordan:submission/frugendorff-research

Conversation


@newjordan newjordan commented Mar 23, 2026


The Frugendorff: Recursive Weight Sharing Under Extreme Compression

Research submission documenting the full arc of fractal weight sharing — from the original Frugendorff (6x2 symmetric, 1.1478) through the Micro Crawler (4f+2cx2, 1.1325), including systematic ablations that map the tradeoffs of recursion under compression.

This is not a SOTA submission. It is a research direction exploring whether recursive weight reuse can improve compute-per-byte efficiency in size-constrained language models.

Best Result

  • val_bpb: 1.1325 (sliding window, stride 64) — Micro Crawler, cad0 (no double-firing)
  • val_bpb: 1.1355 — Micro Crawler + bidirectional PD, cad2
  • val_bpb: 1.1478 — Original Frugendorff Squared (6x2 uniform)
  • 8xH100 SXM, 600s, seed 1337

Architecture Evolution

Stage 1: Frugendorff Squared (1.1478 BPB)

6 unique blocks x 2 loops = 12 effective depth. All blocks shared. MLP 4x enabled by parameter savings.

| Component | Detail |
| --- | --- |
| Recursive blocks | 6 unique x 2 loops = 12 effective depth |
| MLP expansion | 4x (hidden 2560) |
| Loop positions | QR-initialized orthogonal vectors |
| Attention | GQA (10H/5KV), XSA last 2 |
| Parameters | 28.2M stored, 15.15MB artifact |
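As a sketch of the recursion (not the production code), the 6x2 layout with QR-initialized loop-position vectors might look like the following. The block bodies are simplified MLP stand-ins, the real model's GQA/XSA attention is omitted, and the class and attribute names are illustrative:

```python
import torch
import torch.nn as nn

class FrugendorffStack(nn.Module):
    """Sketch: n_blocks unique blocks reused over n_loops passes
    (6 x 2 = 12 effective layers). dim=640 gives the 4x hidden
    size of 2560 listed in the table."""
    def __init__(self, dim=640, n_blocks=6, n_loops=2):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                          nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_blocks)
        )
        # One learned vector per loop so shared blocks can tell which
        # pass they are on; QR gives an orthogonal initialization.
        q, _ = torch.linalg.qr(torch.randn(dim, n_loops))
        self.loop_pos = nn.Parameter(q.T.clone())  # (n_loops, dim)
        self.n_loops = n_loops

    def forward(self, x):  # x: (batch, seq, dim)
        for loop in range(self.n_loops):
            x = x + self.loop_pos[loop]
            for block in self.blocks:
                x = x + block(x)  # residual; same weights every loop
        return x
```

Parameters are stored once per unique block, so effective depth doubles while the artifact stores only the 6-block stack plus the loop vectors.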

Stage 2: Micro Crawler (1.1377 -> 1.1325 BPB)

Asymmetric sharing — 4 unique flat blocks + 2 shared crawler blocks. Flat section trains with zero gradient conflict. Only the crawler pair shares weights.

| Component | Detail |
| --- | --- |
| Flat blocks | 4 unique, run once |
| Crawler blocks | 2 shared x 2 loops |
| Effective depth | 8 (4 flat + 2x2 crawler) |
| Quantization | GPTQ Hessian-aware |
| Parameters | 29.8M stored, ~16.5MB artifact |
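A minimal sketch of the asymmetric layout: four unique "flat" blocks that each fire once (so they see no gradient conflict), followed by a two-block crawler that loops. Names and block internals are illustrative stand-ins:

```python
import torch
import torch.nn as nn

def mlp(dim):
    """Simplified stand-in for a transformer block."""
    return nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                         nn.GELU(), nn.Linear(4 * dim, dim))

class MicroCrawler(nn.Module):
    """Sketch of the 4f+2cx2 layout: only the crawler pair shares
    weights across loops; the flat section runs exactly once."""
    def __init__(self, dim=640, n_flat=4, n_crawl=2, n_loops=2):
        super().__init__()
        self.flat = nn.ModuleList(mlp(dim) for _ in range(n_flat))
        self.crawl = nn.ModuleList(mlp(dim) for _ in range(n_crawl))
        self.n_loops = n_loops

    def forward(self, x):
        for block in self.flat:        # each fires exactly once
            x = x + block(x)
        for _ in range(self.n_loops):  # only this pair is reused
            for block in self.crawl:
                x = x + block(x)
        return x                       # effective depth 4 + 2*2 = 8
```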

Stage 3: Persistent Deliberation

The crawler's two orthogonal firings deliberate through a learned gate. Critical discovery: the gate needs bidirectional gradient flow, which means `consensus_ref` must be an `nn.Parameter`, not a detached buffer. Gradients then flow IN (loss -> ref) and OUT (ref -> blocks) on every step.

PD showed mid-training advantages (+0.007 BPB ahead at steps 5000-7000), but the gains were fragile under EMA smoothing.
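A hedged sketch of what such a gate could look like; the gating form, shapes, and the name `DeliberationGate` are assumptions, but the key point from the text is shown directly: `consensus_ref` is a trainable `nn.Parameter`, so gradients reach it rather than stopping at a detached buffer.

```python
import torch
import torch.nn as nn

class DeliberationGate(nn.Module):
    """Sketch of a bidirectional PD gate blending the crawler's two
    firings. consensus_ref is a Parameter (not a registered buffer),
    so the loss updates it AND it shapes gradients into both firings."""
    def __init__(self, dim):
        super().__init__()
        self.consensus_ref = nn.Parameter(torch.zeros(dim))
        self.gate = nn.Linear(3 * dim, 1)

    def forward(self, fire1, fire2):  # both: (batch, seq, dim)
        ref = self.consensus_ref.expand_as(fire1)
        a = torch.sigmoid(self.gate(torch.cat([fire1, fire2, ref], dim=-1)))
        return a * fire1 + (1 - a) * fire2
```

With a detached buffer, `consensus_ref.grad` would stay `None`; as a Parameter it receives a gradient on every backward pass.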

Cadence Ablation (H1 + H2)

Systematic sweep of C/N ratio (double-fire vs single-fire steps) across two architectures at 0.25 scale (150s, 8xH100) and full scale (600s).

H1: 4f+2cx2 Cadence Sweep (0.25 scale)

| Cadence | C-step ratio | Steps | val@500 | Sliding BPB | Quant Gap |
| --- | --- | --- | --- | --- | --- |
| 1 (all C) | 100% | 702 | 1.3842 | 1.5092 | 0.136 |
| 2 (C/N) | 50% | 810 | 1.3841 | 1.4222 | 0.081 |
| 3 (C/N/N) | 33% | 854 | 1.3839 | 1.3941 | 0.061 |
| 4 (C/N/N/N) | 25% | 878 | 1.3838 | 1.3836 | 0.059 |
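The cadence rule can be expressed as a simple per-step predicate. This is a sketch inferred from the C-step ratios in the sweep tables; the function name and the assumption that the C-step leads each pattern are illustrative:

```python
def is_double_fire(step, cadence):
    """Sketch of the cadence rule: cadence k double-fires the crawler
    on one step out of every k (the C in C/N/.../N); cadence 0
    disables double-firing entirely (the cad0 configuration)."""
    if cadence == 0:
        return False
    return step % cadence == 0

# C-step ratios over a window divisible by all swept cadences
ratios = {c: sum(is_double_fire(s, c) for s in range(120)) / 120
          for c in (1, 2, 3, 4)}
```

This reproduces the 100% / 50% / 33% / 25% ratios in the table above.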

H1: Full Scale Confirmation (600s, production script)

| Config | Steps | step_avg | post_ema | Sliding BPB | Quant Gap | Peak Memory |
| --- | --- | --- | --- | --- | --- | --- |
| Run 8 (cad2) | 7,076 | 85ms | 1.1535 | 1.1355 | 0.0075 | 33,182 MiB |
| cad0 (no C) | 7,856 | 76ms | 1.1487 | 1.1325 | 0.0070 | 22,854 MiB |

H2: 3f+3cx2 Cadence Sweep (0.25 scale)

| Cadence | Steps | val@500 | Sliding BPB | Quant Gap |
| --- | --- | --- | --- | --- |
| 1 (all C) | 612 | 1.3876 | 1.6007 | 0.196 |
| 2 (C/N) | 738 | 1.3822 | 1.4587 | 0.099 |
| 3 (C/N/N) | 792 | 1.3828 | 1.4211 | 0.078 |
| 4 (C/N/N/N) | 822 | 1.3815 | 1.4030 | 0.066 |

H4: Crawler Bank at U-Net Bottleneck

Shared block at the encoder/decoder bottleneck of GS v7: it learns better per step (+0.016 BPB at step 1500) but loses on final sliding BPB (1.2371 vs 1.2145 control) because it completes 14% fewer steps.

Key Findings

What Works

1. Asymmetric > uniform sharing. 4f+2cx2 beats 6x2 by 0.010 BPB. Isolate gradient conflict to the minimal set of shared blocks.
  2. GPTQ is essential for shared weights. Quant gap drops from 0.0146 -> 0.0070.
  3. MLP 4x is the primary quality driver — weight sharing is the compression technique that enables it.
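GPTQ's Hessian-aware updates are beyond a short sketch, but the round-to-nearest per-row baseline it is being compared against (presumably akin to the earlier "per-row quant" runs) is simple to state. The function and bit width here are illustrative:

```python
import torch

def quantize_per_row_rtn(w, bits=4):
    """Naive round-to-nearest, per-row symmetric quantization: the
    baseline that Hessian-aware GPTQ improves on for shared weights.
    Returns dequantized weights and the per-row scales."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for int4
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale, scale
```

Per-row RTN bounds each element's error by half a scale step; GPTQ instead orders and compensates rounding decisions using second-order (Hessian) information, which matters more when the same weights fire twice per forward pass.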

Challenges with Current Recursion Implementation

The current double-firing mechanism shows a real per-step learning benefit (crawler bank: +0.016 BPB per step) but struggles under wall-clock constraints:

  • Compute cost: C-steps are ~2x FLOP, reducing total steps by 10-20%
  • EMA instability: Double-firing creates weight oscillation EMA can't track (gap: 0.105 at cad1 vs 0.053 at cad4)
  • Quantization sensitivity: Quant gap scales with reuse frequency (0.030 at cad1 -> 0.006 at cad4)
  • val@500 identical across cadences (1.384 +/- 0.0004) — C-steps are neutral per step
  • Deeper stacks amplify these issues: 3f+3cx2 always worse than 4f+2cx2, with 6x2 cad1 going backwards after step 500
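The EMA point can be seen in isolation with a toy parameter trajectory: a fixed-decay EMA tracks a smooth drift up to a bounded lag, but a double-fire-style oscillation leaves a much larger persistent gap. The numbers below are illustrative, not taken from the runs:

```python
def ema_tracking_gap(values, decay=0.99):
    """Final |EMA - raw| gap for a scalar parameter trajectory.
    A fixed-decay EMA averages through fast oscillation, so the
    tracking gap grows with oscillation amplitude."""
    ema = values[0]
    for v in values[1:]:
        ema = decay * ema + (1 - decay) * v
    return abs(ema - values[-1])

steps = range(2000)
# Steady drift, as a single-fire schedule might produce.
smooth = [s * 1e-3 for s in steps]
# Same drift plus an alternating kick, mimicking double-fire updates.
oscillating = [s * 1e-3 + (0.5 if s % 2 else -0.5) for s in steps]
```

For the smooth drift the gap converges to roughly `decay * delta / (1 - decay)`; the oscillating trajectory adds nearly the full kick amplitude on top, which is the shape of the cad1-vs-cad4 gap reported above.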

These are implementation-specific challenges, not fundamental limits. A cheaper recurrence mechanism (lightweight adapter loops, partial-block refire, amortized recursion) could capture the per-step learning benefit without the wallclock and EMA penalties.

Transferable Findings

  1. EMA instability from parameter reuse — any weight-tied architecture (Universal Transformers, LoRA, MoE) suffers EMA tracking degradation proportional to reuse frequency
  2. Training dynamics -> quantization robustness — how parameters are updated during training directly affects quant quality. 5x quant gap reduction from cad1 to cad4
  3. Asymmetric parameter allocation — more unique + fewer shared is strictly better than balanced sharing

Full Results Table

| Run | Description | Sliding BPB | Post-EMA | Quant Gap | Steps | Artifact |
| --- | --- | --- | --- | --- | --- | --- |
| Frug v2 | 6x2 symmetric | 1.1478 | 1.1570 | 0.0146 | 4,390 | 15.15MB |
| MC Run 1 | 4f+2cx2, per-row quant | 1.1377 | 1.1513 | 0.0097 | 7,694 | 16.86MB |
| MC Run 3 | + self-ref gate (C-only) + GPTQ | 1.1415 | 1.1575 | 0.0072 | 7,150 | 16.33MB |
| MC Run 6 | + PD gate (detached EMA) + GPTQ | 1.1375 | 1.1535 | 0.0075 | 7,076 | 16.65MB |
| MC Run 8 | + bidir PD + fixed cad2 + GPTQ | 1.1355 | 1.1522 | 0.0075 | 6,839 | 17.04MB |
| MC cad0 | No double-fire + GPTQ | 1.1325 | 1.1487 | 0.0070 | 7,856 | ~16.5MB |

Compliance

No test-time training on validation data. Training replay and self-distillation operate on training data only. All evaluation follows score-first protocol per issue #402.

Generated with Claude Code

@newjordan (Author) commented:

bad data...

@newjordan newjordan reopened this Mar 24, 2026
@newjordan newjordan changed the title The Frugendorff: Recursive Weight Sharing for Transformer Compression (1.1478 BPB, 15.19MB) The Frugendorff: Recursive Weight Sharing Research — Cadence Laws + Challenges (1.1325 BPB) Mar 24, 2026