The Frugendorff: Recursive Weight Sharing Research — Cadence Laws + Challenges (1.1325 BPB) #579
newjordan wants to merge 6 commits into openai:main
The Frugendorff: Recursive Weight Sharing Under Extreme Compression
A research submission documenting the full arc of fractal weight sharing, from the original Frugendorff (6x2 symmetric, 1.1478 BPB) through the Micro Crawler (4f+2cx2, 1.1325 BPB), including systematic ablations that map the tradeoffs of recursion under compression.
This is not a SOTA submission. It is a research direction exploring whether recursive weight reuse can improve compute-per-byte efficiency in size-constrained language models.
Best Result
Architecture Evolution
Stage 1: Frugendorff Squared (1.1478 BPB)
6 unique blocks x 2 loops = an effective depth of 12. All blocks are shared. The parameter savings fund a 4x MLP expansion.
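The symmetric sharing scheme can be sketched as a plain loop; this is an illustrative toy (the "blocks" stand in for transformer layers, and the function name is hypothetical), not the submission's implementation:

```python
def frugendorff_forward(x, blocks, loops=2):
    """Apply each shared block `loops` times in sequence.

    Effective depth = len(blocks) * loops, while parameter count
    stays at len(blocks) blocks' worth of weights.
    """
    for _ in range(loops):
        for block in blocks:
            x = block(x)  # the same weights fire on every loop
    return x


# Toy demonstration: 6 unique scalar "blocks", fired 2x each = 12 firings.
blocks = [lambda x, k=k: x + k for k in range(6)]
out = frugendorff_forward(0, blocks)  # -> 30 (sum 0..5, applied twice)
```

The point of the sketch is the parameter accounting: depth doubles while weights do not, which is what frees budget for the 4x MLP.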
Stage 2: Micro Crawler (1.1377 -> 1.1325 BPB)
Asymmetric sharing — 4 unique flat blocks + 2 shared crawler blocks. Flat section trains with zero gradient conflict. Only the crawler pair shares weights.
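The asymmetric layout separates the two regimes; a minimal sketch (names and toy blocks are hypothetical):

```python
def micro_crawler_forward(x, flat_blocks, crawler_blocks, crawler_loops=2):
    """4f+2cx2 layout: unique flat blocks, then a shared crawler pair.

    Flat blocks each fire exactly once, so their gradients come from a
    single position (zero gradient conflict). Only the crawler pair
    refires, so weight-sharing pressure is confined to those two blocks.
    """
    for block in flat_blocks:       # fires once each: no gradient conflict
        x = block(x)
    for _ in range(crawler_loops):  # only this pair shares weights
        for block in crawler_blocks:
            x = block(x)
    return x
```

With 4 flat and 2 crawler blocks at 2 loops, the effective depth is 4 + 2*2 = 8 for 6 blocks' worth of parameters.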
Stage 3: Persistent Deliberation
The crawler's two orthogonal firings deliberate through a learned gate. Critical discovery: the gate needs bidirectional gradient flow. consensus_ref must be an nn.Parameter, not a detached buffer, so gradients flow in (loss -> ref) and out (ref -> blocks) on every step. PD showed mid-training advantages (+0.007 BPP ahead at steps 5000-7000), but the gains were fragile under EMA smoothing.
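A hedged sketch of the gate mechanism (module structure, names, and shapes are assumptions; only the nn.Parameter-vs-buffer distinction comes from the text above):

```python
import torch
import torch.nn as nn


class DeliberationGate(nn.Module):
    """Blend the crawler's two firings through a learned gate."""

    def __init__(self, dim):
        super().__init__()
        # Trainable, NOT a detached buffer: register_buffer would cut the
        # ref -> blocks gradient path, leaving only one-way flow.
        self.consensus_ref = nn.Parameter(torch.zeros(dim))
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, first_fire, second_fire):
        g = torch.sigmoid(self.gate(torch.cat([first_fire, second_fire], dim=-1)))
        blended = g * first_fire + (1 - g) * second_fire
        # Gradients flow IN (loss -> consensus_ref) and OUT (the blended
        # activations carry gradient back through both firings to the blocks).
        return blended + self.consensus_ref
```

Because consensus_ref participates in the output as a leaf parameter, a backward pass updates it directly while still propagating gradient through both firings.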
Cadence Ablation (H1 + H2)
Systematic sweep of C/N ratio (double-fire vs single-fire steps) across two architectures at 0.25 scale (150s, 8xH100) and full scale (600s).
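One way to express a C/N cadence as a per-step schedule; the mapping below (C double-fire steps out of every N, front-loaded within the window) is an illustrative assumption, not the sweep's exact implementation:

```python
def cadence(step, c, n):
    """Crawler loop count for this step under a C/N cadence.

    Out of every n training steps, c are double-fire steps (crawler
    loops twice); the remaining n - c are cheaper single-fire steps.
    """
    return 2 if (step % n) < c else 1


# e.g. a 1/4 cadence: one double-fire step in every four.
fires = [cadence(s, c=1, n=4) for s in range(8)]  # [2, 1, 1, 1, 2, 1, 1, 1]
```

Sweeping c from 0 to n then spans the range from a purely single-fire model to double-firing on every step, which is the axis the H1/H2 ablations measure.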
H1: 4f+2cx2 Cadence Sweep (0.25 scale)
H1: Full Scale Confirmation (600s, production script)
H2: 3f+3cx2 Cadence Sweep (0.25 scale)
H4: Crawler Bank at U-Net Bottleneck
Shared block at the encoder/decoder bottleneck of GS v7: learns better per step (+0.016 BPP at step 1500) but loses on final sliding BPP (1.2371 vs 1.2145 control) due to 14% fewer steps.
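The bottleneck placement can be sketched as follows; the skip-connection form and all names are illustrative assumptions about GS v7's structure, not its actual code:

```python
def unet_with_crawler_bank(x, encoder, crawler_bank, decoder, loops=2):
    """Shared crawler bank refired at the encoder/decoder bottleneck.

    The refires improve per-step learning, but each forward pass costs
    more wallclock, so a fixed time budget buys fewer steps overall.
    """
    skips = []
    for enc in encoder:
        x = enc(x)
        skips.append(x)
    for _ in range(loops):           # shared weights fire repeatedly here
        x = crawler_bank(x)
    for dec, skip in zip(decoder, reversed(skips)):
        x = dec(x + skip)            # additive skips (an assumption)
    return x
```

This makes the tradeoff in the result concrete: the loop adds depth only at the bottleneck, which is where the +0.016 per-step gain comes from, while the extra firings are what cost the 14% of steps.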
Key Findings
What Works
Challenges with Current Recursion Implementation
The current double-firing mechanism shows a real per-step learning benefit (crawler bank: +0.016 BPP at step 1500) but struggles under wallclock constraints:
These are implementation-specific challenges, not fundamental limits. A cheaper recurrence mechanism (lightweight adapter loops, partial-block refire, amortized recursion) could capture the per-step learning benefit without the wallclock and EMA penalties.
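One of the cheaper alternatives named above, a lightweight adapter loop, might look like this; the module, its shapes, and its placement are all illustrative assumptions, not a proposed final design:

```python
import torch
import torch.nn as nn


class AdapterLoop(nn.Module):
    """Recur a small bottleneck adapter instead of refiring a full block.

    The adapter costs O(dim * bottleneck) parameters and FLOPs per loop,
    far less than refiring a full transformer block, so the recurrence's
    per-step benefit is bought with a much smaller wallclock penalty.
    """

    def __init__(self, dim, bottleneck=16, loops=2):
        super().__init__()
        self.loops = loops
        self.adapter = nn.Sequential(
            nn.Linear(dim, bottleneck),
            nn.GELU(),
            nn.Linear(bottleneck, dim),
        )

    def forward(self, x):
        for _ in range(self.loops):   # shared adapter weights recur
            x = x + self.adapter(x)   # residual form keeps the loop stable
        return x
```

Partial-block refire and amortized recursion would slot into the same place; the common idea is shrinking what recurs, not abandoning the recurrence.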
Transferable Findings
Full Results Table
Compliance
No test-time training on validation data. Training replay and self-distillation operate on training data only. All evaluation follows score-first protocol per issue #402.
Generated with Claude Code