Medusa: Unstable — DeltaNet Crawler 0.8104 BPB, 10 MB file size (best seed), mean 0.9984, Frugendorff continuation #1028
newjordan wants to merge 1 commit into openai:main
Conversation
…ivot

- Log PR openai#771 CLOSED (TTT rules violation: adapt-then-score same tokens)
- Update competition strategy: pivot from AdamW TTT to n-gram eval cache
- Document legal TTT definition (backward-looking only, already-graded chunks)
- Track new open PRs: openai#933 (0.0804), openai#758 (1.0465), openai#1028 (0.9984 unstable)
- Add Session 4 lessons learned (lessons 17-20)
- Update abandoned approaches and key reference PRs in CLAUDE.md

https://claude.ai/code/session_0173mhLdyzis2j7NKyvDQ8ST
Hi, this looks potentially interesting but the PR is currently way too large for me to be able to review it or really understand what I should be looking at (+207k lines of code!). Would you mind cutting it down to the …

I would be happy to. It's just a Frugendorff squared with a reverse-polarized k-hole into a Schrödinger's tube sock, a very simple little trick.
Successor to PR openai#990 (ClownCar, 1.1813 BPB). Catalyzed by PR openai#875 (@shalyhinpavel, GDN 1.0226). Adds DELTA_NET_HEADS=4 (chunk_delta_rule), loop-aware 2-phase GPTQ, late-start EMA (step 4400, decay=0.99). 4 flat + 1 crawler x 4 loops, INST_DIM=32, 8xH100 SXM. Seeds: 42=0.8104, 300=0.9578, 1337=1.2269 SW BPB. High cross-seed variance (std=0.1724 vs ClownCar 0.00015) — stabilization ongoing in Medusa_VII. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
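The late-start EMA named in the commit message (start at step 4400, decay 0.99) can be sketched as follows — a minimal, hypothetical illustration, not the PR's code; plain floats stand in for weight tensors, and only the two constants come from the commit message:

```python
# Hypothetical sketch of a late-start EMA: the averaged copy only starts
# blending after EMA_START_STEP, so early noisy weights never dilute it.
EMA_START_STEP = 4400  # from the commit message
EMA_DECAY = 0.99       # from the commit message

def ema_update(ema_params, params, step):
    """Update the EMA copy in place. Plain floats stand in for tensors."""
    for i, p in enumerate(params):
        if step < EMA_START_STEP:
            ema_params[i] = p  # track raw weights exactly until the start step
        else:
            ema_params[i] = EMA_DECAY * ema_params[i] + (1.0 - EMA_DECAY) * p
```

The point of the late start is that the average is seeded from already-converged weights rather than accumulating early-training noise at high decay.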
5f731b3 to 37df300
Thanks! This is much better: from the logs and code it looks like you're doing GPTQ calibration on training data after the 600s of allotted training time are finished, meaning I believe the code as-is is accessing training data during the eval phase, which is disallowed. If this is true, you should fix this and resubmit (in a new PR, please). I haven't dug much into your architectural improvements but they look interesting. Excerpt from logs:

Confirmed. The judge is 100% correct. The smoking gun:

The fix is clear: GPTQ needs to fire with enough time remaining so the calibration (~12s) completes within 600s. Want me to implement this and submit a new PR? The change is small — maybe 5-10 lines around the wallclock check -- You're fired.
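The kind of wallclock guard being described can be sketched as follows — a minimal, hypothetical illustration (the names `GPTQ_RESERVE_MS` and `effective_max_wallclock_ms` appear in the later commit message; everything else is invented for the sketch, not the PR's actual code):

```python
import time

# Reserve GPTQ_RESERVE_MS of the 600s budget so the ~12s GPTQ calibration
# still completes inside the allotted training wallclock.
MAX_WALLCLOCK_MS = 600_000
GPTQ_RESERVE_MS = 30_000

def make_should_stop(start_time_s):
    """Return a predicate: has training used up its (reduced) budget?"""
    effective_max_wallclock_ms = MAX_WALLCLOCK_MS - GPTQ_RESERVE_MS

    def should_stop_training():
        elapsed_ms = (time.time() - start_time_s) * 1000.0
        return elapsed_ms >= effective_max_wallclock_ms

    return should_stop_training
```

Training would break out of its loop the first time the predicate returns True (~570s), leaving roughly 30s of the same 600s window for calibration.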
Yes please, and run another batch of 3 seeds to make sure the result holds -- if it does, I'll take a closer look at the architectural changes and see if the rest of the code is legal. Your train_gpt file is quite large -- this is not disqualifying or anything, but if there is dead code, old code or anything of the like, trimming it to make the submission easier to review would be great.

She's holding. .84. PR in an hour or so. Sorry about the bad hygiene - I told my agent to prepare and submit all research in case you needed to see the ablations - and then I'm running these run.py files like disposable race cars. I sincerely hope the finding is legitimate and does not waste council time. I enjoyed developing the system.
…timing fix) Fix: GPTQ_RESERVE_MS=30000 stops training at ~570s so GPTQ calibration (~12s) completes within the 600s wallclock budget. All hyperparameters identical to Medusa_IV. 3-seed mean: 0.8822 BPB (seeds 300/444/4). train_gpt.py trimmed: -632 lines of dead n-gram code removed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
#1047 she smoked!
…ock cap PR openai#1028 (Medusa_IV) flagged by judges: GPTQ calibration read training data after stopping_early at 600s, violating eval-phase data access rules. Fix: GPTQ_RESERVE_MS=30000 causes training loop to stop ~30s early so GPTQ calibration (~12s) completes within the 600s budget. Log now prints elapsed time at GPTQ start for reviewer verification. Two-line change to wallclock check (effective_max_wallclock_ms), plus timing log. All hyperparameters identical to Medusa_IV. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
0.8 BPB, 10 MB file size
Successor to PR #990 (ClownCar, 1.1813 BPB). Catalyzed by PR #875 (@shalyhinpavel, GDN 1.0226 BPB pure neural) — the same `chunk_delta_rule` kernel powering GDN state updates is active inside the Frugendorff crawler topology here. I am seeing in-model BPB of 0.25 but have not been able to stabilize the outgoing gradients yet.
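For readers unfamiliar with the mechanism: the recurrence that `chunk_delta_rule` computes in fused, chunked form can be written as a sequential reference. Below is a minimal sketch of the standard DeltaNet delta-rule update (NumPy stands in for the PyTorch tensors; this is an illustration, not the PR's kernel):

```python
import numpy as np

def delta_rule_reference(q, k, v, beta):
    """Sequential reference of the DeltaNet delta-rule fast-weight update.

    q, k, v: (T, d) arrays; beta: (T,) write strengths in [0, 1].
    The (d, d) state S is updated per token:
        S_t = S_{t-1} + beta_t * (v_t - S_{t-1} @ k_t) outer k_t
        o_t = S_t @ q_t
    """
    T, d = q.shape
    S = np.zeros((d, d))
    outs = np.zeros((T, d))
    for t in range(T):
        k_t = k[t] / max(np.linalg.norm(k[t]), 1e-6)  # unit-norm keys for stability
        pred = S @ k_t                                # value currently stored under k_t
        S = S + beta[t] * np.outer(v[t] - pred, k_t)  # delta-rule write
        outs[t] = S @ q[t]                            # read with the query
    return outs
```

With `beta = 1` the write fully replaces the value stored under a key rather than accumulating into it, which is the "delta" distinction from a plain linear-attention state update.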
Results (3-seed, track_10min_16mb, 8×H100 SXM):
Best single seed (42): 0.8104 BPB — improvement over ClownCar (1.1813).
What's New vs PR #990 (ClownCar)
- `DELTA_NET_HEADS=4` (`chunk_delta_rule` from `fla.ops.delta_rule`)
- `LOOP_AWARE_GPTQ=1`
- `EMA_START_STEP=4400` + `EMA_DECAY=0.99`

Architecture
- `chunk_delta_rule` from `fla.ops.delta_rule`, `short_conv=True`

Known Issues / Instability Note
- `chunk_delta_rule` returns a Float32 `new_state` in BF16 training → torch.compile `recompile_limit` (8) hit on all 8 ranks during eval.

Legality
Reproduce
8×H100 SXM, 600s per seed.
Credits
- `chunk_delta_rule` mechanism.
- `fla.ops.delta_rule` (flash-linear-attention)

*I humbly request more funding or support to continue advancements on this line.*