INVALID* (0.8822 BPB mean) Medusa: Unstable S2 — DeltaNet Crawler, Legal 10mb. .77bpb single seed.#1047
newjordan wants to merge 2 commits into openai:main
Conversation
…timing fix) Fix: GPTQ_RESERVE_MS=30000 stops training at ~570s so GPTQ calibration (~12s) completes within the 600s wallclock budget. All hyperparameters identical to Medusa_IV. 3-seed mean: 0.8822 BPB (seeds 300/444/4). train_gpt.py trimmed: -632 lines of dead n-gram code removed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Full seed 300 log now available: SW BPB 1.02508673, post_ema 0.6484, roundtrip 0.8987, step 4628. Updated bytes_total to 9758873. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
It took me quite a few back-and-forths with Opus, which was initially convinced this was a legal/valid implementation, but I think I found a problem here. I'll let Opus explain: the crawler block uses causal attention, so within a single loop each position only attends to earlier positions. But the DeltaNet writes every position's hidden state into the shared state matrix, including the final positions. Loop 2 then starts from that same matrix, so its earliest positions can read information contributed by the latest positions of loop 1. That's the violation: the DeltaNet state acts as a backdoor that bypasses the causal mask.
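A minimal, self-contained sketch of the leak described above (toy code, not the PR's implementation; the outer-product update stands in for the DeltaNet kernel):

```python
import torch

T, D = 8, 4
x = torch.randn(T, D)

# Loop 1: causal attention is fine on its own, but a DeltaNet-style
# recurrent state accumulates a write from EVERY position, last one included.
state = torch.zeros(D, D)
for t in range(T):
    state = state + torch.outer(x[t], x[t])  # position t written into shared state

# Loop 2: position 0 reads a state that already contains positions 1..T-1
# from loop 1 -- "future" tokens relative to position 0 -- so the causal
# mask inside the attention block is bypassed.
y0 = x[0] @ state
```

Zeroing `state` between loops (or carrying only each position's causal prefix) removes the leak.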
Maybe; DeltaNet is based on a Triton kernel and that's kinda how they work. The crawler is def stable.
In reference to the most excellent DeltaNet submission by Pavel, #875:

- Bug 1 — BPB hardcoding 3.5:
- Bug 2 — Single val shard:
- Bug 3 — No byte-level accounting:
- Bug 4 — Single training shard:
- Bug 5 — DeltaNet state kwarg: FLAGGED

Our call at line 1445:
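For context on Bugs 1 and 3: bits-per-byte should be derived from the actual byte count of the evaluated text rather than a hardcoded bytes-per-token constant. A hedged sketch of the accounting (the function name is mine, not the call at line 1445):

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed token-level negative log-likelihood (in nats)
    into bits per byte of the underlying UTF-8 text."""
    return total_nll_nats / (math.log(2) * total_bytes)

# Count real bytes instead of assuming a fixed bytes-per-token ratio
# like 3.5, which silently skews BPB whenever the tokenizer differs.
text = "DeltaNet crawler eval"
n_bytes = len(text.encode("utf-8"))
```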
The PR this is built on has multiple issues that have not been addressed, which let it reach a lower BPB than a proper submission would. Those bugs propagate into your code too; some have been fixed, but some of the fixes break causality. Artifact_ngram is disabled in this submission, at least according to the logs, but enabling it would build an eval-time oracle from training data, i.e. use the training corpus at inference time, which is illegal under the rules. There are two issues around the data processing:
2 things that look suspicious:
The bugs have to be fixed, and whatever is happening with the EMA, the evaluation, and the MTP/Medusa mix has to be properly investigated.
It's called Medusa because of my artwork, not from Medusa heads =p. I'll run your critique through the blender and see what comes up. Thanks.
My comment was not about the artwork or the name; my comment was that you are doing MTP wrong, which here is a mix between actual Medusa heads and MTP. I referenced the actual code. I recommend properly reviewing the AI code before submitting, as anything under 1.10 will face increased scrutiny, and anything under 1.00 will have some form of bug enabling it to be so low. There are already a plethora of illegal PRs; it's best others do not add to the list. The competition appreciates quality. Ultimately, there is no prize for having the lowest BPB, but you'd definitely be more appreciated for a better overall submission.
Bug 1: DeltaNet inter-loop causality violation — YES, WE HAVE THIS. @Eppie is correct. Trace the code in _run_crawler (lines 1665–1681):

```python
delta_state = torch.zeros(...)  # zeroed once, before loop 1
for loop in range(4):           # all 4 iterations share delta_state
```

After loop 1, delta_state is the final state of chunk_delta_rule over the full T-token sequence: a compressed summary of every position, which loop 2 then reads from its very first position. This is not a theoretical concern. It explains something anomalous in our own numbers: sliding window should be better than roundtrip, since scored tokens have 960 tokens of context vs ~512 average for roundtrip, yet it isn't.
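One way to restore causality (a sketch under my own naming, not the submission's actual fix) is to reinitialize the recurrent state at the top of every loop and read it before writing the current position, so each loop is independently causal:

```python
import torch

def run_crawler_loops(x: torch.Tensor, n_loops: int = 4) -> torch.Tensor:
    """Toy stand-in for the crawler: a cumulative-state pass per loop.
    Resetting the state each loop keeps every loop independently causal."""
    T, D = x.shape
    out = x
    for _ in range(n_loops):
        state = torch.zeros(D, D)  # reset: nothing survives from the prior loop
        ys = []
        for t in range(T):
            ys.append(out[t] @ state)              # read BEFORE writing position t
            state = state + torch.outer(out[t], out[t])
        out = out + torch.stack(ys)
    return out
```

A quick causality check: perturbing the last token must leave all earlier positions' outputs unchanged.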
You continue using the AI purely for evaluation; just because the agent has not found an issue does not mean there are none. MTP, DeltaNet, the sliding-window eval, and EMA are all broken as they stand, whether or not your AI agrees. Do not put 100% of your trust in the LLM you are using: it cannot evaluate every trace, and it only could if proper traces from throughout the code were fed into it. I'm sure everyone in this competition has been using AI to some extent; it is a given and it genuinely helps, but you cannot trust the system fully given the complexity of the code. Patching one thing every time there is a comment will not solve this. Moreover, the variance you have in scores (according to your table) is also suspicious: to give you an example, other submissions have a standard deviation as low as 0.0005 or so, while yours is over 200 times larger. Calling it "unstable", as you do, shows there are genuine issues with the pipeline. Please evaluate your code properly before you spend more compute on the full 3-seed runs; it would help avoid wasting money/credits.
I'm not a calculator, bro. I'm a hard worker; this is how I push results. If you find a problem, great! I'll fix it ASAP. But I didn't even go to college, I'm not formally educated, and I'm working my ass off with the tools I have. Bugs will be caught, progress will happen. Looks like DeltaNet needs major architectural work, separate from the fact that I use AI to push results... and this is what I was hoping to figure out by releasing an unstable submission, so thank you very much. I'm very bored of chasing the lead with current methods, so I enjoy doing weird stuff. No way that happens without AI to an extreme, imo.
There is no issue with the use of AI, as per my previous message, nor is anyone unhappy with what you are doing, and trust that everyone partaking in this competition is working their gluteus maximus off. The fact that you do this without formal education is commendable, don't get me wrong. I'm just saying that you can use AI to learn with, instead of having AI do most of the work, as it will get things wrong a lot of the time; these tools are not perfect. That way you end up with a great submission, you learn a lot by having the AI explain things, and you are even able to review issues yourself so a submission is not invalid. And totally agreed that doing crazy things is better than chasing the BPB lead on the main leaderboard; that's why there is a second leaderboard. Trust that even the OAI team will be happier with something that is bad BPB-wise but unique. You have one month remaining to do it slowly and well. They have mentioned that just because a PR is raised does not mean it will get accepted, so take your time to do something good, and do write up your findings as you go along. Both the community and OAI will appreciate that more.
I am not trying to be defensive, it's just... you think I'm one-shotting these and not actively problem solving? I completely understand what you're saying. I'm trying to do real science, not vibing. No vibing. My method is: test the concept locally on the Spark, then go to a single GPU for a long run, and if it shows promise, rent the stack. My problem is constantly seeing improvements and staying engaged. Everyone is talking about my DeltaNet mistake; nobody is talking about how I pushed 40% compression at 94% accuracy, with the BPB just tacked on as an extra. If I can get the compression mechanism to ALSO boost BPB, then the results will be there. The whole problem for me is turning the clown car into more than just a compression mechanism. That's what this is: how to modify the crawler to boost BPB.
SHE'S BUSTED - INSTABILITY IN DELTANET CONFIRMED. A FIXED DELTANET HAS NO IMPROVEMENT ATM. THE ONLY WIN HERE IS THE 9.9 MB FILE SIZE FROM THE CLOWN CAR CRAWLER... TY VERY MUCH TO ANYONE WHO LOOKED AT THIS AND HELPED DIAGNOSE.

| Run | DN | EMA steps | Roundtrip | Sliding Window | SW < RT? |
|---|---|---|---|---|---|
| Medusa_IV s300 | 4 (violation) | ~3000 | ~0.85 | 0.9578 | ❌ |
| Medusa_VII DN=0 | 0 | 3088 | 1.2065 | 1.1823 | ✅ |
| Medusa_VII DN=4 fixed | 4 (causal) | 494 | 1.2204 | 1.1958 | ✅ |
Summary
UNSTABLE PROBLEM SOLVED - DeltaNet needs a rework as it stands.
With DeltaNet OFF, this submission scored val_bpb 1.18234857 @ 10 MB. I will try DeltaNet adaptations in a different manner, as this one was "learning to cheat better" every cycle (I think).
Legal resubmission of PR #1028 (Medusa: Unstable, 0.9984 BPB mean), which was flagged because GPTQ calibration read training data after the 600s wallclock cap.
Fix:
`GPTQ_RESERVE_MS=30000` stops the training loop at ~570s, so GPTQ calibration (~12s) completes within the 600s budget. The log confirms the timing. All hyperparameters identical to PR #1028 / Medusa_IV.
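The reserve can be sketched like this (the constant names follow the PR description; the loop shape is my assumption, not the submission's code):

```python
import time

GPTQ_RESERVE_MS = 30_000       # stop training this many ms before the cap
WALLCLOCK_BUDGET_MS = 600_000  # 600 s total wallclock budget

def train_until_reserve(step_fn, start_ms: float) -> int:
    """Run training steps until only the GPTQ reserve remains,
    leaving ~30 s for calibration (~12 s) plus margin."""
    deadline_ms = start_ms + WALLCLOCK_BUDGET_MS - GPTQ_RESERVE_MS
    steps = 0
    while time.monotonic() * 1000 < deadline_ms:
        step_fn()
        steps += 1
    return steps
```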
Results
3-seed mean improved from 0.9984 → 0.8822 vs the flagged submission.
Look at the EMA unravel, and the inconsistency! It's a puzzle waiting to be solved, and the answer could be an invalid system.
Architecture
`chunk_delta_rule` (`fla.ops.delta_rule`)

Code
`train_gpt.py` trimmed: ~632 lines of dead n-gram evaluation code removed (NGRAM_EVAL_ORDER=0 in this config). No logic changes.

Known Issues
High cross-seed variance (std dev ~0.105) from DeltaNet heads. Root causes identified (state dtype bug, quantization unravel through 4 crawler loops). Stabilization is active research.
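On the state dtype bug: a common stabilization for recurrent/linear-attention states (sketched generically here, not taken from the submission's code) is to accumulate the state in float32 even when activations run in bf16, since repeated low-precision accumulation drifts:

```python
import torch

def update_state_fp32(state: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Accumulate a DeltaNet-style outer-product update in float32.
    Accumulating directly in bf16 drops low-order bits every step,
    which compounds over thousands of updates."""
    assert state.dtype == torch.float32
    return state + torch.outer(k.float(), v.float())

# Many small updates stay accurate when the accumulator is fp32.
state = torch.zeros(4, 4, dtype=torch.float32)
for _ in range(1000):
    state = update_state_fp32(state, torch.full((4,), 0.01), torch.full((4,), 0.01))
```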
The DeltaNet kernel remains a black box; its state handling needs better integration with (and visibility into) the rest of the pipeline.