
INVALID* (0.8822 BPB mean) Medusa: Unstable S2 — DeltaNet Crawler, Legal 10mb. .77bpb single seed.#1047

Open
newjordan wants to merge 2 commits into openai:main from newjordan:medusa-unstable-s2

Conversation

@newjordan

@newjordan newjordan commented Mar 29, 2026

medusas_gaze_compiled_v1

Summary

UNSTABLE PROBLEM SOLVED - DELTANET needs rework as is.

With DeltaNet OFF, this submission scored val_bpb 1.18234857 @ 10mb. I will try DeltaNet adaptations in a different manner, as this one was "learning to cheat better" every cycle (I think).

Legal resubmission of PR #1028 (Medusa: Unstable, 0.9984 BPB mean), which was flagged because GPTQ calibration read training data after the 600s wallclock cap.

Fix: GPTQ_RESERVE_MS=30000 stops the training loop at ~570s, so GPTQ calibration (~12s) completes within the 600s budget. Log confirms timing:

    stopping_early: wallclock_cap train_time:570093ms step:4628/20000
    gptq:loop-aware calibrated 41 layers in 11.5s
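The reserve-before-cap pattern described in the fix can be sketched as follows. This is my illustration only, not the PR's train_gpt.py; `step_fn`, `calibrate_fn`, and `train_with_reserve` are hypothetical stand-ins, and only the config names (`WALLCLOCK_CAP_MS` is inferred from the 600s budget, `GPTQ_RESERVE_MS` comes from the PR text).

```python
import time

WALLCLOCK_CAP_MS = 600_000   # hard 600 s budget (inferred from the PR text)
GPTQ_RESERVE_MS = 30_000     # stop training early, leave room for calibration

def train_with_reserve(step_fn, calibrate_fn, max_steps=20_000):
    """Run training steps until the reserved deadline, then calibrate."""
    start = time.monotonic()
    deadline_ms = WALLCLOCK_CAP_MS - GPTQ_RESERVE_MS  # ~570 s
    step = 0
    while step < max_steps:
        elapsed_ms = (time.monotonic() - start) * 1000
        if elapsed_ms >= deadline_ms:
            print(f"stopping_early: wallclock_cap train_time:{elapsed_ms:.0f}ms "
                  f"step:{step}/{max_steps}")
            break
        step_fn(step)
        step += 1
    calibrate_fn()  # GPTQ calibration (~12 s) still lands inside the 600 s cap
    return step
```

The point of the design is that the training loop, not the calibration pass, absorbs the timing uncertainty.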

All hyperparameters identical to PR #1028 / Medusa_IV.

Results

Seed    SW BPB    Post-EMA BPB    Int6 Roundtrip
300     1.0251    0.6484          0.8987
444     0.8469    0.4330          0.7159
4       0.7744    0.4339          0.6271

Mean BPB: 0.8822 (std dev ~0.105)

3-seed mean improved from 0.9984 → 0.8822 vs the flagged submission.

Look at the EMA unravel, and the inconsistency! It's a puzzle waiting to be solved, and the answer could be an invalid system.

Architecture

  • 4 flat layers + 1 crawler × 4 loops (Frugendorff), INST_DIM=32
  • DELTA_NET_HEADS=4, chunk_delta_rule (fla.ops.delta_rule)
  • Loop-aware 2-phase GPTQ (41 layers), int6+zstd
  • EMA_START_STEP=4400, EMA_DECAY=0.99
  • ~9.8MB artifact, 8xH100 SXM, 600s training
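Assuming the EMA bullets refer to the usual late-start shadow-weights average (the actual train_gpt.py wiring is not shown in this PR), the update can be sketched in pure Python:

```python
# Hedged sketch of a late-start weight EMA. EMA_START_STEP and EMA_DECAY are
# the PR's config names; ema_update and the flat list of weights are my
# stand-ins for illustration.
EMA_START_STEP = 4400
EMA_DECAY = 0.99

def ema_update(shadow, weights, step):
    """Return updated shadow weights; the EMA only engages at EMA_START_STEP."""
    if step < EMA_START_STEP:
        return None                       # EMA not active yet
    if shadow is None:
        return list(weights)              # seed the shadow from current weights
    # ema = decay * ema + (1 - decay) * w
    return [EMA_DECAY * s + (1.0 - EMA_DECAY) * w
            for s, w in zip(shadow, weights)]
```

With EMA_START_STEP=4400 and a run stopping near step ~4628–5042, the shadow only sees a few hundred updates, which is relevant to the EMA anomaly discussed below.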

Code

train_gpt.py trimmed: ~632 lines of dead n-gram evaluation code removed (NGRAM_EVAL_ORDER=0 in this config). No logic changes.

Known Issues

High cross-seed variance (std dev ~0.105) from DeltaNet heads. Root causes identified (state dtype bug, quantization unravel through 4 crawler loops). Stabilization is active research.

The DeltaNet black-box state handling may need better integration.

Octavian and others added 2 commits March 28, 2026 19:52
…timing fix)

Fix: GPTQ_RESERVE_MS=30000 stops training at ~570s so GPTQ calibration
(~12s) completes within the 600s wallclock budget. All hyperparameters
identical to Medusa_IV. 3-seed mean: 0.8822 BPB (seeds 300/444/4).
train_gpt.py trimmed: -632 lines of dead n-gram code removed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Full seed 300 log now available: SW BPB 1.02508673, post_ema 0.6484,
roundtrip 0.8987, step 4628. Updated bytes_total to 9758873.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@newjordan newjordan changed the title Medusa: Unstable S2 — DeltaNet Crawler, Legal (0.8822 BPB mean) Medusa: Unstable S2 — DeltaNet Crawler, Legal (0.8822 BPB mean) 10mb. .77bpb single seed. Mar 29, 2026
@newjordan newjordan changed the title Medusa: Unstable S2 — DeltaNet Crawler, Legal (0.8822 BPB mean) 10mb. .77bpb single seed. (0.8822 BPB mean) Medusa: Unstable S2 — DeltaNet Crawler, Legal 10mb. .77bpb single seed. Mar 29, 2026
@Eppie

Eppie commented Mar 29, 2026

It took me quite a few back-and-forths with Opus, who initially was convinced this was a legal/valid implementation, but I think I found a problem here. I'll let Opus explain:


> The crawler block uses causal attention, so at the end of loop 1, position t's hidden state only knows about tokens 0..t. That's fine.
>
> But the DeltaNet writes every position's hidden state into the shared matrix S, sequentially, across the full sequence. After loop 1 finishes, S contains information from all T positions.
>
> Loop 2 starts with that same S. When position t reads from S in loop 2, it can retrieve information that was written by positions t+1, t+2, ..., T-1 during loop 1.
>
> Position t is now making its prediction with knowledge of future tokens.
>
> That's the violation. The DeltaNet state acts as a backdoor that bypasses the causal mask.
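A toy numpy sketch (mine, not the fla.ops kernel) makes the leak concrete: a linear fast-weight state S accumulates outer-product writes left-to-right, so a single pass is strictly causal, but carrying the final state into a second pass lets the earliest position read every later position's write.

```python
import numpy as np

# Toy stand-in for a DeltaNet-style fast-weight state (NOT the fla.ops kernel):
# each position reads S before writing its own (k, v) into it, so one pass is
# causal on its own.
def delta_pass(keys, values, S):
    reads = []
    for k, v in zip(keys, values):
        reads.append(S @ k)           # read before write: causal within a pass
        S = S + np.outer(v, k)        # write this position into the state
    return reads, S

rng = np.random.default_rng(0)
keys = rng.normal(size=(3, 4))
values = rng.normal(size=(3, 4))

reads1, S_final = delta_pass(keys, values, np.zeros((4, 4)))
reads2, _ = delta_pass(keys, values, S_final)  # loop 2 seeded with final state

# Loop 1, position 0 reads an empty state. Loop 2, position 0 reads a state
# that already absorbed writes from ALL positions, i.e. future tokens leaked.
assert np.allclose(reads1[0], 0.0)
assert not np.allclose(reads2[0], 0.0)
```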

@newjordan
Author

newjordan commented Mar 29, 2026

> It took me quite a few back-and-forths with Opus, who initially was convinced this was a legal/valid implementation, but I think I found a problem here. I'll let Opus explain: [...] That's the violation. The DeltaNet state acts as a backdoor that bypasses the causal mask.

Maybe. DeltaNet is based on a Triton kernel and that's kinda how they work; the crawler is definitely stable on its own and compresses with 94% accuracy on initial seed testing. The Triton kernels are a black box that pass data. The question is whether that data is forward-looking (or corrupt); in my opinion, it's not. Even if it was, that's a small adjustment to a variable, and the system still needs to be tested, etc. The basic architecture works outside of some small variable in the Triton node... I would argue there are two cool things happening here and neither is even close to optimized. Unstable concoction right here.

@newjordan
Author

newjordan commented Mar 29, 2026

In reference to the most excellent DeltaNet submission by Pavel - #875

Bug 1 — BPB hardcoding 3.5:
Their code: val_loss / (math.log(2) * 3.5). Ours at line 2489:

    val_bpb = val_loss / math.log(2.0) * (token_count / max(byte_count, 1.0))

We compute actual token→byte ratios from sentencepiece LUTs (base_bytes_lut, has_leading_space_lut, is_boundary_token_lut). CLEAN.
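The difference between the hardcoded and measured conversions can be sketched as follows. This is my illustration, not either PR's code; it assumes val_loss is mean cross-entropy in nats per token, so the two agree exactly when the corpus really averages 3.5 bytes/token.

```python
import math

def bpb_hardcoded(val_loss: float) -> float:
    # Assumes a fixed 3.5 bytes per token.
    return val_loss / (math.log(2) * 3.5)

def bpb_measured(val_loss: float, token_count: int, byte_count: int) -> float:
    # Convert nats/token -> bits/token, then scale by the measured tokens/byte.
    bits_per_token = val_loss / math.log(2.0)
    return bits_per_token * (token_count / max(byte_count, 1.0))

# When the corpus really does average 3.5 bytes/token, the two coincide;
# a longer average token (more bytes/token) makes the measured BPB lower.
loss = 2.5
assert abs(bpb_hardcoded(loss) - bpb_measured(loss, 1000, 3500)) < 1e-12
```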

Bug 2 — Single val shard:
Their code: np.memmap(files[0], ...). Ours at line 340:

    tokens = torch.cat([load_data_shard(file) for file in files]).contiguous()

All shards concatenated. CLEAN.

Bug 3 — No byte-level accounting:
We built the full sentencepiece LUT stack and pass it through every eval path. CLEAN.

Bug 4 — Single training shard:
They initialize and never advance. Our TokenStream.take() (lines 552–564) does this:

    while remaining > 0:
        avail = self.tokens.numel() - self.pos
        if avail <= 0:
            self._advance_file()  # advances file_idx % len(files), reloads
            continue
        ...

All shards cycle through. CLEAN.
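As a sanity check of the wrap-around claim, here is a hypothetical minimal TokenStream matching the take() excerpt above. Field names (tokens, pos, file_idx, _advance_file) follow the excerpt; the list-of-lists "shards" and everything else are my stand-ins, not the PR's loader.

```python
class TokenStream:
    """Minimal illustration: shards cycle round-robin via file_idx % len(files)."""
    def __init__(self, shards):
        self.files = shards          # stand-in for token shard files
        self.file_idx = 0
        self.tokens = []
        self.pos = 0

    def _advance_file(self):
        # Wraps around, so every shard is eventually revisited.
        self.tokens = self.files[self.file_idx % len(self.files)]
        self.file_idx += 1
        self.pos = 0

    def take(self, remaining):
        out = []
        while remaining > 0:
            avail = len(self.tokens) - self.pos
            if avail <= 0:
                self._advance_file()
                continue
            n = min(avail, remaining)
            out.extend(self.tokens[self.pos:self.pos + n])
            self.pos += n
            remaining -= n
        return out

ts = TokenStream([[1, 2], [3, 4]])
assert ts.take(6) == [1, 2, 3, 4, 1, 2]   # wraps past the last shard
```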


Bug 5 — DeltaNet state kwarg: FLAGGED

Our call at line 1445:

    o, new_state = _fla_chunk_delta_rule(
        q=q, k=k, v=v, beta=beta,
        initial_state=state,  # <-- is this the right kwarg name?
        output_final_state=True,
    )

initial_state (singular). We're clean on all five bugs.

Our DeltaNet state threading is correct — loop N's final state is properly seeded into loop N+1 exactly as designed.

@CiprianFlorin-Ifrim
Contributor

The PR this is built on has multiple issues that have not been addressed, enabling it to have a lower BPB than otherwise with a proper submission.

These bugs propagate in your code too, even though some have been fixed, and some of these fixes break causality.
@Eppie is correct that the DeltaNet inter-loop state breaks causality (DELTA_NET_HEADS > 0)
After loop 1, the DeltaNet state S contains writes from all positions 0..T-1. In loop 2, position t reads from S and retrieves information written by positions t+1..T-1 during loop 1. This is a causal violation as future token information leaks through the carried state.

Artifact_ngram is disabled in this submission, at least according to the logs, but if enabled it would build an eval-time oracle from training data, using the training corpus at inference time, which would be illegal under the rules.

There are 2 issues around the data processing:

  1. The shard prefill skips the header offset in both TrainNgramOracle functions: it reads from byte 0, ingesting the 1024-byte shard header (256 × int32) as 512 garbage uint16 tokens into the hash tables. load_data_shard correctly uses offset=header_bytes, but both oracle prefill methods don't, corrupting 40,960 entries across 80 shards. Inactive here because the mixer oracle is disabled in this run (as mentioned above).
  2. You have an issue where 2047 tail tokens are dropped from the validation dataset, so you are not currently evaluating on the whole val dataset.
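Point 1 can be illustrated with a hedged numpy sketch. The shard layout (1024-byte header of 256 int32s, then uint16 tokens) follows the description above; write_shard, the magic value, and the loader names are my stand-ins, not the competition's actual code.

```python
import os
import tempfile
import numpy as np

def write_shard(path, tokens):
    header = np.zeros(256, dtype=np.int32)   # 1024-byte header, as described
    header[0] = 20240520                      # hypothetical magic, illustration only
    with open(path, "wb") as f:
        f.write(header.tobytes())
        f.write(np.asarray(tokens, dtype=np.uint16).tobytes())

def load_buggy(path):
    # BUG: offset=0 reinterprets the 1024-byte header as 512 uint16 "tokens".
    return np.memmap(path, dtype=np.uint16, mode="r")

def load_fixed(path):
    # load_data_shard-style read: skip the header bytes.
    return np.memmap(path, dtype=np.uint16, mode="r", offset=1024)

path = os.path.join(tempfile.mkdtemp(), "shard.bin")
write_shard(path, [7, 8, 9])
assert len(load_buggy(path)) == 512 + 3      # 512 garbage tokens from the header
assert list(load_fixed(path)) == [7, 8, 9]
```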

3 things that look suspicious:

  1. The Sliding Window eval is WORSE than without, which shouldn't happen. The score is 23% higher, which suggests there is an issue with the scoring function.
  2. EMA gets enabled towards the end and does only 642 steps, yet within those 642 steps it drops the loss by 56%, which is impossible. The training loss tells the same story: 2.15 at step 3500, 1.50 at step 4000, 0.88 at step 4500. Your code pretty much halves its loss every 500 steps, which is impossible.
  3. It also looks like you are mixing MTP (multi-token prediction) with Medusa heads. Medusa heads are used for speculative decoding at inference time; MTP is used to improve training. You seem to be applying a weird mix of the two, where you apply the speculative decoding approach during training as a training auxiliary. Each mtp_head is an independent CastedLinear that takes the same backbone output x and predicts a different future offset, with no chaining between heads. This is Medusa, not MTP. As a training auxiliary this is weak: predicting t+2 directly from the hidden state at t, without knowing t+1, is a much noisier signal than chained prediction. The backbone doesn't learn sequential planning; it just gets a blurry "also try to predict further ahead from the same features" gradient. The value as a training regularizer is marginal at best, and proper Medusa would only help at inference time, which is not an issue for this competition as there is enough time for the evaluation to complete.

The bugs have to be fixed, and whatever is happening with EMA, evaluation and the MTP/medusa mix have to be properly investigated.

@newjordan
Author

> The PR this is built on has multiple issues that have not been addressed, enabling it to have a lower BPB than otherwise with a proper submission. [...] The bugs have to be fixed, and whatever is happening with EMA, evaluation and the MTP/medusa mix have to be properly investigated.

It's called Medusa because of my artwork, not from Medusa heads =p. I'll run your critique through the blender and see what comes up. Thanks.

@CiprianFlorin-Ifrim
Contributor

> its called medusa because of my artwork, not from medusa heads =p. Ill run your critique throught the blender and see what comes up. thanks.

My comment was not about the artwork or the name, my comment was that you are doing MTP wrong, which is a mix between actual medusa heads and MTP. I referenced the actual code.

I recommend properly reviewing the AI code before submitting, as anything under 1.10 will have increased scrutiny, and anything under 1.00 will have some form of bug enabling it to be so low. There are already a plethora of illegal PRs; it's best others do not add to the list. The competition appreciates quality. Ultimately, there is no prize for having the lowest BPB, but you'd definitely be more appreciated for a better overall submission.

@newjordan
Author

newjordan commented Mar 29, 2026

> its called medusa because of my artwork, not from medusa heads =p. [...]

> My comment was not about the artwork or the name, my comment was that you are doing MTP wrong, which is a mix between actual medusa heads and MTP. I referenced the actual code. [...] you'd definitely be more appreciated for a better overall submission.


Bug 1: DeltaNet inter-loop causality violation — YES, WE HAVE THIS

@Eppie is correct. Trace the code in _run_crawler (lines 1665–1681):

    delta_state = torch.zeros(...)  # zeros before loop 1

    for loop in range(4):  # 4 iterations
        ...
        x_loop, delta_state = self.delta_net(x_loop, delta_state)
        x = x_loop

After loop 1, delta_state = final state of chunk_delta_rule over the full T-token sequence — it's a compressed summary of all positions 0..T-1. In loop 2, initial_state=delta_state is passed. The kernel processes causally left-to-right, but position t in loop 2 initializes from a state that already absorbed writes from positions t+1..T-1 during loop 1. Future tokens leak into every prediction.

This is not a theoretical concern. It explains something anomalous in our own numbers:

┌───────────┬────────────────┬─────────────────────┬─────────────┐
│ Run │ Roundtrip INT6 │ Sliding Window INT6 │ SW/RT ratio │
├───────────┼────────────────┼─────────────────────┼─────────────┤
│ CC_II │ ~0.9340 │ 1.0427 │ +11.7% │
├───────────┼────────────────┼─────────────────────┼─────────────┤
│ Medusa_II │ ~0.8555 │ 1.0366 │ +21.1% │
└───────────┴────────────────┴─────────────────────┴─────────────┘

Sliding window should be better than roundtrip: scored tokens have 960 tokens of context vs ~512 average for roundtrip. The inversion is the fingerprint of the causality violation: the DeltaNet look-ahead disproportionately helps early positions (position 0 in loop 2 effectively sees the whole sequence), which dominate the roundtrip average. Sliding window only scores the last 64 tokens of each window, where genuine causal context is already rich and the look-ahead adds less. So roundtrip is artificially deflated; sliding window is closer to truth.

Everything else came back solid. Fixing this now. There are no Medusa heads in this; it's not whatever that is. TY. I am actively trying to stabilize this and the numbers don't make sense. The compression is amazing, but the DeltaNet is unstable so far.
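For what it's worth, the obvious causal fix can be sketched in a few lines, assuming the crawler loop has the shape of the _run_crawler excerpt quoted earlier. The names run_crawler_fixed and make_zero_state are my stand-ins; re-zeroing per loop removes the leak, at the cost of the intended inter-loop memory.

```python
# Hedged sketch only: reset the DeltaNet state at the top of every loop so no
# pass can read writes made by later positions during an earlier pass.
def run_crawler_fixed(x, delta_net, make_zero_state, n_loops=4):
    for _ in range(n_loops):
        delta_state = make_zero_state()             # reset: no carry across loops
        x, delta_state = delta_net(x, delta_state)  # final state is discarded
    return x
```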

@CiprianFlorin-Ifrim
Contributor

You continue using the AI only as a matter of evaluation; just because the agent has not found an issue does not mean there are none. MTP, DeltaNet, Sliding Window eval and EMA are all broken as they stand, whether or not your AI agrees. Do not put 100% of your trust in the LLM you are using, as it is not able to evaluate every trace; that would only be possible with proper traces throughout the code, fed into it.

I'm sure everyone in this competition has been using AI to some extent; it is a given and it genuinely helps, but you cannot trust the system fully given the complexity of the code. Patching one thing every time there is a comment will not solve this. Moreover, the variation you have in scores (according to your table) is also suspicious: to give you an example, other submissions have std devs as low as 0.0005 (and thereabouts); you have over 200 times more. It being "unstable", as you are mentioning, showcases there are genuine issues with the pipeline. Please evaluate your code properly before you use more compute for the proper 3-seed runs; it would help not to use a lot of money/credits.

@newjordan
Author

newjordan commented Mar 29, 2026

> You continue using the AI only as a matter of evaluation, just because the agent has not found an issue, it does not mean there are none. [...] Please evaluate your code properly before you use more compute for the proper 3-seeds runs, it would help not use a lot of money/credits.

I'm not a calculator, bro. I'm a hard worker; this is how I push results. If you find a problem, great! I'll fix it ASAP. But I didn't even go to college, I'm not formally educated, and I'm working my ass off with the tools I have. Bugs will be caught, progress will happen. Looks like DeltaNet needs major architectural work, outside of the fact I use AI to push results... and this is what I was hoping to figure out by releasing an unstable submission. So TY very much. I'm very bored of chasing the lead with current methods, so I enjoy doing weird stuff. No way that happens without AI to an extreme, IMO.

@newjordan newjordan changed the title (0.8822 BPB mean) Medusa: Unstable S2 — DeltaNet Crawler, Legal 10mb. .77bpb single seed. INVALID MOST LIKELY (0.8822 BPB mean) Medusa: Unstable S2 — DeltaNet Crawler, Legal 10mb. .77bpb single seed. Mar 29, 2026
@CiprianFlorin-Ifrim
Contributor

There is no issue with the use of AI as per my previous message, nor is anyone unhappy with what you are doing, and trust that everyone partaking in this competition is working their gluteus maximus off. And the fact you do this without field education is commendable, don't get me wrong, I'm just saying that you can use AI to learn with it, instead of having AI do most of the work, as it will get things wrong a lot of the time, they are not perfect. That way you have a great submission, you learn a lot by having the AI explain things, and then are even able to review issues yourself so a submission is not invalid.

And totally agreed that doing crazy things is better than chasing the BPB lead on the main leaderboard; that's why there is a 2nd leaderboard. And trust that even the OAI team will be happier with something that is bad BPB-wise but is unique. For that you have 1 month remaining to do it slowly and well. They have mentioned that just because a PR is raised, it does not mean it will get accepted, so take your time to do something good, and do write up your findings as you go along. Both the community and OAI will appreciate that more.

@newjordan
Author

newjordan commented Mar 29, 2026

> There is no issue with the use of AI as per my previous message, nor is anyone unhappy with what you are doing [...] Both the community and OAI will appreciate that more.

I am not trying to be defensive, it's just like... you think I'm one-shotting these? And not actively problem solving? I completely understand what you're saying. I'm trying to do my best at real science, not vibing. No vibing. My method is: test the concept locally on Spark, then go to a single GPU for a long run, and if it shows promise, rent the stack. My problem is constantly seeing improvements and staying engaged. Everyone's talking about my DeltaNet mistake; nobody is talking about how I pushed 40% compression at 94% accuracy, and the BPB is just a tacked-on extra... If I can get the compression mechanism to ALSO boost BPB, then the results will be there. The whole problem for me is turning the clown car into not just a compression mechanism... that's what this is. How to modify the crawler to boost BPB.

@newjordan
Author

newjordan commented Mar 29, 2026

SHE'S BUSTED - INSTABILITY IN DELTA NET CONFIRMED. A FIXED DELTA HAS NO IMPROVEMENT ATM. ONLY WIN HERE IS THE FILE SIZE OF 9.9MB FROM THE CLOWN CAR CRAWLER...

TY VERY MUCH TO ANYONE WHO LOOKED AT THIS AND HELPED DIAGNOSE.

│ Run                   │ DN            │ EMA steps │ Roundtrip │ Sliding Window │ SW < RT? │
│ Medusa_IV s300        │ 4 (violation) │ ~3000     │ ~0.85     │ 0.9578         │ ❌       │
│ Medusa_VII DN=0       │ 0             │ 3088      │ 1.2065    │ 1.1823         │ ✅       │
│ Medusa_VII DN=4 fixed │ 4 (causal)    │ 494       │ 1.2204    │ 1.1958         │ ✅       │

@newjordan newjordan changed the title INVALID MOST LIKELY (0.8822 BPB mean) Medusa: Unstable S2 — DeltaNet Crawler, Legal 10mb. .77bpb single seed. INVALID* (0.8822 BPB mean) Medusa: Unstable S2 — DeltaNet Crawler, Legal 10mb. .77bpb single seed. Mar 29, 2026