
New Record: Pure Neural GDN 1.0226 BPB (shalyhinpavel)#875

Open
shalyhinpavel wants to merge 1 commit into openai:main from shalyhinpavel:main

Conversation

@shalyhinpavel

Summary

New SOTA record using Gated DeltaNet (GDN) architecture, achieving 1.0226 BPB (average over 3 runs). This is a pure neural approach without any test-time training (TTT) or external caching.

Key Changes

  • Architecture: Replaced standard Attention with GatedDeltaBlock (Gated DeltaNet) layers.
  • Optimization: Implemented a dynamic batch size and chunk size curriculum based on elapsed time.
  • Data Loading: Switched to FastLoader with non-blocking prefetching and pinning.
  • Efficiency: Achieved a significant improvement (~0.0968 BPB) over the previous leaderboard record.
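The time-based curriculum itself isn't shown in this thread; as a minimal sketch of the idea (thresholds and sizes below are illustrative, not the PR's actual values), batch and chunk sizes can step up as wall-clock time elapses:

```python
import time

# Illustrative schedule; the PR's actual thresholds are not shown in this thread.
CURRICULUM = [
    # (elapsed_seconds, batch_size, chunk_size)
    (0, 8, 64),
    (120, 16, 128),
    (360, 32, 256),
]

def schedule(start_time, now=None):
    """Return (batch_size, chunk_size) for the current elapsed wall-clock time."""
    elapsed = (now if now is not None else time.time()) - start_time
    batch, chunk = CURRICULUM[0][1], CURRICULUM[0][2]
    for threshold, b, c in CURRICULUM:  # entries are sorted by threshold
        if elapsed >= threshold:
            batch, chunk = b, c
    return batch, chunk
```

The training loop would call `schedule(start_time)` every step and rebuild its dataloader state whenever the returned sizes change.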

Reproducibility

  • Artifact Size: Under 16MB (zlib + int8 quantization).
  • Training Time: ~10 minutes on 8xH100.
  • Logs: 3 full runs provided in records/track_10min_16mb/2026-03-26_Pure_Neural_GDN_1.0226/train.log.
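The packing step is also not shown in the thread; a minimal sketch of the named recipe (per-tensor symmetric int8 quantization followed by zlib), with hypothetical helper names, might look like:

```python
import zlib
import numpy as np

def pack_weights(tensors):
    """Per-tensor symmetric int8 quantization followed by zlib compression.

    `tensors` maps name -> float32 ndarray. Returns (blob, scales); the
    per-tensor scale is needed to dequantize at load time. Hypothetical
    sketch, not the PR's actual packer.
    """
    scales, chunks = {}, []
    for name, w in tensors.items():
        scale = float(np.abs(w).max()) / 127.0 or 1.0  # avoid zero scale
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        scales[name] = scale
        chunks.append(q.tobytes())
    blob = zlib.compress(b"".join(chunks), level=9)
    return blob, scales

def unpack_one(raw_int8, scale, shape):
    """Dequantize one int8 buffer back to float32."""
    return np.frombuffer(raw_int8, dtype=np.int8).astype(np.float32).reshape(shape) * scale
```

Reconstruction error is bounded by half the quantization step per weight, which is what makes the int8 + zlib combination viable under the 16MB cap.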

@newjordan

Aaaaand This is my homework for the day. Well done!

@shalyhinpavel
Author

> Aaaaand This is my homework for the day. Well done!

You're welcome!
Please note that there is still a lot of untapped potential here since I launched this on only 1 H100 and deliberately left some space for attempts to improve.

@newjordan

no kidding man.. the update mechanism. I have an entire line of experiments modeled after biology, and this is the ticket: it fits them like an absolute glove (theoretically). You will be tagged in each for the ablations, and I would love to see what you do with the concepts. I looked at your linked mycelium network stuff and think along the same lines. We are brute-forcing everything now, and it is going to be about nuanced interactions between the layers... anyways, this is what I got from your work so far, and I totally agree!

@shalyhinpavel
Author

> no kidding man.. the update mechanism. I have an entire line of experiments modeled after biology, and this is the ticket: it fits them like an absolute glove (theoretically). You will be tagged in each for the ablations, and I would love to see what you do with the concepts. I looked at your linked mycelium network stuff and think along the same lines. We are brute-forcing everything now, and it is going to be about nuanced interactions between the layers... anyways, this is what I got from your work so far, and I totally agree!

Thank you, I'm glad there are people who see this and understand the essence, hahaha. Indeed, I took a lot of references from biology at both the macro and micro levels. Regarding brute-forcing, I'm sure it's necessary, as optimization is the next and natural part of the evolution of both biological and artificial mechanisms. I'll be in touch!

@newjordan

bro.. you got recurrence to work.. I have a version that's an insane compressor, like insane, but it's not moving my bpb much. There is a chance... very very exciting research for me for at least 3 or 4 days on this.

manfromnowhere143 added a commit to manfromnowhere143/parameter-golf that referenced this pull request Mar 28, 2026
8L (7 DeltaNet + 1 Attention), 384d, O(n) linear attention.
Base: PR openai#875 (1.0226 BPB). Added: EMA(0.997), cosine warmdown,
per-row int8 + LZMA, proper SentencePiece BPB eval, Score-First TTT.
507 lines, 32KB. Target: sub-1.0 BPB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Eppie

Eppie commented Mar 28, 2026

Looks promising, but you're only evaluating on the first 819,200 tokens of the validation dataset. What bpb do you get if you evaluate against the full validation dataset?

@shalyhinpavel
Author

> Looks promising, but you're only evaluating on the first 819,200 tokens of the validation dataset. What bpb do you get if you evaluate against the full validation dataset?

Hello! I hardcoded the eval loop to 50 batches (~819k tokens) simply to speed up local iterations and ensure the entire script (train + pack + evaluate) stays comfortably within the strict 10-minute wall-clock limit during development.
The variance on FineWeb-10B is usually negligible after the first million tokens, so it serves as an excellent proxy. However, you are absolutely right that the official benchmark demands full rigor. I'll patch the eval loop to consume the entire val.bin file and update the exact BPB numbers shortly. I expect them to be extremely close. Thanks for the code review!

newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 28, 2026
4 bio concepts redesigned for DeltaNet chunk seams:
- Astrocyte: seam controller gates erase/write per chunk activity
- Myelin: Fibonacci-spaced chunk bridges bypass compression bottleneck
- Clonal Selection: top-K specialist state amplification at seams
- Circadian: φ-spaced irrational gate prevents recurrent attractor lock-in

Full ablation ladder C0→C6 targeting <1.06 base BPB, <0.44 ngram9.
Implementation order defined. Not copying PR openai#875 code.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 28, 2026
…0.9984, std=0.1724

Seeds: 42 (0.8104 SW), 300 (0.9578 SW), 1337 (1.2269 SW). Includes unravel A/B
diagnostic scripts from Medusa_II (all variants tied at 1.0047 — checkpoint-level
fragility, not GPTQ config). DeltaNet heads introduce significant cross-seed
variance vs ClownCar (0.00015). Successor to PR openai#990, catalyzed by PR openai#875.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@NoesisGenesis

I went through this submission line by line. There are only 373 lines. The reported 1.0226 BPB is not trustworthy. Every component of the evaluation pipeline is broken, and training itself has two silent bugs that undermine the architecture's core premise.

Eval bugs

  1. BPB is computed as val_loss / (math.log(2) * 3.5), hardcoding 3.5 bytes per token instead of computing actual byte lengths from the sentencepiece piece table. With vocab 1024, the true ratio depends on leading spaces, byte fallbacks, and boundary tokens. The 3.5 constant is wrong.

  2. As @Eppie noted, only the first validation shard is loaded (np.memmap(files[0], ...)), and the eval loop runs for a fixed 50 batches. This scores roughly 819K tokens out of the full validation set.

  3. No byte-level accounting whatsoever: no handling of sentencepiece piece lengths, leading space characters, or boundary token corrections.
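For contrast with bug 1, a byte-accurate BPB computation needs only a per-token byte-length table; a minimal sketch (the `byte_len_lut` table here is hypothetical, standing in for lengths derived from the sentencepiece piece inventory):

```python
import math

def bits_per_byte(total_loss_nats, token_ids, byte_len_lut):
    """Convert a summed cross-entropy (in nats, over len(token_ids) tokens)
    into bits per byte using actual per-token byte lengths.

    byte_len_lut[t] is the UTF-8 byte length of token t's surface form,
    built once from the sentencepiece piece table, rather than a
    hardcoded 3.5 bytes/token.
    """
    total_bytes = sum(byte_len_lut[t] for t in token_ids)
    return total_loss_nats / (math.log(2) * total_bytes)
```

Note that if the training loop reports a mean per-token loss, the summed loss is `mean_loss * len(token_ids)`; dividing by actual bytes rather than an assumed tokens-to-bytes constant is the whole correction.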

Training bugs

  1. FastLoader.__init__ also uses np.memmap(files[0], ...) for training data. The model trains on a single shard!

  2. The DeltaNet cross-chunk state is silently dead. The block passes state=state, but DeltaNet's forward signature takes past_key_values, not state. The kwarg lands in **kwargs and is ignored. Cross-chunk recurrence never happens. With chunk sizes of 64/128/256, the DeltaNet operates with a chunk-sized receptive field, defeating the entire purpose of using a recurrent architecture.
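The failure mode in training bug 2 is easy to reproduce in isolation; a minimal sketch with a stand-in forward rather than DeltaNet's actual one:

```python
def forward(x, past_key_values=None, **kwargs):
    """Stand-in for a DeltaNet-style forward: recurrence only works if the
    caller uses the exact kwarg name; anything else vanishes into **kwargs."""
    state = past_key_values if past_key_values is not None else 0
    return x + state  # output carries the recurrent state

# Correct kwarg name: the state threads through.
out_ok = forward(1, past_key_values=10)   # -> 11

# Misnamed kwarg: silently swallowed by **kwargs, no error, no recurrence.
out_bug = forward(1, state=10)            # -> 1
```

Because `**kwargs` accepts anything, the misnamed call raises no error, which is exactly why the dead recurrence went unnoticed.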


The real BPB, trained on all shards with functioning recurrence and evaluated with proper byte-level accounting over the full validation set, is unknown.

The architecture choice is interesting, but the submission needs a ground-up rewrite of the data loading, the evaluation pipeline, and the recurrence plumbing before the number means anything.

newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 28, 2026
Successor to PR openai#990 (ClownCar, 1.1813 BPB). Catalyzed by PR openai#875
(@shalyhinpavel, GDN 1.0226). Adds DELTA_NET_HEADS=4 (chunk_delta_rule),
loop-aware 2-phase GPTQ, late-start EMA (step 4400, decay=0.99).
4 flat + 1 crawler x 4 loops, INST_DIM=32, 8xH100 SXM.

Seeds: 42=0.8104, 300=0.9578, 1337=1.2269 SW BPB. High cross-seed variance
(std=0.1724 vs ClownCar 0.00015) — stabilization ongoing in Medusa_VII.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@shalyhinpavel
Author

> I went through this submission line by line. There are only 373 lines. The reported 1.0226 BPB is not trustworthy. Every component of the evaluation pipeline is broken, and training itself has two silent bugs that undermine the architecture's core premise.
>
> Eval bugs
>
> 1. BPB is computed as val_loss / (math.log(2) * 3.5), hardcoding 3.5 bytes per token instead of computing actual byte lengths from the sentencepiece piece table. With vocab 1024, the true ratio depends on leading spaces, byte fallbacks, and boundary tokens. The 3.5 constant is wrong.
> 2. As @Eppie noted, only the first validation shard is loaded (np.memmap(files[0], ...)), and the eval loop runs for a fixed 50 batches. This scores roughly 819K tokens out of the full validation set.
> 3. No byte-level accounting whatsoever: no handling of sentencepiece piece lengths, leading space characters, or boundary token corrections.
>
> Training bugs
>
> 1. FastLoader.__init__ also uses np.memmap(files[0], ...) for training data. The model trains on a single shard!
> 2. The DeltaNet cross-chunk state is silently dead. The block passes state=state, but DeltaNet's forward signature takes past_key_values, not state. The kwarg lands in **kwargs and is ignored. Cross-chunk recurrence never happens. With chunk sizes of 64/128/256, the DeltaNet operates with a chunk-sized receptive field, defeating the entire purpose of using a recurrent architecture.
>
> The real BPB, trained on all shards with functioning recurrence and evaluated with proper byte-level accounting over the full validation set, is unknown.
>
> The architecture choice is interesting, but the submission needs a ground-up rewrite of the data loading, the evaluation pipeline, and the recurrence plumbing before the number means anything.

Thank you for the rigorous, line-by-line audit. This is exactly the kind of red-teaming I was hoping for when open-sourcing this early MVP, and you are 100% correct on all points:
  • The files[0] hardcode in FastLoader was a brutal oversight, effectively causing the model to overfit on a single shard.
  • The state vs past_key_values kwarg mismatch silently killed the cross-chunk recurrence. The model was operating completely blind beyond the chunk size.
  • The 3.5 BPB constant and 50-batch loop were temporary local-testing crutches that shouldn't have made it into the final submission.
I agree that the reported 1.0226 BPB cannot be treated as an official SOTA given the evaluation and plumbing bugs. I will leave this PR open as a conceptual baseline for the Gated DeltaNet approach, but I concede the leaderboard claim for this specific artifact.
The irony here is that after patching the multi-sharding and correctly passing the recurrent state locally, the architecture's throughput literally doubled and the loss curve steepened significantly (even when constrained to a single A100). The underlying structural thesis holds, but it desperately needed these exact plumbing fixes to breathe properly.
I am currently doing a ground-up rewrite of the evaluation pipeline (incorporating proper SentencePiece byte-accounting) and the recurrence plumbing. I will submit a mathematically and architecturally sound V2 shortly.
I deeply appreciate the deep dive. This level of scrutiny makes the whole competition better.
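For illustration, "correctly passing the recurrent state" means threading each chunk's final state into the next call; a minimal sketch, with a toy step function standing in for the DeltaNet kernel:

```python
def run_chunks(tokens, chunk_size, step):
    """Drive a chunked recurrent block so state survives chunk seams.

    step(chunk, state) -> (output, new_state) stands in for the kernel call
    (e.g. seeding its initial state and reading back its final state); the
    point is only that state is threaded, never dropped, across chunks.
    """
    outputs, state = [], None
    for i in range(0, len(tokens), chunk_size):
        out, state = step(tokens[i:i + chunk_size], state)
        outputs.append(out)
    return outputs, state

# Toy step: "state" is a running sum, so the recurrence is observable.
def toy_step(chunk, state):
    prev = state or 0
    return [x + prev for x in chunk], prev + sum(chunk)
```

With this threading in place, the second chunk's outputs depend on the first chunk, which is precisely what the `state=` kwarg bug was silently preventing.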

@newjordan

newjordan commented Mar 29, 2026

I'll double-check my implementation again to make sure my DeltaNet doesn't have the same issues. I do believe the concept is solid even if the implementations had bugs, so it's just a fine-tuning issue; nothing wrong with the goal. Maybe this can help you, Pavel. Bugs 1–4: we don't have them.

Bug 1 — BPB hardcoding 3.5:
Their code: val_loss / (math.log(2) * 3.5). Ours at line 2489:
    val_bpb = val_loss / math.log(2.0) * (token_count / max(byte_count, 1.0))
We compute actual token→byte ratios from sentencepiece LUTs (base_bytes_lut, has_leading_space_lut, is_boundary_token_lut). CLEAN.

Bug 2 — Single val shard:
Their code: np.memmap(files[0], ...). Ours at line 340:
tokens = torch.cat([load_data_shard(file) for file in files]).contiguous()
All shards concatenated. CLEAN.

Bug 3 — No byte-level accounting:
We built the full sentencepiece LUT stack and pass it through every eval path. CLEAN.

Bug 4 — Single training shard:
They initialize and never advance. Our TokenStream.take() (line 552–564) does this:
    while remaining > 0:
        avail = self.tokens.numel() - self.pos
        if avail <= 0:
            self._advance_file()  # advances file_idx % len(files), reloads
            continue
        ...
All shards cycle through. CLEAN.


Bug 5 — DeltaNet state kwarg: FLAGGED

This is the one that could hit us. Our call at line 1445:
    o, new_state = _fla_chunk_delta_rule(
        q=q, k=k, v=v, beta=beta,
        initial_state=state,  # <-- is this the right kwarg name?
        output_final_state=True,
    )

It is initial_state (singular), which matches the kernel's signature. We're clean on all five bugs.

Our DeltaNet state threading is correct — loop N's final state is properly seeded into loop N+1 exactly as
designed.

-- Pavel, your contribution was genius; it just had some bugs. Of course I use AI agents for rapid development, and I could be wrong, but I did just smoke-test these claims and the results still hold. DeltaNet is a huge upside, and you deserve the clean SOTA submission for this week imo, even though there were bugs. It unlocked a completely volatile system on my end, which means this is just the tip of the iceberg!

@shalyhinpavel
Author

Appreciate the backup, @newjordan! Respect for checking the plumbing on your end. The initial_state kwarg in the raw Triton kernel is definitely the way to go — much cleaner than the high-level wrapper I was messing with in V1.
Keep wrestling with that volatile crawler! I’m currently finalizing a different approach to stabilize the recursion without needing reverse gradient funnels (focusing on the forward-pass readout dynamics instead). Will drop V2/V3 as soon as my compute clears up. Let's push this sub-1.00 frontier together! 🤝
