
New Record: Pure Neural GDN 1.0226 BPB (shalyhinpavel) #875

Open
shalyhinpavel wants to merge 1 commit into openai:main from shalyhinpavel:main

Conversation

@shalyhinpavel

Summary

New SOTA record using Gated DeltaNet (GDN) architecture, achieving 1.0226 BPB (average over 3 runs). This is a pure neural approach without any test-time training (TTT) or external caching.

Key Changes

  • Architecture: Replaced standard Attention with GatedDeltaBlock (Gated DeltaNet) layers.
  • Optimization: Implemented a dynamic batch size and chunk size curriculum based on elapsed time.
  • Data Loading: Switched to FastLoader with non-blocking prefetching and pinning.
  • Efficiency: Achieved a significant BPB improvement (~0.0968) over the previous leaderboard record.
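The dynamic curriculum bullet can be sketched roughly as follows. This is an illustrative reconstruction, not the submission's code: the stage thresholds are hypothetical, and only the three (batch, chunk) stages themselves come from this thread (the B=64/T=64 → B=128/T=128 → B=192/T=256 ladder).

```python
import time

# Hypothetical sketch of a wall-clock batch/chunk curriculum. The stage
# boundaries (30% / 70% of the budget) are assumptions; only the three
# (batch, chunk) stages come from the thread.
def curriculum(elapsed_s: float, budget_s: float = 600.0) -> tuple[int, int]:
    """Return (batch_size, chunk_size) for the current elapsed time."""
    frac = elapsed_s / budget_s
    if frac < 0.3:
        return 64, 64      # stage 1: small batches, short chunks
    if frac < 0.7:
        return 128, 128    # stage 2
    return 192, 256        # stage 3: large batches, long chunks

start = time.monotonic()
batch_size, chunk_size = curriculum(time.monotonic() - start)
```

The appeal of keying the schedule on elapsed time rather than step count is that it degrades gracefully under the strict 10-minute wall-clock budget regardless of hardware speed.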

Reproducibility

  • Artifact Size: Under 16MB (zlib + int8 quantization).
  • Training Time: ~10 minutes on 8xH100.
  • Logs: 3 full runs provided in records/track_10min_16mb/2026-03-26_Pure_Neural_GDN_1.0226/train.log.
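A minimal sketch of the "zlib + int8 quantization" packing mentioned above, assuming a simple symmetric per-tensor scale; the submission's actual layout (per-row scales, header format) may differ.

```python
import zlib
import numpy as np

def pack_tensor(w: np.ndarray) -> tuple[bytes, float]:
    """Quantize a float tensor to int8 with a symmetric scale, then DEFLATE it."""
    scale = float(np.abs(w).max()) / 127.0
    if scale == 0.0:
        scale = 1.0  # avoid divide-by-zero for all-zero tensors
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return zlib.compress(q.tobytes(), level=9), scale

def unpack_tensor(blob: bytes, scale: float, shape: tuple[int, ...]) -> np.ndarray:
    """Inverse of pack_tensor: decompress, reinterpret as int8, dequantize."""
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(shape)
    return q.astype(np.float32) * scale
```

The roundtrip error is bounded by half the scale per element, which is what makes an artifact of this size fit under the 16MB cap without destroying the weights.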

@newjordan

Aaaaand This is my homework for the day. Well done!

@shalyhinpavel
Author

> Aaaaand This is my homework for the day. Well done!

You're welcome!
Please note that there is still a lot of untapped potential here since I launched this on only 1 H100 and deliberately left some space for attempts to improve.

@newjordan

no kidding man.. the update mechanism. I have an entire line of experiments modeled after biology and this is the ticket, it fits them like an absolute glove (theoretically) - you will be tagged in each for the ablations and I would love to see what you do with the concepts. I looked at your linked mycelium network stuff and think along the same lines. We are brute-forcing everything now and it is going to be about nuanced interactions between the layers... anyways, this is what I got from your work so far and totally agree!

@shalyhinpavel
Author

> no kidding man.. the update mechanism. I have an entire line of experiments modeled after biology and this is the ticket, it fits them like an absolute glove (theoretically) - you will be tagged in each for the ablations and I would love to see what you do with the concepts. I looked at your linked mycelium network stuff and think along the same lines. We are brute-forcing everything now and it is going to be about nuanced interactions between the layers... anyways, this is what I got from your work so far and totally agree!

Thank you, I'm glad there are people who see this and understand the essence, hahaha. Indeed, I took a lot of references from biology at both the macro and micro levels. Regarding brute-forcing, I'm sure it's necessary, as optimization is the next and natural part of the evolution of both biological and artificial mechanisms. I'll be in touch!

@newjordan

bro.. you got recurrence to work.. I have a version that's an insane compressor, like insane, but it's not moving my bpb much. there is a chance... very, very exciting research for me for at least 3 or 4 days on this.

manfromnowhere143 added a commit to manfromnowhere143/parameter-golf that referenced this pull request Mar 28, 2026
8L (7 DeltaNet + 1 Attention), 384d, O(n) linear attention.
Base: PR openai#875 (1.0226 BPB). Added: EMA(0.997), cosine warmdown,
per-row int8 + LZMA, proper SentencePiece BPB eval, Score-First TTT.
507 lines, 32KB. Target: sub-1.0 BPB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Eppie

Eppie commented Mar 28, 2026

Looks promising, but you're only evaluating on the first 819,200 tokens of the validation dataset. What bpb do you get if you evaluate against the full validation dataset?

@shalyhinpavel
Author

> Looks promising, but you're only evaluating on the first 819,200 tokens of the validation dataset. What bpb do you get if you evaluate against the full validation dataset?

Hello! I hardcoded the eval loop to 50 batches (~819k tokens) simply to speed up local iterations and ensure the entire script (train + pack + evaluate) stays comfortably within the strict 10-minute wall-clock limit during development.
The variance on FineWeb-10B is usually negligible after the first million tokens, so it serves as an excellent proxy. However, you are absolutely right for the official benchmark rigor. I'll patch the eval loop to consume the entire val.bin file and update the exact BPB numbers shortly. Expect them to be extremely close. Thanks for the code review!

newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 28, 2026
4 bio concepts redesigned for DeltaNet chunk seams:
- Astrocyte: seam controller gates erase/write per chunk activity
- Myelin: Fibonacci-spaced chunk bridges bypass compression bottleneck
- Clonal Selection: top-K specialist state amplification at seams
- Circadian: φ-spaced irrational gate prevents recurrent attractor lock-in

Full ablation ladder C0→C6 targeting <1.06 base BPB, <0.44 ngram9.
Implementation order defined. Not copying PR openai#875 code.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 28, 2026
…0.9984, std=0.1724

Seeds: 42 (0.8104 SW), 300 (0.9578 SW), 1337 (1.2269 SW). Includes unravel A/B
diagnostic scripts from Medusa_II (all variants tied at 1.0047 — checkpoint-level
fragility, not GPTQ config). DeltaNet heads introduce significant cross-seed
variance vs ClownCar (0.00015). Successor to PR openai#990, catalyzed by PR openai#875.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@NoesisGenesis

I went through this submission line by line. There are only 373 lines. The reported 1.0226 BPB is not trustworthy. Every component of the evaluation pipeline is broken, and training itself has two silent bugs that undermine the architecture's core premise.

Eval bugs

  1. BPB is computed as val_loss / (math.log(2) * 3.5), hardcoding 3.5 bytes per token instead of computing actual byte lengths from the sentencepiece piece table. With vocab 1024, the true ratio depends on leading spaces, byte fallbacks, and boundary tokens. The 3.5 constant is wrong.

  2. As @Eppie noted, only the first validation shard is loaded (np.memmap(files[0], ...)), and the eval loop runs for a fixed 50 batches. This scores roughly 819K tokens out of the full validation set.

  3. No byte-level accounting whatsoever: no handling of sentencepiece piece lengths, leading space characters, or boundary token corrections.
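For contrast with the hardcoded 3.5, byte-accounted BPB can be sketched like this; the piece table and special-token handling are illustrative assumptions about a SentencePiece-style vocab, not the repo's actual LUT builder.

```python
import math

def piece_byte_lengths(pieces, specials=("<unk>", "<s>", "</s>")):
    """UTF-8 byte length of each piece's surface form; U+2581 counts as one space byte."""
    lut = []
    for p in pieces:
        if p in specials:
            lut.append(0)  # assumed: special tokens emit no output bytes
        elif p.startswith("\u2581"):
            lut.append(1 + len(p[1:].encode("utf-8")))  # leading space + remainder
        else:
            lut.append(len(p.encode("utf-8")))
    return lut

def bits_per_byte(mean_loss_nats, token_ids, byte_lut):
    """BPB from a mean per-token loss in nats and the ids actually evaluated."""
    byte_count = sum(byte_lut[t] for t in token_ids)
    return mean_loss_nats * len(token_ids) / (math.log(2) * byte_count)
```

With a table like this, the effective bytes-per-token ratio is measured from the evaluated ids themselves rather than fixed at 3.5.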

Training bugs

  1. FastLoader.__init__ also uses np.memmap(files[0], ...) for training data. The model trains on a single shard!

  2. The DeltaNet cross-chunk state is silently dead. The block passes state=state, but DeltaNet's forward signature takes past_key_values, not state. The kwarg lands in **kwargs and is ignored. Cross-chunk recurrence never happens. With chunk sizes of 64/128/256, the DeltaNet operates with a chunk-sized receptive field, defeating the entire purpose of using a recurrent architecture.
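The cost of the dead state kwarg can be seen with a toy delta-rule scan (plain numpy, illustrative; not the fla kernel): threading the final state of one chunk into the next reproduces the full sequential pass, while resetting it at each boundary, which is effectively what the ignored kwarg does, would not.

```python
import numpy as np

def delta_rule(q, k, v, beta, state=None):
    """Sequential delta-rule scan: S <- S - b*(S k)k^T + b*v k^T, o_t = S q_t."""
    d_k, d_v = k.shape[1], v.shape[1]
    S = np.zeros((d_v, d_k)) if state is None else state.copy()
    out = np.empty_like(v)
    for t in range(len(q)):
        S = S - beta[t] * np.outer(S @ k[t], k[t]) + beta[t] * np.outer(v[t], k[t])
        out[t] = S @ q[t]
    return out, S

rng = np.random.default_rng(0)
T, C, d = 8, 4, 3                                  # seq len, chunk size, head dim
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
beta = rng.uniform(0.1, 0.9, T)

full, _ = delta_rule(q, k, v, beta)                # one uninterrupted pass
state, chunks = None, []
for s in range(0, T, C):                           # chunked pass, state carried across
    o, state = delta_rule(q[s:s+C], k[s:s+C], v[s:s+C], beta[s:s+C], state)
    chunks.append(o)
```

Carrying `state` makes the chunked outputs identical to the full scan; dropping it at each boundary gives the chunk-sized receptive field described above.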


The real BPB, trained on all shards with functioning recurrence and evaluated with proper byte-level accounting over the full validation set, is unknown.

The architecture choice is interesting, but the submission needs a ground-up rewrite of the data loading, the evaluation pipeline, and the recurrence plumbing before the number means anything.

newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 28, 2026
Successor to PR openai#990 (ClownCar, 1.1813 BPB). Catalyzed by PR openai#875
(@shalyhinpavel, GDN 1.0226). Adds DELTA_NET_HEADS=4 (chunk_delta_rule),
loop-aware 2-phase GPTQ, late-start EMA (step 4400, decay=0.99).
4 flat + 1 crawler x 4 loops, INST_DIM=32, 8xH100 SXM.

Seeds: 42=0.8104, 300=0.9578, 1337=1.2269 SW BPB. High cross-seed variance
(std=0.1724 vs ClownCar 0.00015) — stabilization ongoing in Medusa_VII.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@shalyhinpavel
Author

> I went through this submission line by line. There are only 373 lines. The reported 1.0226 BPB is not trustworthy. Every component of the evaluation pipeline is broken, and training itself has two silent bugs that undermine the architecture's core premise.
>
> Eval bugs
>
> 1. BPB is computed as val_loss / (math.log(2) * 3.5), hardcoding 3.5 bytes per token instead of computing actual byte lengths from the sentencepiece piece table. With vocab 1024, the true ratio depends on leading spaces, byte fallbacks, and boundary tokens. The 3.5 constant is wrong.
> 2. As @Eppie noted, only the first validation shard is loaded (np.memmap(files[0], ...)), and the eval loop runs for a fixed 50 batches. This scores roughly 819K tokens out of the full validation set.
> 3. No byte-level accounting whatsoever: no handling of sentencepiece piece lengths, leading space characters, or boundary token corrections.
>
> Training bugs
>
> 1. FastLoader.__init__ also uses np.memmap(files[0], ...) for training data. The model trains on a single shard!
> 2. The DeltaNet cross-chunk state is silently dead. The block passes state=state, but DeltaNet's forward signature takes past_key_values, not state. The kwarg lands in **kwargs and is ignored. Cross-chunk recurrence never happens. With chunk sizes of 64/128/256, the DeltaNet operates with a chunk-sized receptive field, defeating the entire purpose of using a recurrent architecture.
>
> The real BPB, trained on all shards with functioning recurrence and evaluated with proper byte-level accounting over the full validation set, is unknown.
>
> The architecture choice is interesting, but the submission needs a ground-up rewrite of the data loading, the evaluation pipeline, and the recurrence plumbing before the number means anything.

Thank you for the rigorous, line-by-line audit. This is exactly the kind of red-teaming I was hoping for when open-sourcing this early MVP, and you are 100% correct on all points:
  • The files[0] hardcode in FastLoader was a brutal oversight, effectively causing the model to overfit on a single shard.
  • The state vs past_key_values kwarg mismatch silently killed the cross-chunk recurrence. The model was operating completely blind beyond the chunk size.
  • The 3.5 BPB constant and 50-batch loop were temporary local-testing crutches that shouldn't have made it into the final submission.
I agree that the reported 1.0226 BPB cannot be treated as an official SOTA given the evaluation and plumbing bugs. I will leave this PR open as a conceptual baseline for the Gated DeltaNet approach, but I concede the leaderboard claim for this specific artifact.
The irony here is that after patching the multi-sharding and correctly passing the recurrent state locally, the architecture's throughput literally doubled and the loss curve steepened significantly (even when constrained to a single A100). The underlying structural thesis holds, but it desperately needed these exact plumbing fixes to breathe properly.
I am currently doing a ground-up rewrite of the evaluation pipeline (incorporating proper SentencePiece byte-accounting) and the recurrence plumbing. I will submit a mathematically and architecturally sound V2 shortly.
I deeply appreciate the deep dive. This level of scrutiny makes the whole competition better.

@newjordan

newjordan commented Mar 29, 2026

I'll double-check my implementation again to make sure my DeltaNet doesn't have the same issues. I do believe the concept to be solid even if the implementations had bugs, so it's just a fine-tuning issue, nothing wrong with the goal. Maybe this can help you, Pavel - Bugs 1–4: we don't have them:

Bug 1 — BPB hardcoding 3.5:
Their code: val_loss / (math.log(2) * 3.5). Ours at line 2489:

    val_bpb = val_loss / math.log(2.0) * (token_count / max(byte_count, 1.0))

We compute actual token→byte ratios from sentencepiece LUTs (base_bytes_lut, has_leading_space_lut, is_boundary_token_lut). CLEAN.

Bug 2 — Single val shard:
Their code: np.memmap(files[0], ...). Ours at line 340:

    tokens = torch.cat([load_data_shard(file) for file in files]).contiguous()

All shards concatenated. CLEAN.

Bug 3 — No byte-level accounting:
We built the full sentencepiece LUT stack and pass it through every eval path. CLEAN.

Bug 4 — Single training shard:
They initialize and never advance. Our TokenStream.take() (lines 552–564) does this:

    while remaining > 0:
        avail = self.tokens.numel() - self.pos
        if avail <= 0:
            self._advance_file()  # advances file_idx % len(files), reloads
            continue
        ...

All shards cycle through. CLEAN.


Bug 5 — DeltaNet state kwarg: FLAGGED

This is the one that could hit us. Our call at line 1445:

    o, new_state = _fla_chunk_delta_rule(
        q=q, k=k, v=v, beta=beta,
        initial_state=state,  # <-- is this the right kwarg name?
        output_final_state=True,
    )
It is initial_state (singular), matching the kernel's signature. We're clean on all five bugs.

Our DeltaNet state threading is correct — loop N's final state is properly seeded into loop N+1 exactly as
designed.

-- Pavel, your contribution was genius, it just had some bugs. Of course I use AI agents for rapid development, and I could be wrong, but I did just smoke-test these claims and the results still hold. DeltaNet is a huge upside and you deserve the clean SOTA submission for this week imo, even though there were bugs. It unlocked a completely volatile system on my end, which means this is just the tip of the iceberg!

@shalyhinpavel
Author

Appreciate the backup, @newjordan! Respect for checking the plumbing on your end. The initial_state kwarg in the raw Triton kernel is definitely the way to go — much cleaner than the high-level wrapper I was messing with in V1.
Keep wrestling with that volatile crawler! I’m currently finalizing a different approach to stabilize the recursion without needing reverse gradient funnels (focusing on the forward-pass readout dynamics instead). Will drop V2/V3 as soon as my compute clears up. Let's push this sub-1.00 frontier together! 🤝

@MatoTeziTanka

Community Review — Gated DeltaNet (GDN), Pure Neural

BPB: 1.0226 (claimed, withdrawn by author) | Seeds: 3 (42 / 1337 / 2024) | Artifact: ~14.1 MB claimed | Compliance: CONCEDED BROKEN by author

What this does: 8 layers of fla.layers.delta_net.DeltaNet (Gated Linear Attention) + 1 standard causal self-attention block on top, n_embd=384, n_head=6, vocab=1024, SP1024 tokenizer. Pure neural — no TTT, no n-gram cache, no SLOT, no eval-time mixers. Trains with AdamW (lr 1.8e-3, betas 0.9/0.95), a three-stage "golden ladder" batch/chunk curriculum (B=64/T=64 → B=128/T=128 → B=192/T=256), int8 row-scale quantization + DEFLATE zip. Declares fla-core==0.4.2, flash-linear-attention==0.4.2, transformers==4.44.2 in records/track_10min_16mb/2026-03-26_Pure_Neural_GDN_1.0226/requirements.txt.

What I found in the code (records/track_10min_16mb/2026-03-26_Pure_Neural_GDN_1.0226/train_gpt.py):

  1. BPB constant is hardcoded, line 228: val_bpb = val_loss / (math.log(2) * 3.5). The submission scores val_loss / (ln 2 · 3.5) regardless of the tokenizer's actual token→byte ratio. Per the README and Issue #897 ("BUG: bpb underestimated when tokenizer does not contain U+2581 (ie the space) token"), BPB must come from real SentencePiece piece-length accounting (base_bytes, leading-space, boundary-token corrections); 3.5 is arbitrary and has no relation to the fineweb_1024_bpe vocab.

  2. FastLoader trains on a single training shard, line 177: self.data = np.memmap(files[0], dtype=np.uint16, mode='r'). Only the lexicographically first fineweb_train_*.bin file is ever opened. All 2,320 training steps over 3 seeds drew from the same ~100M-token slice instead of the full 10B FineWeb split.

  3. Eval loads a single val shard and runs for a fixed 50 batches × 16 × 1024 ≈ 819,200 tokens, lines 198 & 224: val_path = os.path.join(base_data_dir, "fineweb_val_000000.bin") and for i in range(50): .... Evaluation never touches fineweb_val_000001.bin or beyond, and never runs a full sliding-window stride pass over the held-out set. @Eppie flagged the 50-batch cap upstream; it's still in the committed artifact.

  4. Cross-chunk DeltaNet state is dropped silently, lines 68–72: GatedDeltaBlock.forward calls self.delta_net(self.ln_1(x), state=state). fla.layers.delta_net.DeltaNet.forward (fla-core 0.4.x) takes hidden_states, attention_mask, past_key_values, use_cache, output_attentions — there is no state kwarg. state lands in **kwargs and is ignored. The outer loop in main() (lines 256–259) then calls .detach() on whatever is returned from new_states, but nothing recurrent is ever threaded in. With chunk_size ∈ {64, 128, 256} and seq_len = 1024, the recurrent state never crosses chunk boundaries — the defining property of a linear-attention RNN is absent from every forward pass of training and eval.

  5. Targets and inputs are clipped to the first 1024 ids, lines 184 and 221: np.clip(buf, 0, 1023, out=buf). The fineweb_1024_bpe tokenizer does emit ids in [0, 1023], so this is a no-op on the intended data, but it would silently corrupt any shard using a larger vocab — worth flagging as a latent footgun rather than a current bug.

  6. Roundtrip load uses strict=False, line 218: m.load_state_dict(new_sd, strict=False). Any key mismatch between the quantized state dict and the model will be swallowed. Given the custom wte/lm_head tying logic and the int8 dequant path, a silent partial load can't be ruled out.

  7. The log file numbers are not the "1.0226" BPB. records/.../train.log prints Final BPB: 1.0288 / 1.0198 / 1.0194, averaging 1.0227, but each of those is the val_loss / (ln 2 · 3.5) output from the broken official_judge path above. They are not verified by any compliant evaluator.
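Findings 2 and 3 both reduce to the loader never advancing past files[0]. A hedged sketch of shard cycling is below; the class name and file pattern are illustrative, and only the np.memmap(..., dtype=np.uint16) detail comes from the review.

```python
import glob
import numpy as np

class ShardCycler:
    """Read token ids across every shard, wrapping around, instead of pinning files[0]."""

    def __init__(self, pattern: str):
        self.files = sorted(glob.glob(pattern))
        assert self.files, f"no shards match {pattern}"
        self.idx = 0
        self.pos = 0
        self.data = np.memmap(self.files[0], dtype=np.uint16, mode="r")

    def take(self, n: int) -> np.ndarray:
        out = []
        while n > 0:
            avail = len(self.data) - self.pos
            if avail <= 0:  # shard exhausted: advance, wrapping modulo len(files)
                self.idx = (self.idx + 1) % len(self.files)
                self.data = np.memmap(self.files[self.idx], dtype=np.uint16, mode="r")
                self.pos = 0
                continue
            m = min(n, avail)
            out.append(np.array(self.data[self.pos:self.pos + m]))
            self.pos += m
            n -= m
        return np.concatenate(out)
```

The same wrap-around loop works for the eval path: iterating until the concatenated shard stream is exhausted replaces the fixed 50-batch cap.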

Smoke test (CT2038 proteus-engine, CPU, 2026-04-11): IMPORT_FAIL error=ModuleNotFoundError("No module named 'fla'"). The script has a hard import of fla.layers.delta_net.DeltaNet at module top (line 7), so any eval environment without flash-linear-attention / fla-core 0.4.2 installed cannot even parse this submission. This is consistent with the PR's requirements.txt; flagging it only so reviewers know the container image will need those wheels for any re-run.

Questions / flags:

  • Per Issue #897 ("BUG: bpb underestimated when tokenizer does not contain U+2581 (ie the space) token") and the README's tokenizer-agnostic BPB definition, the hardcoded 3.5 bytes-per-token constant means the reported number is not comparable to any other leaderboard entry.
  • Per the README's validation protocol, evaluation must consume the full fineweb_val split with proper token→byte accounting. The 50-batch / single-shard loop here fails both conditions.
  • Per the architecture's own premise, cross-chunk recurrence via DeltaNet state is the whole point of using a linear-attention RNN at this parameter count. The kwarg-name bug (state= vs past_key_values=) means the recurrence was never active during the reported runs.

Author response: In comment #4149774501, @shalyhinpavel concedes all four bugs (single-shard train, single-shard eval, 3.5 BPB constant, dead DeltaNet state) and explicitly withdraws the leaderboard claim for this artifact: "I concede the leaderboard claim for this specific artifact ... I will submit a mathematically and architecturally sound V2 shortly." The author deserves credit for the fast, unambiguous concession.

Verdict: COMPLIANCE FLAG — reported BPB does not reflect the claimed architecture and does not use compliant byte accounting. Author has already conceded the record claim.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica:

  • CLOSE (or convert to non-record / draft) — the 1.0226 number cannot stand as a leaderboard entry given (a) hardcoded 3.5 BPB constant, (b) single-shard train and eval, (c) DeltaNet recurrence dead at the kwarg level, and (d) author's own concession in #4149774501. The Gated DeltaNet direction is genuinely interesting as a pure-neural baseline and worth encouraging in a V2 PR that rewires the recurrence (past_key_values= per fla-core 0.4.x), consumes all train/val shards, and uses the repo's standard build_sentencepiece_luts byte accounting.

Reviewed by @MatoTeziTanka / The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL — module fla (flash-linear-attention 0.4.2) not present in the smoke environment; this is a hard dependency declared in the PR's requirements.txt, not a script bug. AI tooling: review drafted with Claude Code (Sonnet/Opus) using an internal review template; all citations, file paths, and compliance audits were verified against the PR's actual code at SHA 90d5115e4071a2538e05dafeb6e97c9490d0afe8.
