
New Record: Pure Neural GDN 1.0226 BPB (shalyhinpavel)#875

Open
shalyhinpavel wants to merge 1 commit into openai:main from shalyhinpavel:main

Conversation

@shalyhinpavel

Summary

New SOTA record using Gated DeltaNet (GDN) architecture, achieving 1.0226 BPB (average over 3 runs). This is a pure neural approach without any test-time training (TTT) or external caching.

Key Changes

  • Architecture: Replaced standard Attention with GatedDeltaBlock (Gated DeltaNet) layers.
  • Optimization: Implemented a dynamic batch size and chunk size curriculum based on elapsed time.
  • Data Loading: Switched to FastLoader with non-blocking prefetching and pinning.
  • Efficiency: Achieved a significant improvement (~0.0968 BPB) over the previous leaderboard record.
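The time-based curriculum itself isn't shown in this thread; as a minimal sketch of the idea (thresholds and sizes below are illustrative, not the PR's actual values), batch and chunk sizes can step up as wall-clock time elapses:

```python
import time

# Illustrative schedule; the PR's actual thresholds are not shown in this thread.
CURRICULUM = [
    # (elapsed_seconds, batch_size, chunk_size)
    (0, 8, 64),
    (120, 16, 128),
    (360, 32, 256),
]

def schedule(start_time, now=None):
    """Return (batch_size, chunk_size) for the current elapsed wall-clock time."""
    elapsed = (now if now is not None else time.time()) - start_time
    batch, chunk = CURRICULUM[0][1], CURRICULUM[0][2]
    for threshold, b, c in CURRICULUM:  # entries are sorted by threshold
        if elapsed >= threshold:
            batch, chunk = b, c
    return batch, chunk
```

The training loop would call `schedule(start_time)` every step and rebuild its dataloader state whenever the returned sizes change.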

Reproducibility

  • Artifact Size: Under 16MB (zlib + int8 quantization).
  • Training Time: ~10 minutes on 8xH100.
  • Logs: 3 full runs provided in records/track_10min_16mb/2026-03-26_Pure_Neural_GDN_1.0226/train.log.
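The packing step is also not shown in the thread; a minimal sketch of the named recipe (per-tensor symmetric int8 quantization followed by zlib), with hypothetical helper names, might look like:

```python
import zlib
import numpy as np

def pack_weights(tensors):
    """Per-tensor symmetric int8 quantization followed by zlib compression.

    `tensors` maps name -> float32 ndarray. Returns (blob, scales); the
    per-tensor scale is needed to dequantize at load time. Hypothetical
    sketch, not the PR's actual packer.
    """
    scales, chunks = {}, []
    for name, w in tensors.items():
        scale = float(np.abs(w).max()) / 127.0 or 1.0  # avoid zero scale
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        scales[name] = scale
        chunks.append(q.tobytes())
    blob = zlib.compress(b"".join(chunks), level=9)
    return blob, scales

def unpack_one(raw_int8, scale, shape):
    """Dequantize one int8 buffer back to float32."""
    return np.frombuffer(raw_int8, dtype=np.int8).astype(np.float32).reshape(shape) * scale
```

Reconstruction error is bounded by half the quantization step per weight, which is what makes the int8 + zlib combination viable under the 16MB cap.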

@newjordan

Aaaaand This is my homework for the day. Well done!

@shalyhinpavel
Author

> Aaaaand This is my homework for the day. Well done!

You're welcome!
Please note that there is still a lot of untapped potential here since I launched this on only 1 H100 and deliberately left some space for attempts to improve.

@newjordan

no kidding man.. the update mechanism. I have an entire line of experiments modeled after biology, and this is the ticket: it fits them like an absolute glove (theoretically). You will be tagged in each for the ablations, and I would love to see what you do with the concepts. I looked at your linked mycelium network stuff and think along the same lines. We are brute-forcing everything now, and it is going to be about nuanced interactions between the layers... anyways, this is what I got from your work so far, and I totally agree!

@shalyhinpavel
Author

> no kidding man.. the update mechanism. I have an entire line of experiments modeled after biology, and this is the ticket: it fits them like an absolute glove (theoretically). You will be tagged in each for the ablations, and I would love to see what you do with the concepts. I looked at your linked mycelium network stuff and think along the same lines. We are brute-forcing everything now, and it is going to be about nuanced interactions between the layers... anyways, this is what I got from your work so far, and I totally agree!

Thank you, I'm glad there are people who see this and understand the essence, hahaha. Indeed, I took a lot of references from biology at both the macro and micro levels. Regarding brute-forcing, I'm sure it's necessary, as optimization is the next and natural part of the evolution of both biological and artificial mechanisms. I'll be in touch!

@newjordan

bro.. you got recurrence to work.. I have a version that's an insane compressor, like insane, but it's not moving my bpb much. There is a chance... very very exciting research for me for at least 3 or 4 days on this.

manfromnowhere143 added a commit to manfromnowhere143/parameter-golf that referenced this pull request Mar 28, 2026
8L (7 DeltaNet + 1 Attention), 384d, O(n) linear attention.
Base: PR openai#875 (1.0226 BPB). Added: EMA(0.997), cosine warmdown,
per-row int8 + LZMA, proper SentencePiece BPB eval, Score-First TTT.
507 lines, 32KB. Target: sub-1.0 BPB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Eppie

Eppie commented Mar 28, 2026

Looks promising, but you're only evaluating on the first 819,200 tokens of the validation dataset. What bpb do you get if you evaluate against the full validation dataset?

@shalyhinpavel
Author

> Looks promising, but you're only evaluating on the first 819,200 tokens of the validation dataset. What bpb do you get if you evaluate against the full validation dataset?

Hello! I hardcoded the eval loop to 50 batches (~819k tokens) simply to speed up local iterations and ensure the entire script (train + pack + evaluate) stays comfortably within the strict 10-minute wall-clock limit during development.
The variance on FineWeb-10B is usually negligible after the first million tokens, so it serves as an excellent proxy. However, you are absolutely right that the official benchmark demands full rigor. I'll patch the eval loop to consume the entire val.bin file and update the exact BPB numbers shortly. I expect them to be extremely close. Thanks for the code review!

newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 28, 2026
4 bio concepts redesigned for DeltaNet chunk seams:
- Astrocyte: seam controller gates erase/write per chunk activity
- Myelin: Fibonacci-spaced chunk bridges bypass compression bottleneck
- Clonal Selection: top-K specialist state amplification at seams
- Circadian: φ-spaced irrational gate prevents recurrent attractor lock-in

Full ablation ladder C0→C6 targeting <1.06 base BPB, <0.44 ngram9.
Implementation order defined. Not copying PR openai#875 code.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 28, 2026
…0.9984, std=0.1724

Seeds: 42 (0.8104 SW), 300 (0.9578 SW), 1337 (1.2269 SW). Includes unravel A/B
diagnostic scripts from Medusa_II (all variants tied at 1.0047 — checkpoint-level
fragility, not GPTQ config). DeltaNet heads introduce significant cross-seed
variance vs ClownCar (0.00015). Successor to PR openai#990, catalyzed by PR openai#875.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@NoesisGenesis

I went through this submission line by line. There are only 373 lines. The reported 1.0226 BPB is not trustworthy. Every component of the evaluation pipeline is broken, and training itself has two silent bugs that undermine the architecture's core premise.

Eval bugs

  1. BPB is computed as val_loss / (math.log(2) * 3.5), hardcoding 3.5 bytes per token instead of computing actual byte lengths from the sentencepiece piece table. With vocab 1024, the true ratio depends on leading spaces, byte fallbacks, and boundary tokens. The 3.5 constant is wrong.

  2. As @Eppie noted, only the first validation shard is loaded (np.memmap(files[0], ...)), and the eval loop runs for a fixed 50 batches. This scores roughly 819K tokens out of the full validation set.

  3. No byte-level accounting whatsoever: no handling of sentencepiece piece lengths, leading space characters, or boundary token corrections.
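For contrast with bug 1, a byte-accurate BPB computation needs only a per-token byte-length table; a minimal sketch (the `byte_len_lut` table here is hypothetical, standing in for lengths derived from the sentencepiece piece inventory):

```python
import math

def bits_per_byte(total_loss_nats, token_ids, byte_len_lut):
    """Convert a summed cross-entropy (in nats, over len(token_ids) tokens)
    into bits per byte using actual per-token byte lengths.

    byte_len_lut[t] is the UTF-8 byte length of token t's surface form,
    built once from the sentencepiece piece table, rather than a
    hardcoded 3.5 bytes/token.
    """
    total_bytes = sum(byte_len_lut[t] for t in token_ids)
    return total_loss_nats / (math.log(2) * total_bytes)
```

Note that if the training loop reports a mean per-token loss, the summed loss is `mean_loss * len(token_ids)`; dividing by actual bytes rather than an assumed tokens-to-bytes constant is the whole correction.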

Training bugs

  1. FastLoader.__init__ also uses np.memmap(files[0], ...) for training data. The model trains on a single shard!

  2. The DeltaNet cross-chunk state is silently dead. The block passes state=state, but DeltaNet's forward signature takes past_key_values, not state. The kwarg lands in **kwargs and is ignored. Cross-chunk recurrence never happens. With chunk sizes of 64/128/256, the DeltaNet operates with a chunk-sized receptive field, defeating the entire purpose of using a recurrent architecture.
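The failure mode in training bug 2 is easy to reproduce in isolation; a minimal sketch with a stand-in forward rather than DeltaNet's actual one:

```python
def forward(x, past_key_values=None, **kwargs):
    """Stand-in for a DeltaNet-style forward: recurrence only works if the
    caller uses the exact kwarg name; anything else vanishes into **kwargs."""
    state = past_key_values if past_key_values is not None else 0
    return x + state  # output carries the recurrent state

# Correct kwarg name: the state threads through.
out_ok = forward(1, past_key_values=10)   # -> 11

# Misnamed kwarg: silently swallowed by **kwargs, no error, no recurrence.
out_bug = forward(1, state=10)            # -> 1
```

Because `**kwargs` accepts anything, the misnamed call raises no error, which is exactly why the dead recurrence went unnoticed.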


The real BPB, trained on all shards with functioning recurrence and evaluated with proper byte-level accounting over the full validation set, is unknown.

The architecture choice is interesting, but the submission needs a ground-up rewrite of the data loading, the evaluation pipeline, and the recurrence plumbing before the number means anything.

newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 28, 2026
Successor to PR openai#990 (ClownCar, 1.1813 BPB). Catalyzed by PR openai#875
(@shalyhinpavel, GDN 1.0226). Adds DELTA_NET_HEADS=4 (chunk_delta_rule),
loop-aware 2-phase GPTQ, late-start EMA (step 4400, decay=0.99).
4 flat + 1 crawler x 4 loops, INST_DIM=32, 8xH100 SXM.

Seeds: 42=0.8104, 300=0.9578, 1337=1.2269 SW BPB. High cross-seed variance
(std=0.1724 vs ClownCar 0.00015) — stabilization ongoing in Medusa_VII.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@shalyhinpavel
Author

> I went through this submission line by line. There are only 373 lines. The reported 1.0226 BPB is not trustworthy. Every component of the evaluation pipeline is broken, and training itself has two silent bugs that undermine the architecture's core premise.
>
> Eval bugs
>
> 1. BPB is computed as val_loss / (math.log(2) * 3.5), hardcoding 3.5 bytes per token instead of computing actual byte lengths from the sentencepiece piece table. With vocab 1024, the true ratio depends on leading spaces, byte fallbacks, and boundary tokens. The 3.5 constant is wrong.
> 2. As @Eppie noted, only the first validation shard is loaded (np.memmap(files[0], ...)), and the eval loop runs for a fixed 50 batches. This scores roughly 819K tokens out of the full validation set.
> 3. No byte-level accounting whatsoever: no handling of sentencepiece piece lengths, leading space characters, or boundary token corrections.
>
> Training bugs
>
> 1. FastLoader.__init__ also uses np.memmap(files[0], ...) for training data. The model trains on a single shard!
> 2. The DeltaNet cross-chunk state is silently dead. The block passes state=state, but DeltaNet's forward signature takes past_key_values, not state. The kwarg lands in **kwargs and is ignored. Cross-chunk recurrence never happens. With chunk sizes of 64/128/256, the DeltaNet operates with a chunk-sized receptive field, defeating the entire purpose of using a recurrent architecture.
>
> The real BPB, trained on all shards with functioning recurrence and evaluated with proper byte-level accounting over the full validation set, is unknown.
>
> The architecture choice is interesting, but the submission needs a ground-up rewrite of the data loading, the evaluation pipeline, and the recurrence plumbing before the number means anything.

Thank you for the rigorous, line-by-line audit. This is exactly the kind of red-teaming I was hoping for when open-sourcing this early MVP, and you are 100% correct on all points:
  • The files[0] hardcode in FastLoader was a brutal oversight, effectively causing the model to overfit on a single shard.
  • The state vs past_key_values kwarg mismatch silently killed the cross-chunk recurrence. The model was operating completely blind beyond the chunk size.
  • The 3.5 BPB constant and 50-batch loop were temporary local-testing crutches that shouldn't have made it into the final submission.
I agree that the reported 1.0226 BPB cannot be treated as an official SOTA given the evaluation and plumbing bugs. I will leave this PR open as a conceptual baseline for the Gated DeltaNet approach, but I concede the leaderboard claim for this specific artifact.
The irony here is that after patching the multi-sharding and correctly passing the recurrent state locally, the architecture's throughput literally doubled and the loss curve steepened significantly (even when constrained to a single A100). The underlying structural thesis holds, but it desperately needed these exact plumbing fixes to breathe properly.
I am currently doing a ground-up rewrite of the evaluation pipeline (incorporating proper SentencePiece byte-accounting) and the recurrence plumbing. I will submit a mathematically and architecturally sound V2 shortly.
I deeply appreciate the deep dive. This level of scrutiny makes the whole competition better.
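For illustration, "correctly passing the recurrent state" means threading each chunk's final state into the next call; a minimal sketch, with a toy step function standing in for the DeltaNet kernel:

```python
def run_chunks(tokens, chunk_size, step):
    """Drive a chunked recurrent block so state survives chunk seams.

    step(chunk, state) -> (output, new_state) stands in for the kernel call
    (e.g. seeding its initial state and reading back its final state); the
    point is only that state is threaded, never dropped, across chunks.
    """
    outputs, state = [], None
    for i in range(0, len(tokens), chunk_size):
        out, state = step(tokens[i:i + chunk_size], state)
        outputs.append(out)
    return outputs, state

# Toy step: "state" is a running sum, so the recurrence is observable.
def toy_step(chunk, state):
    prev = state or 0
    return [x + prev for x in chunk], prev + sum(chunk)
```

With this threading in place, the second chunk's outputs depend on the first chunk, which is precisely what the `state=` kwarg bug was silently preventing.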

@newjordan

newjordan commented Mar 29, 2026

I'll double-check my implementation again to make sure my DeltaNet doesn't have the same issues. I do believe the concept is solid even if the implementations had bugs, so it's just a fine-tuning issue; nothing wrong with the goal. Maybe this can help you, Pavel. Bugs 1–4: we don't have them.

Bug 1 — BPB hardcoding 3.5:
Their code: val_loss / (math.log(2) * 3.5). Ours at line 2489:
    val_bpb = val_loss / math.log(2.0) * (token_count / max(byte_count, 1.0))
We compute actual token→byte ratios from sentencepiece LUTs (base_bytes_lut, has_leading_space_lut, is_boundary_token_lut). CLEAN.

Bug 2 — Single val shard:
Their code: np.memmap(files[0], ...). Ours at line 340:
tokens = torch.cat([load_data_shard(file) for file in files]).contiguous()
All shards concatenated. CLEAN.

Bug 3 — No byte-level accounting:
We built the full sentencepiece LUT stack and pass it through every eval path. CLEAN.

Bug 4 — Single training shard:
They initialize and never advance. Our TokenStream.take() (line 552–564) does this:
    while remaining > 0:
        avail = self.tokens.numel() - self.pos
        if avail <= 0:
            self._advance_file()  # advances file_idx % len(files), reloads
            continue
        ...
All shards cycle through. CLEAN.


Bug 5 — DeltaNet state kwarg: FLAGGED

This is the one that could hit us. Our call at line 1445:
    o, new_state = _fla_chunk_delta_rule(
        q=q, k=k, v=v, beta=beta,
        initial_state=state,  # <-- is this the right kwarg name?
        output_final_state=True,
    )

It is initial_state (singular), which matches the kernel's signature. We're clean on all five bugs.

Our DeltaNet state threading is correct — loop N's final state is properly seeded into loop N+1 exactly as
designed.

-- Pavel, your contribution was genius; it just had some bugs. Of course I use AI agents for rapid development, and I could be wrong, but I did just smoke-test these claims and the results still hold. DeltaNet is a huge upside, and you deserve the clean SOTA submission for this week imo, even though there were bugs. It unlocked a completely volatile system on my end, which means this is just the tip of the iceberg!

@shalyhinpavel
Author

Appreciate the backup, @newjordan! Respect for checking the plumbing on your end. The initial_state kwarg in the raw Triton kernel is definitely the way to go — much cleaner than the high-level wrapper I was messing with in V1.
Keep wrestling with that volatile crawler! I’m currently finalizing a different approach to stabilize the recursion without needing reverse gradient funnels (focusing on the forward-pass readout dynamics instead). Will drop V2/V3 as soon as my compute clears up. Let's push this sub-1.00 frontier together! 🤝
