New Record: Pure Neural GDN 1.0226 BPB (shalyhinpavel) #875
shalyhinpavel wants to merge 1 commit into openai:main
Conversation
Aaaaand this is my homework for the day. Well done!
You're welcome!
no kidding man.. the update mechanism. I have an entire line of experiments modeled after biology, and this is the ticket: it fits them like an absolute glove (theoretically). You will be tagged in each for the ablations, and I would love to see what you do with the concepts. I looked at your linked mycelium network stuff and think along the same lines. We are brute-forcing everything now, and it is going to be about nuanced interactions between the layers... anyways, this is what I got from your work so far and totally agree!
Thank you, I'm glad there are people who see this and understand the essence, hahaha. Indeed, I took a lot of references from biology at both the macro and micro levels. Regarding brute-forcing, I'm sure it's necessary, as optimization is the next and natural part of the evolution of both biological and artificial mechanisms. I'll be in touch!
bro.. you got recurrence to work.. I have a version that's an insane compressor, like insane, but it's not moving my bpb much. there is a chance... very very exciting research for me for at least 3 or 4 days on this.
8L (7 DeltaNet + 1 Attention), 384d, O(n) linear attention. Base: PR openai#875 (1.0226 BPB). Added: EMA(0.997), cosine warmdown, per-row int8 + LZMA, proper SentencePiece BPB eval, Score-First TTT. 507 lines, 32KB. Target: sub-1.0 BPB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
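The "per-row int8 + LZMA" step in the commit above can be sketched roughly as follows. This is my own minimal illustration assuming symmetric per-row absmax scaling; the function names and the error-bound check are mine, not code from the PR.

```python
import lzma
import numpy as np

def quantize_per_row_int8(w: np.ndarray):
    """Symmetric per-row absmax quantization to int8 (illustrative sketch)."""
    # One scale per row, so a single outlier row does not inflate
    # the quantization error of every other row.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def pack(q: np.ndarray) -> bytes:
    # int8 weight rows are highly repetitive, so LZMA compresses them well.
    return lzma.compress(q.tobytes())

w = np.random.default_rng(0).normal(size=(384, 384)).astype(np.float32)
q, scale = quantize_per_row_int8(w)
w_hat = q.astype(np.float32) * scale          # dequantize
err = np.abs(w - w_hat).max()                 # bounded by scale/2 per row
blob = pack(q)
```

Rounding to the nearest int8 bounds the per-element reconstruction error by half the row's scale, which is the usual argument for why per-row (rather than per-tensor) scaling keeps checkpoint quality acceptable.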
Looks promising, but you're only evaluating on the first 819,200 tokens of the validation dataset. What BPB do you get if you evaluate against the full validation dataset?
Hello! I hardcoded the eval loop to 50 batches (~819k tokens) simply to speed up local iterations and ensure the entire script (train + pack + evaluate) stays comfortably within the strict 10-minute wall-clock limit during development. |
4 bio concepts redesigned for DeltaNet chunk seams:
- Astrocyte: seam controller gates erase/write per chunk activity
- Myelin: Fibonacci-spaced chunk bridges bypass compression bottleneck
- Clonal Selection: top-K specialist state amplification at seams
- Circadian: φ-spaced irrational gate prevents recurrent attractor lock-in

Full ablation ladder C0→C6 targeting <1.06 base BPB, <0.44 ngram9. Implementation order defined. Not copying PR openai#875 code. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
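Two of these concepts have simple numeric skeletons. Below is a hedged sketch of what "Fibonacci-spaced" bridge offsets and a "φ-spaced irrational gate" phase could look like; the commit does not show its code, so both functions are my own guess at the arithmetic, not the branch's implementation.

```python
import math

def fibonacci_offsets(max_chunks: int):
    """Chunk-seam bridge offsets at Fibonacci spacings: 1, 2, 3, 5, 8, ..."""
    offsets, a, b = [], 1, 2
    while a <= max_chunks:
        offsets.append(a)
        a, b = b, a + b
    return offsets

PHI = (1 + math.sqrt(5)) / 2  # golden ratio

def phi_gate_phase(step: int) -> float:
    """Low-discrepancy gate phase in [0, 1): frac(step * (phi - 1)).
    Because the step size is irrational, the phase sequence never
    repeats periodically, which is presumably what "prevents recurrent
    attractor lock-in" refers to in the commit message."""
    return (step * (PHI - 1)) % 1.0
```

The golden-ratio step is the classic choice here because frac(n·φ) is the most uniformly spread irrational rotation sequence, so any periodic gating schedule is avoided by construction.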
…0.9984, std=0.1724 Seeds: 42 (0.8104 SW), 300 (0.9578 SW), 1337 (1.2269 SW). Includes unravel A/B diagnostic scripts from Medusa_II (all variants tied at 1.0047 — checkpoint-level fragility, not GPTQ config). DeltaNet heads introduce significant cross-seed variance vs ClownCar (0.00015). Successor to PR openai#990, catalyzed by PR openai#875. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
I went through this submission line by line. There are only 373 lines. The reported 1.0226 BPB is not trustworthy. Every component of the evaluation pipeline is broken, and training itself has two silent bugs that undermine the architecture's core premise.

Eval bugs
Training bugs
The real BPB, trained on all shards with functioning recurrence and evaluated with proper byte-level accounting over the full validation set, is unknown. The architecture choice is interesting, but the submission needs a ground-up rewrite of the data loading, the evaluation pipeline, and the recurrence plumbing before the number means anything.
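For reference, "proper byte-level accounting" means summing token NLL (in nats) over the entire validation set and dividing by the total UTF-8 byte count, not the token count. A minimal sketch; `model_nll` and `num_utf8_bytes` are placeholder names for illustration, not the submission's API:

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """BPB = summed NLL converted from nats to bits, divided by raw bytes."""
    return total_nll_nats / (math.log(2) * total_bytes)

# The accumulation must cover the full validation set:
# for batch in all_validation_batches:     # no early break after 50 batches
#     total_nll += model_nll(batch)        # summed (not mean) NLL in nats
#     total_bytes += batch.num_utf8_bytes  # bytes of the decoded text
```

Dividing by bytes rather than tokens is what makes the metric tokenizer-independent; a mean-per-token loss divided by a hardcoded bytes-per-token constant is exactly the kind of shortcut the review flags.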
Successor to PR openai#990 (ClownCar, 1.1813 BPB). Catalyzed by PR openai#875 (@shalyhinpavel, GDN 1.0226). Adds DELTA_NET_HEADS=4 (chunk_delta_rule), loop-aware 2-phase GPTQ, late-start EMA (step 4400, decay=0.99). 4 flat + 1 crawler x 4 loops, INST_DIM=32, 8xH100 SXM. Seeds: 42=0.8104, 300=0.9578, 1337=1.2269 SW BPB. High cross-seed variance (std=0.1724 vs ClownCar 0.00015) — stabilization ongoing in Medusa_VII. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
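A quick sanity check on the headline numbers in this commit message: the reported std of 0.1724 only matches the three seed results if it is a population (not sample) standard deviation.

```python
import statistics

seed_bpb = {42: 0.8104, 300: 0.9578, 1337: 1.2269}  # SW BPB per seed
mean = statistics.fmean(seed_bpb.values())
pstd = statistics.pstdev(seed_bpb.values())          # population std (divides by n)
# mean rounds to 0.9984 and pstd to 0.1724, as stated in the commit;
# statistics.stdev (sample std, divides by n-1) would give ~0.211 instead
```

With n=3 the distinction matters, so anyone comparing against ClownCar's 0.00015 should use the same estimator.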
Thank you for the rigorous, line-by-line audit. This is exactly the kind of red-teaming I was hoping for when open-sourcing this early MVP, and you are 100% correct on all points:
I'll double-check my implementation again to make sure my DeltaNet doesn't have the same issues. I do believe the concept is solid even if the implementations had bugs, so it's just a fine-tuning issue; nothing wrong with the goal. Maybe this can help you, Pavel:

Bugs 1–4: We don't have them.
Bug 1 — BPB hardcoding 3.5:
Bug 2 — Single val shard:
Bug 3 — No byte-level accounting:
Bug 4 — Single training shard:
Bug 5 — DeltaNet state kwarg: FLAGGED. This is the one that could hit us. Our call at line 1445: -- initial_state (singular).

We're clean on all five bugs. Our DeltaNet state threading is correct — loop N's final state is properly seeded into loop N+1 exactly as --

Pavel, your contribution was genius; it just had some bugs. Ofc I use AI agents for rapid development, and I could be wrong, but I did just smoke-test these claims and the results still hold. DeltaNet is a huge upside and you deserve the clean SOTA submission for this week imo, even though there were bugs. It unlocked a completely volatile system on my end, which means this is just the tip of the iceberg!
Appreciate the backup, @newjordan! Respect for checking the plumbing on your end. The initial_state kwarg in the raw Triton kernel is definitely the way to go — much cleaner than the high-level wrapper I was messing with in V1.
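To make the state-threading point concrete, here is a toy delta-rule recurrence in NumPy. It shows why chunked processing with an explicitly carried initial_state must reproduce a single full-sequence pass, which is the invariant the initial_state kwarg is supposed to preserve. This is my own sketch of the recurrence S_t = S_{t-1}(I - b k kᵀ) + b v kᵀ, not the Triton kernel's code, and it reads out S k as a stand-in for the real query-side readout.

```python
import numpy as np

def delta_rule(keys, values, betas, initial_state=None):
    """Sequential delta rule: S <- S (I - beta k k^T) + beta v k^T.
    Returns per-step readouts S k and the final state for threading."""
    d = keys.shape[1]
    S = np.zeros((d, d)) if initial_state is None else initial_state.copy()
    outs = []
    for k, v, b in zip(keys, values, betas):
        S = S @ (np.eye(d) - b * np.outer(k, k)) + b * np.outer(v, k)
        outs.append(S @ k)
    return np.stack(outs), S

rng = np.random.default_rng(0)
K, V = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
B = rng.uniform(0.1, 0.9, size=8)

full, _ = delta_rule(K, V, B)                # one pass over all 8 steps
o1, s1 = delta_rule(K[:4], V[:4], B[:4])     # chunk 1 from zero state
o2, _ = delta_rule(K[4:], V[4:], B[4:], initial_state=s1)  # chunk 2 seeded
chunked = np.concatenate([o1, o2])
# chunked matches full only because chunk 2 received chunk 1's final state;
# dropping initial_state silently resets S to zero at every seam
```

Dropping the kwarg does not crash, which is why the review calls this class of bug "silent": outputs stay plausible while recurrence across chunks is gone.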
Summary
New SOTA record using Gated DeltaNet (GDN) architecture, achieving 1.0226 BPB (average over 3 runs). This is a pure neural approach without any test-time training (TTT) or external caching.
Key Changes
- GatedDeltaBlock (Gated DeltaNet) layers.
- FastLoader with non-blocking prefetching and pinning.

Reproducibility
records/track_10min_16mb/2026-03-26_Pure_Neural_GDN_1.0226/train.log.
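The FastLoader item under Key Changes (non-blocking prefetching) can be illustrated with a framework-free double-buffered loader. The real version would additionally pin host memory and issue non-blocking host-to-device copies; this sketch is my own generic version of the pattern, not the PR's code.

```python
import queue
import threading

def prefetching_loader(batches, depth=2):
    """Yield batches while a background thread stays up to `depth` ahead,
    overlapping batch preparation with the consumer's compute step."""
    q = queue.Queue(maxsize=depth)
    DONE = object()  # sentinel marking the end of the stream

    def worker():
        for b in batches:
            q.put(b)      # blocks once `depth` batches are already buffered
        q.put(DONE)

    threading.Thread(target=worker, daemon=True).start()
    while (item := q.get()) is not DONE:
        yield item

out = list(prefetching_loader(range(10)))  # items arrive in source order
```

The bounded queue is the key design choice: it caps memory at `depth` in-flight batches while still hiding data-loading latency behind the training step.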