Nightcrawler — 1.176bpb 10mb #1208

Open
newjordan wants to merge 306 commits into openai:main from newjordan:submission/nightcrawler

Conversation


@newjordan newjordan commented Apr 1, 2026

Nightcrawler

Adds a fifth flat transformer layer on each side of the crawler bottleneck (5F+1C+5F vs 4F+1C+4F), with shared TAP encoder connections to each crawler loop.

Results

| Seed | val_bpb (sliding window) | Steps | Size |
|------|--------------------------|-------|------|
| 444  | 1.17651313 | 7074 | 10048191 B |
| 4    | 1.17676091 | 7074 | 10266138 B |
| 300  | 1.17490448 | 7077 | 10343385 B |
| mean | 1.1761 | | 10343385 B |

Hardware: 8×H100 SXM · 600s wallclock · `bytes_code`: 119294

Reproduce

```bash
SEED=444 NPROC_PER_NODE=8 torchrun --standalone --nproc_per_node=8 \
  records/track_10min_16mb/2026-04-01_Nightcrawler_8xH100/train_gpt.py
```

Octavian and others added 30 commits March 27, 2026 01:33
Phrase cache (PR openai#880 / PR openai#900 — proven +0.1 BPB, legal):
- Variable-length suffix matching at 48/36/28/20/16 token probe lengths
- One ctx+full count table pair per probe length (4M buckets each)
- 48-prime XOR hash — unique prime per context position up to length 48
- Dirichlet smoothing: p=(min(fc,cc)+c*neural)/(ctx+c), c=2.0
- Applied inline after n-gram mixing, before NLL conversion
- Score-first: tables updated with chunk tokens AFTER all scoring done
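The Dirichlet smoothing step above can be sketched as a small function. This is a minimal illustration of the stated formula p=(min(fc,cc)+c*neural)/(ctx+c), not the PR's actual implementation; the function name and argument names are hypothetical.

```python
def dirichlet_smooth(full_count, ctx_count, neural_p, c=2.0):
    """Blend a phrase-cache count estimate with the neural probability.

    p = (min(full_count, ctx_count) + c * neural_p) / (ctx_count + c)

    With zero counts this reduces to neural_p; as ctx_count grows,
    the empirical continuation frequency dominates the mix.
    """
    fc = min(full_count, ctx_count)  # full-phrase count capped by context count
    return (fc + c * neural_p) / (ctx_count + c)
```

Note that c=2.0 acts as a pseudo-count: the neural model contributes as if it had been observed c times in this context.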

RegimeTracker (PR openai#880):
- Tracks match rate + token diversity over rolling 4096-token window
- Adapts effective phrase concentration: repetitive/boilerplate content
  → lower c (more cache trust); novel prose → higher c (more neural trust)
- Multiplier range [0.7, 1.5], effective_c = base_c / mult
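A minimal sketch of the RegimeTracker idea, under the assumption that the multiplier scales linearly with the rolling match rate (the actual mapping in the PR may differ; the class and method names here are hypothetical):

```python
from collections import deque

class RegimeTracker:
    """Sketch: map a rolling phrase-match rate to a concentration multiplier.

    High match rate (repetitive/boilerplate content) -> higher multiplier
    -> lower effective c -> more trust in the phrase cache. Low match rate
    (novel prose) -> lower multiplier -> more trust in the neural model.
    """
    def __init__(self, window=4096, base_c=2.0, lo=0.7, hi=1.5):
        self.hits = deque(maxlen=window)  # rolling window of match outcomes
        self.base_c, self.lo, self.hi = base_c, lo, hi

    def update(self, matched):
        self.hits.append(1.0 if matched else 0.0)

    def effective_c(self):
        rate = sum(self.hits) / max(len(self.hits), 1)
        mult = self.lo + (self.hi - self.lo) * rate  # assumed linear mapping
        return self.base_c / mult
```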

Config improvements:
- WARMDOWN_ITERS=2000 (confirmed best from A/B sweep)
- NGRAM_CHUNK_TOKENS=65536 (PR openai#850, 15x more cache refreshes vs 1M)
- MATRIX_LR=0.03 (PR openai#859)

ARTIFACT_NGRAM=0 remains disabled (legally gray).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Siphon trains on -log(α·p_ngram + (1-α)·p_model) instead of standard CE.
GPU-side bigram hash count tables, zero new params, ~0.3ms overhead.
Includes 200s A/B test scripts (on/off) and 600s full run.
WARMDOWN_ITERS=2000 folded in from sweep results.
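The Siphon objective above reduces to a one-line mixture; a minimal per-token sketch (alpha value here is illustrative, not the PR's setting):

```python
import math

def siphon_nll(p_model, p_ngram, alpha=0.3):
    """Mixed NLL: -log(alpha * p_ngram + (1 - alpha) * p_model).

    With alpha=0 this is standard cross-entropy on the model's probability;
    with alpha>0 the n-gram table absorbs probability mass on tokens it
    predicts well, so the gradient pressure shifts to hard tokens.
    """
    mix = alpha * p_ngram + (1.0 - alpha) * p_model
    return -math.log(mix)
```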

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…r shared crawler

Replaces fixed orthogonal loop_pos offsets (UT-style) with a learned
instruction channel derived from the flat encoder's output. The encoder
generates per-token, per-iteration offsets via a low-rank projection
(inst_proj: model_dim → K*inst_dim, inst_up: inst_dim → model_dim per loop).
This resolves the shared-weight gradient conflict from Frugendorff by giving
the crawler a content-aware signal for each loop iteration.
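The shape arithmetic of the instruction channel can be sketched as follows. This is an illustration of the stated projections (inst_proj: model_dim to K*inst_dim, inst_up: inst_dim to model_dim per loop) with made-up dimensions, not the PR's code:

```python
import numpy as np

# Hypothetical sizes: batch B, seq T, model_dim D, K loop iterations, rank r
B, T, D, K, r = 2, 8, 64, 4, 16
enc_out = np.random.randn(B, T, D)  # flat encoder output

inst_proj = np.random.randn(D, K * r) * 0.02               # model_dim -> K*inst_dim
inst_up = [np.random.randn(r, D) * 0.02 for _ in range(K)]  # per-loop inst_dim -> model_dim

# Per-token, per-iteration instruction codes, then one offset per loop
inst = (enc_out @ inst_proj).reshape(B, T, K, r)
offsets = [inst[:, :, k] @ inst_up[k] for k in range(K)]   # each (B, T, D)
```

Because each loop iteration gets its own content-derived offset, the shared crawler weights see K distinguishable inputs instead of K copies of the same residual stream.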

Also ports SOTA eval stack from purple_1: Dirichlet n-gram smoothing,
phrase cache (48/36/28/20/16 tok suffix matching), RegimeTracker.
Training config from green v1 (SOTA 1.1129 BPB) + MATRIX_LR=0.03, WARMDOWN=2000.

Also logs H-FRUG findings (all three hypotheses failed, root cause documented).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A1-A6 ablation ladder (control → Frugendorff baseline → FX-Wing → width/depth sweep).
FX-1 through FX-5 further research directions if main hypothesis confirms.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…l.sh

Cleaner home for the bio-concept sweep script. REPO_ROOT updated for
new depth (experiments/Biology_concepts/ → ../../).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Log Siphon as dead (+0.151 sliding), add warmdown hypotheses doc.
Only confirmed change from v1: WARMDOWN_ITERS 3500→2000 (-0.0087 at 200s).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Key results (180s, 8xH100):
- baseline: 1.1981 base / 0.4742 ngram9 BPB (2058 steps)
- tornado:  1.3614 base / 0.5221 ngram9 BPB (1105 steps, +0.048 worse)

Root cause: cold EMA teacher + double-forward overhead (163ms vs 87ms/step).
Finding: ngram system rescues weak base models MORE — confirms hard-token
specialization is the right target. Bio concept results TBD (sweep still running).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Proven technique from PR#803 (0.442 BPB). Downweights bigram-predictable
tokens during training. Code already exists in green/train_gpt.py, just
needs the alpha turned on.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Python for-loop in _run_crawler breaks torch.compile fullgraph tracing.
FULLGRAPH=0 allows compile with graph breaks — restores ~3x throughput
vs no-compile (194ms/step → ~60ms/step, 3k→9k steps in 600s budget).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
4 loops = 4x effective depth from 1 shared block.
Artifact cost: near-zero (same weights, just looped more).
Previous run was 8.6MB at LOOPS=2 — LOOPS=4 should be similar.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v7 (WD=2000 + COMPLEMENT=0.5): sliding 1.1169, ngram 0.4500
v1 SOTA: sliding 1.1129, ngram 0.4489
Complement training costs +0.004 sliding, +0.001 ngram. Dead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Shared crawler block serves K loop contexts simultaneously. int6 (±31)
assigns one quantization range to weights that must represent K different
input distributions — error compounds per loop (the Frugendorff catastrophe).

CRAWLER_QUANT_INT8=1 routes crawler_blocks.* params to int8 (±127) while
keeping flat_blocks at int6. 4× more quantization levels for the reservoir.
Dequantization path unchanged (int8 already handled via meta type field).
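The int6 vs int8 trade-off above is just the quantization step count: a minimal symmetric round-trip sketch (function names hypothetical, per-tensor scale assumed) shows why 127 levels halve-and-halve-again the worst-case error of 31 levels:

```python
import numpy as np

def symmetric_quantize(w, levels):
    """Quantize to integers in [-levels, levels] with a per-tensor scale.

    int6 here means levels=31, int8 means levels=127; worst-case
    round-trip error is scale/2 = max|w| / (2 * levels).
    """
    scale = np.abs(w).max() / levels
    q = np.clip(np.round(w / scale), -levels, levels).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float64) * scale
```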

Test design: same final_model.pt exported with int8=0 and int8=1.
Watch: final_int6_roundtrip_bpb - post_ema_val_bpb gap.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Spins up 1x H100 (configurable), git clones test branch, uploads data,
runs FX_Wing/run.sh with NPROC=1, pulls back final_model.pt + int6.ptz.
Auto-destroys instance after run. Pattern mirrors vast_cobra_ab_single_gpu.sh.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
4 bio concepts redesigned for DeltaNet chunk seams:
- Astrocyte: seam controller gates erase/write per chunk activity
- Myelin: Fibonacci-spaced chunk bridges bypass compression bottleneck
- Clonal Selection: top-K specialist state amplification at seams
- Circadian: φ-spaced irrational gate prevents recurrent attractor lock-in

Full ablation ladder C0→C6 targeting <1.06 base BPB, <0.44 ngram9.
Implementation order defined. Not copying PR openai#875 code.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add train_gpt.py (green copy + GatedDeltaNet chunked recurrence, bottom 6 layers)
and run.sh for Cambrian-0 baseline. Bio seam controllers (Astrocyte, Myelin,
Clonal, Circadian) to be layered on top per HYPOTHESIS.md ablation ladder.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Hardcoded 180s was blocking run_all.sh from passing longer wallclocks.
All 5 concept scripts now use ${MAX_WALLCLOCK_SECONDS:-180} pattern.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… Astrocyte)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
micro_train_gpt.py: replaces hard CUDA requirement with cuda/mps/cpu
fallback. All autocast and synchronize calls made device-aware.

run_micro.sh: tiny model (dim=128, 2f+1cx2, inst_dim=16), 120s wallclock,
single python3 process (no torchrun/DDP). CRAWLER_QUANT_INT8=1 active.
Tests instructed recurrence concept on any available device.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sequential delta-rule state S [B,H,Dh,Dh] is initialized to zero before the
loop and carried between crawler iterations. Each pass corrects prediction
errors via: S += β * outer(v - S@k, k), giving genuine iterative refinement
rather than repeating the same computation K times.
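The delta-rule update in the paragraph above can be written out directly. A minimal single-head numpy sketch of the stated formula S += β * outer(v - S@k, k):

```python
import numpy as np

def delta_rule_step(S, k, v, beta=1.0):
    """One delta-rule correction of the memory state.

    S: (Dh, Dh) per-head state; k, v: (Dh,) key and value.
    The update moves S @ k toward v, so with beta=1 and a unit key
    the state recalls v exactly on the next read.
    """
    err = v - S @ k                 # prediction error for this key
    return S + beta * np.outer(err, k)
```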

Changes:
- train_gpt.py: DeltaNetMemory class, DELTA_NET_HEADS hyperparam, CrawlerGPT
  init + _run_crawler state wiring, build_model() call site
- micro_train_gpt.py: identical DeltaNet additions (GB10 Blackwell path)
- run.sh: DELTA_NET_HEADS=2
- run_micro.sh: DELTA_NET_HEADS=2

Output projection zero-initialized → starts as residual no-op, won't hurt
baseline; any improvement is pure gain from the memory mechanism.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
torch.compile unrolls for t in range(T=2048) into 2048 graph nodes; constant
folding then tries to allocate [B,T,H,Dh] all at once → OOM on H100.
@torch.compiler.disable lets the rest of the model compile normally while
DeltaNet runs in eager, keeping the sequential T-loop intact.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@torch.compiler.disable on DeltaNet.forward causes a graph break that
triggers a sympy NaN bug in PyTorch 2.4's inductor on the preceding
RoPE subgraph. Disable DeltaNet for this run to get a clean instructed-
recurrence result. DeltaNet will be re-enabled once the compile path
is fixed (vectorized scan or suppress_errors).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
NTK-scaled RoPE with seq_len=2048 causes sympy to hit Invalid NaN comparison
during inductor value-range analysis (22081^(pos*0.0625) overflows to inf).
suppress_errors=True lets that subgraph fall back to eager silently while
the rest of the model compiles normally.
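The workaround described above amounts to one dynamo config flag; a sketch of the setting (placed before the `torch.compile` call):

```python
import torch._dynamo

# Let failing inductor subgraphs (here: the NTK-RoPE value-range analysis)
# fall back to eager silently instead of raising; everything else compiles.
torch._dynamo.config.suppress_errors = True
```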

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Nested Python loops over chunks×tokens with accumulated state S cannot
be traced by dynamo (shape inference fails). Mark forward as
@torch.compiler.disable — XSA attention layers still compile normally.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GatedDeltaNet.forward is @torch.compiler.disable — fullgraph=True
conflicts with graph breaks. Allow graph-breaks so XSA layers still
compile and GDN runs eagerly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
_astro_gate is (B,1). Two unsqueezes made it (B,1,1,1) — 4D — which
broadcast with bci (B,H,C) to (B,B,H,C) instead of (B,H,C).
view(B,1,1) gives (B,1,1) — correct 3D broadcast.
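The shape bug above is easy to reproduce in isolation: with a 4D gate, numpy/torch right-align the dimensions, so the leading B of the gate pairs with a fresh broadcast axis rather than the batch axis of bci. A minimal numpy sketch:

```python
import numpy as np

B, H, C = 2, 3, 5
bci = np.ones((B, H, C))
gate = np.ones((B, 1))

# 4D vs 3D: right-aligned broadcast gives (B, 1, 1, 1) * (1, B, H, C)
bad = gate.reshape(B, 1, 1, 1) * bci   # -> (B, B, H, C), wrong
good = gate.reshape(B, 1, 1) * bci     # -> (B, H, C), correct
```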

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
6x GatedDeltaNet (fla) + 5x standard attention, flat sequential.
Parallel Muon on MLP banks, AdamW on DeltaNet weights.
No U-Net skips, no SmearGate, no DTG (incompatible with DeltaNet).
Full n-gram eval + XSA-5 + BigramHash + Trigram on top.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ments default)

Fixes OOM on GDN backward — allocator fragmentation caused 20MiB alloc to
fail even with 12GiB reserved-but-unallocated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Octavian and others added 29 commits March 31, 2026 18:03
Model has SDPA fallback built in. Step time will be slightly slower
but results are valid. Abort was blocking fresh pods without FA3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…oadmap

Full leaderboard analysis (2026-03-31): we hold best legal open PR (openai#1120
at 1.10987). Only PR openai#1089 (1.1091) beats us — by 0.00077 BPB.

Stack audit of Rascal II: LeakyReLU²/LN-scale/XSA-all already present.
GPTQ code exists but SKIP_GPTQ=1. Warmdown 3500 vs leaders' 4000.
BigramHash 2048 vs leaders' 3072. zstd-22 vs Brotli-11.

Adds 4 research threads with prioritized hypothesis queue:
1. Rascal_III_GPTQ (biggest gap, code already in script)
2. Rascal_III_ARcal (self-gen calibration after GPTQ confirmed)
3. Rascal_III_Bigram3072 (vocab coverage, +~50KB)
4. Rascal_III_Warmdown4k + Brotli/minify

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…t_{early,late}, swa_dense)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
gptq (SKIP_GPTQ=0): biggest competition gap, code already written
bigram_3072: exact competition target from PR openai#1019
warmdown_4k: WARMDOWN_ITERS 3500→4000, matches leaders

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…KPOINT)

Adds SKIP_TRAIN/LOAD_CHECKPOINT support to train_gpt.py so post-only cases
skip the training loop and run GPTQ on an already-trained checkpoint.
Saves ~570s per post_only case. gptq is the only post_only case for now.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Corrects SCIENCE.md: "SLOT ruled illegal" was wrong — a community
member (msisovic, author_association:NONE), not an organizer. No
official ruling from 0hq/valerio-oai/xuandong-openai on any SLOT PR.
All SLOT PRs remain open as of 2026-03-31.

Full legality analysis of our SLOT implementation:
- Delta optimized on y_batch including 64 not-yet-scored new tokens (3.1%)
- First window: 100% new tokens used for delta optimization
- Technically non-compliant with "already evaluated" rule
- Context-only SLOT (adapt on scored prefix only) is unambiguously legal

Adds SLOT thread with proxy result (-0.0085 sw_bpb, 1200 steps),
legal status table, and context-only SLOT fix design.

Strategy: wait for official organizer ruling before including SLOT
in any submission.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CRAWLER_TAP_DIM=32 CRAWLER_TAP_LOOP_SPECIFIC=0 on BW5 baseline.
Gate evidence: −0.00352 int6_sw vs control (BW7 MegaGate, 2000 steps).
Target: beat 1.18672385 BPB (BW5 champion).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Optimizes hidden-state delta only on positions 0..wlen-stride-1 (context).
New tokens wlen-stride..wlen-1 are scored under the delta but never used
for optimization. Window 0 uses base model (no prior context). Causally safe.

Prior ambiguous SLOT showed -0.0085 proxy delta — this tests legal signal.
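The causal split described above can be stated as a tiny helper. This is an illustration of the position bookkeeping only (function name hypothetical), not the optimization loop itself:

```python
def slot_position_split(wlen, stride):
    """Context-only SLOT position split for one sliding window.

    The hidden-state delta is optimized only on already-scored context
    positions 0..wlen-stride-1; the trailing stride of new tokens is
    scored under the adapted delta but never used for optimization.
    """
    context = range(0, wlen - stride)   # delta optimized here
    new = range(wlen - stride, wlen)    # scored only, never optimized
    return context, new
```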

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BW9: tests ANCHOR_DIM=32 on BW8 (TAP) baseline.
SCIENCE.md: records all 8 MegaGate arms, TAP-03 promoted to working
baseline, ANC-05 queued as next gate, FLAT/SMEAR closed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Port gptq_calibrate_loop_aware + mixed_quantize_int6_gptq from ClownCar
into BW8 (TAP) base script. GPTQ runs post-training on uncompiled
base_model — compatible with COMPILE_FULLGRAPH=1 training graph.

Arms: BWGQ-00 (naive int6) vs BWGQ-01 (LOOP_AWARE_GPTQ=1).
Historical delta: −0.0062 BPB in CL2 lineage.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Complete sweep on 4×H100 pod. Key findings:
- baseline: 1.154747 sliding_bpb, quant_gap=+0.0217
- All arch/sched variants flat or regressing vs baseline
- warmdown_4k: +0.0034 (HURTS — time-based schedule causes earlier QAT)
- GPTQ (post-train): 0 delta — calibration bug, 0 layers quantized
- Legal SLOT gate passed separately (-0.0057 proxy)

RESULTS.md: full data table + verdicts + GPTQ bug analysis
SCIENCE.md: updated threads, priority order, dead cases marked

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
One variable: SLOT_ENABLED=1 (0→1). Training unchanged, eval-only.
Legal context-only SLOT: optimize delta on scored context positions only,
score new stride-64 positions under adapted delta.

Gate signal (prior QK_Gain_SLOT_Legal run): −0.0057 proxy BPB.
Estimate: −0.0004 to −0.0011 at full 8×GPU run.

Changes vs vault:
- Hyperparameters: slot_enabled, slot_steps, slot_lr, slot_max_windows
- GPT: forward_hidden(), compute_logits_from_hidden()
- eval_val_sliding: SLOT-aware (context-only legal variant)
- Call site: slot args threaded through, slot_tag in log lines

SCIENCE.md: GPTQ thread updated with bug findings, all sweep verdicts added

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Gate: BWFF-00 (NUM_FLAT_LAYERS=4) vs BWFF-01 (NUM_FLAT_LAYERS=5)
ONE variable. 4×GPU SDPA, 2000 steps, seed=444.

BW proxy winner (1.54404) was abandoned for pyramid choke (failed +0.020).
Never validated at BW5/BW8 quality level. Closing the gap.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BWGQ-00 (naive): 1.28889 BPB, 9.24MB
BWGQ-01 (GPTQ):  1.28403 BPB, 10.23MB (+0.99MB)
GPTQ calibration: 38 layers in 5.3s, post-training only

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Arms (1-GPU, 120s proxy, seed=444):
  CTRL-00       : no eval-time adaptation
  SLOT-01       : legal context-only SLOT (8 steps, lr=0.005)
  SCALE-02      : Score-first Scale TTT (attn_scale+mlp_scale, lr=1e-4)
  SLOT+SCALE-03 : both combined

Scale TTT is a first test of adapting only the Adam-trained (non-Muon) scale
params (attn_scale + mlp_scale) per-chunk during sliding window eval.
These are the Rascal equivalent of LN scale adaptation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Beats BW5 (1.18672385) by −0.00380 BPB.
7893 steps, 76ms/step, 9.96MB, GPTQ 8.6s post-training.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace broken train_gpt.py (which duplicated the entire transformer body
with forward_hidden/compute_logits_from_hidden) with train_gpt_slot.py —
a clean vault copy with SLOT injected only in eval_val_sliding.

SLOT impl: forward hook on final_norm captures hidden once (body runs once),
AdamW optimizes delta on context positions using only the head inline (3 lines),
scores new positions from hidden + delta. Zero model class changes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Beats BW5 (1.18672385) by −0.01021 BPB.
Beats BW10_GPTQ (1.18292670) by −0.00641 BPB.
7074 steps, 85ms/step, 10.05MB. Gate inflation 1.42×.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@newjordan newjordan changed the title Nightcrawler — 1.17651313 val_bpb (seed 444) Nightcrawler — 1.176bpb 10mb Apr 1, 2026