Nightcrawler — 1.176bpb 10mb #1208

Open
newjordan wants to merge 306 commits into openai:main from newjordan:submission/nightcrawler

Conversation


@newjordan newjordan commented Apr 1, 2026

Nightcrawler

Adds a fifth flat transformer layer on each side of the crawler bottleneck (5F+1C+5F vs 4F+1C+4F), with shared TAP encoder connections to each crawler loop.

Results

| Seed | val_bpb (sliding window) | Steps | Size |
|------|--------------------------|-------|------|
| 444  | 1.17651313 | 7074 | 10048191 B |
| 4    | 1.17676091 | 7074 | 10266138 B |
| 300  | 1.17490448 | 7077 | 10343385 B |
| mean | 1.1761 | | 10343385 B |

Hardware: 8×H100 SXM · 600s wallclock · `bytes_code`: 119294

Reproduce

```bash
SEED=444 NPROC_PER_NODE=8 torchrun --standalone --nproc_per_node=8 \
  records/track_10min_16mb/2026-04-01_Nightcrawler_8xH100/train_gpt.py
```

Octavian and others added 30 commits March 27, 2026 01:33
Phrase cache (PR openai#880 / PR openai#900 — proven +0.1 BPB, legal):
- Variable-length suffix matching at 48/36/28/20/16 token probe lengths
- One ctx+full count table pair per probe length (4M buckets each)
- 48-prime XOR hash — unique prime per context position up to length 48
- Dirichlet smoothing: p=(min(fc,cc)+c*neural)/(ctx+c), c=2.0
- Applied inline after n-gram mixing, before NLL conversion
- Score-first: tables updated with chunk tokens AFTER all scoring done
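The Dirichlet smoothing step above can be sketched as a small function. This is a minimal illustration of the stated formula p=(min(fc,cc)+c*neural)/(ctx+c), not the PR's actual implementation; the function name and argument names are hypothetical.

```python
def dirichlet_smooth(full_count, ctx_count, neural_p, c=2.0):
    """Blend a phrase-cache count estimate with the neural probability.

    p = (min(full_count, ctx_count) + c * neural_p) / (ctx_count + c)

    With zero counts this reduces to neural_p; as ctx_count grows,
    the empirical continuation frequency dominates the mix.
    """
    fc = min(full_count, ctx_count)  # full-phrase count capped by context count
    return (fc + c * neural_p) / (ctx_count + c)
```

Note that c=2.0 acts as a pseudo-count: the neural model contributes as if it had been observed c times in this context.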

RegimeTracker (PR openai#880):
- Tracks match rate + token diversity over rolling 4096-token window
- Adapts effective phrase concentration: repetitive/boilerplate content
  → lower c (more cache trust); novel prose → higher c (more neural trust)
- Multiplier range [0.7, 1.5], effective_c = base_c / mult
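A minimal sketch of the RegimeTracker idea, under the assumption that the multiplier scales linearly with the rolling match rate (the actual mapping in the PR may differ; the class and method names here are hypothetical):

```python
from collections import deque

class RegimeTracker:
    """Sketch: map a rolling phrase-match rate to a concentration multiplier.

    High match rate (repetitive/boilerplate content) -> higher multiplier
    -> lower effective c -> more trust in the phrase cache. Low match rate
    (novel prose) -> lower multiplier -> more trust in the neural model.
    """
    def __init__(self, window=4096, base_c=2.0, lo=0.7, hi=1.5):
        self.hits = deque(maxlen=window)  # rolling window of match outcomes
        self.base_c, self.lo, self.hi = base_c, lo, hi

    def update(self, matched):
        self.hits.append(1.0 if matched else 0.0)

    def effective_c(self):
        rate = sum(self.hits) / max(len(self.hits), 1)
        mult = self.lo + (self.hi - self.lo) * rate  # assumed linear mapping
        return self.base_c / mult
```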

Config improvements:
- WARMDOWN_ITERS=2000 (confirmed best from A/B sweep)
- NGRAM_CHUNK_TOKENS=65536 (PR openai#850, 15x more cache refreshes vs 1M)
- MATRIX_LR=0.03 (PR openai#859)

ARTIFACT_NGRAM=0 remains disabled (legally gray).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Siphon trains on -log(α·p_ngram + (1-α)·p_model) instead of standard CE.
GPU-side bigram hash count tables, zero new params, ~0.3ms overhead.
Includes 200s A/B test scripts (on/off) and 600s full run.
WARMDOWN_ITERS=2000 folded in from sweep results.
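The Siphon objective above reduces to a one-line mixture; a minimal per-token sketch (alpha value here is illustrative, not the PR's setting):

```python
import math

def siphon_nll(p_model, p_ngram, alpha=0.3):
    """Mixed NLL: -log(alpha * p_ngram + (1 - alpha) * p_model).

    With alpha=0 this is standard cross-entropy on the model's probability;
    with alpha>0 the n-gram table absorbs probability mass on tokens it
    predicts well, so the gradient pressure shifts to hard tokens.
    """
    mix = alpha * p_ngram + (1.0 - alpha) * p_model
    return -math.log(mix)
```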

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…r shared crawler

Replaces fixed orthogonal loop_pos offsets (UT-style) with a learned
instruction channel derived from the flat encoder's output. The encoder
generates per-token, per-iteration offsets via a low-rank projection
(inst_proj: model_dim → K*inst_dim, inst_up: inst_dim → model_dim per loop).
This resolves the shared-weight gradient conflict from Frugendorff by giving
the crawler a content-aware signal for each loop iteration.
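The shape arithmetic of the instruction channel can be sketched as follows. This is an illustration of the stated projections (inst_proj: model_dim to K*inst_dim, inst_up: inst_dim to model_dim per loop) with made-up dimensions, not the PR's code:

```python
import numpy as np

# Hypothetical sizes: batch B, seq T, model_dim D, K loop iterations, rank r
B, T, D, K, r = 2, 8, 64, 4, 16
enc_out = np.random.randn(B, T, D)  # flat encoder output

inst_proj = np.random.randn(D, K * r) * 0.02               # model_dim -> K*inst_dim
inst_up = [np.random.randn(r, D) * 0.02 for _ in range(K)]  # per-loop inst_dim -> model_dim

# Per-token, per-iteration instruction codes, then one offset per loop
inst = (enc_out @ inst_proj).reshape(B, T, K, r)
offsets = [inst[:, :, k] @ inst_up[k] for k in range(K)]   # each (B, T, D)
```

Because each loop iteration gets its own content-derived offset, the shared crawler weights see K distinguishable inputs instead of K copies of the same residual stream.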

Also ports SOTA eval stack from purple_1: Dirichlet n-gram smoothing,
phrase cache (48/36/28/20/16 tok suffix matching), RegimeTracker.
Training config from green v1 (SOTA 1.1129 BPB) + MATRIX_LR=0.03, WARMDOWN=2000.

Also logs H-FRUG findings (all three hypotheses failed, root cause documented).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A1-A6 ablation ladder (control → Frugendorff baseline → FX-Wing → width/depth sweep).
FX-1 through FX-5 further research directions if main hypothesis confirms.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…l.sh

Cleaner home for the bio-concept sweep script. REPO_ROOT updated for
new depth (experiments/Biology_concepts/ → ../../).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Log Siphon as dead (+0.151 sliding), add warmdown hypotheses doc.
Only confirmed change from v1: WARMDOWN_ITERS 3500→2000 (-0.0087 at 200s).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Key results (180s, 8xH100):
- baseline: 1.1981 base / 0.4742 ngram9 BPB (2058 steps)
- tornado:  1.3614 base / 0.5221 ngram9 BPB (1105 steps, +0.048 worse)

Root cause: cold EMA teacher + double-forward overhead (163ms vs 87ms/step).
Finding: ngram system rescues weak base models MORE — confirms hard-token
specialization is the right target. Bio concept results TBD (sweep still running).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Proven technique from PR#803 (0.442 BPB). Downweights bigram-predictable
tokens during training. Code already exists in green/train_gpt.py, just
needs the alpha turned on.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Python for-loop in _run_crawler breaks torch.compile fullgraph tracing.
FULLGRAPH=0 allows compile with graph breaks — restores ~3x throughput
vs no-compile (194ms/step → ~60ms/step, 3k→9k steps in 600s budget).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
4 loops = 4x effective depth from 1 shared block.
Artifact cost: near-zero (same weights, just looped more).
Previous run was 8.6MB at LOOPS=2 — LOOPS=4 should be similar.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v7 (WD=2000 + COMPLEMENT=0.5): sliding 1.1169, ngram 0.4500
v1 SOTA: sliding 1.1129, ngram 0.4489
Complement training costs +0.004 sliding, +0.001 ngram. Dead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Shared crawler block serves K loop contexts simultaneously. int6 (±31)
assigns one quantization range to weights that must represent K different
input distributions — error compounds per loop (the Frugendorff catastrophe).

CRAWLER_QUANT_INT8=1 routes crawler_blocks.* params to int8 (±127) while
keeping flat_blocks at int6. 4× more quantization levels for the reservoir.
Dequantization path unchanged (int8 already handled via meta type field).
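The int6 vs int8 trade-off above is just the quantization step count: a minimal symmetric round-trip sketch (function names hypothetical, per-tensor scale assumed) shows why 127 levels halve-and-halve-again the worst-case error of 31 levels:

```python
import numpy as np

def symmetric_quantize(w, levels):
    """Quantize to integers in [-levels, levels] with a per-tensor scale.

    int6 here means levels=31, int8 means levels=127; worst-case
    round-trip error is scale/2 = max|w| / (2 * levels).
    """
    scale = np.abs(w).max() / levels
    q = np.clip(np.round(w / scale), -levels, levels).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float64) * scale
```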

Test design: same final_model.pt exported with int8=0 and int8=1.
Watch: final_int6_roundtrip_bpb - post_ema_val_bpb gap.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Spins up 1x H100 (configurable), git clones test branch, uploads data,
runs FX_Wing/run.sh with NPROC=1, pulls back final_model.pt + int6.ptz.
Auto-destroys instance after run. Pattern mirrors vast_cobra_ab_single_gpu.sh.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
4 bio concepts redesigned for DeltaNet chunk seams:
- Astrocyte: seam controller gates erase/write per chunk activity
- Myelin: Fibonacci-spaced chunk bridges bypass compression bottleneck
- Clonal Selection: top-K specialist state amplification at seams
- Circadian: φ-spaced irrational gate prevents recurrent attractor lock-in

Full ablation ladder C0→C6 targeting <1.06 base BPB, <0.44 ngram9.
Implementation order defined. Not copying PR openai#875 code.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add train_gpt.py (green copy + GatedDeltaNet chunked recurrence, bottom 6 layers)
and run.sh for Cambrian-0 baseline. Bio seam controllers (Astrocyte, Myelin,
Clonal, Circadian) to be layered on top per HYPOTHESIS.md ablation ladder.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Hardcoded 180s was blocking run_all.sh from passing longer wallclocks.
All 5 concept scripts now use ${MAX_WALLCLOCK_SECONDS:-180} pattern.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… Astrocyte)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
micro_train_gpt.py: replaces hard CUDA requirement with cuda/mps/cpu
fallback. All autocast and synchronize calls made device-aware.

run_micro.sh: tiny model (dim=128, 2f+1cx2, inst_dim=16), 120s wallclock,
single python3 process (no torchrun/DDP). CRAWLER_QUANT_INT8=1 active.
Tests instructed recurrence concept on any available device.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sequential delta-rule state S [B,H,Dh,Dh] is initialized to zero before the
loop and carried between crawler iterations. Each pass corrects prediction
errors via: S += β * outer(v - S@k, k), giving genuine iterative refinement
rather than repeating the same computation K times.
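The delta-rule update in the paragraph above can be written out directly. A minimal single-head numpy sketch of the stated formula S += β * outer(v - S@k, k):

```python
import numpy as np

def delta_rule_step(S, k, v, beta=1.0):
    """One delta-rule correction of the memory state.

    S: (Dh, Dh) per-head state; k, v: (Dh,) key and value.
    The update moves S @ k toward v, so with beta=1 and a unit key
    the state recalls v exactly on the next read.
    """
    err = v - S @ k                 # prediction error for this key
    return S + beta * np.outer(err, k)
```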

Changes:
- train_gpt.py: DeltaNetMemory class, DELTA_NET_HEADS hyperparam, CrawlerGPT
  init + _run_crawler state wiring, build_model() call site
- micro_train_gpt.py: identical DeltaNet additions (GB10 Blackwell path)
- run.sh: DELTA_NET_HEADS=2
- run_micro.sh: DELTA_NET_HEADS=2

Output projection zero-initialized → starts as residual no-op, won't hurt
baseline; any improvement is pure gain from the memory mechanism.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
torch.compile unrolls for t in range(T=2048) into 2048 graph nodes; constant
folding then tries to allocate [B,T,H,Dh] all at once → OOM on H100.
@torch.compiler.disable lets the rest of the model compile normally while
DeltaNet runs in eager, keeping the sequential T-loop intact.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@torch.compiler.disable on DeltaNet.forward causes a graph break that
triggers a sympy NaN bug in PyTorch 2.4's inductor on the preceding
RoPE subgraph. Disable DeltaNet for this run to get a clean instructed-
recurrence result. DeltaNet will be re-enabled once the compile path
is fixed (vectorized scan or suppress_errors).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
NTK-scaled RoPE with seq_len=2048 causes sympy to hit Invalid NaN comparison
during inductor value-range analysis (22081^(pos*0.0625) overflows to inf).
suppress_errors=True lets that subgraph fall back to eager silently while
the rest of the model compiles normally.
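The workaround described above amounts to one dynamo config flag; a sketch of the setting (placed before the `torch.compile` call):

```python
import torch._dynamo

# Let failing inductor subgraphs (here: the NTK-RoPE value-range analysis)
# fall back to eager silently instead of raising; everything else compiles.
torch._dynamo.config.suppress_errors = True
```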

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Nested Python loops over chunks×tokens with accumulated state S cannot
be traced by dynamo (shape inference fails). Mark forward as
@torch.compiler.disable — XSA attention layers still compile normally.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GatedDeltaNet.forward is @torch.compiler.disable — fullgraph=True
conflicts with graph breaks. Allow graph-breaks so XSA layers still
compile and GDN runs eagerly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
_astro_gate is (B,1). Two unsqueezes made it (B,1,1,1) — 4D — which
broadcast with bci (B,H,C) to (B,B,H,C) instead of (B,H,C).
view(B,1,1) gives (B,1,1) — correct 3D broadcast.
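The shape bug above is easy to reproduce in isolation: with a 4D gate, numpy/torch right-align the dimensions, so the leading B of the gate pairs with a fresh broadcast axis rather than the batch axis of bci. A minimal numpy sketch:

```python
import numpy as np

B, H, C = 2, 3, 5
bci = np.ones((B, H, C))
gate = np.ones((B, 1))

# 4D vs 3D: right-aligned broadcast gives (B, 1, 1, 1) * (1, B, H, C)
bad = gate.reshape(B, 1, 1, 1) * bci   # -> (B, B, H, C), wrong
good = gate.reshape(B, 1, 1) * bci     # -> (B, H, C), correct
```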

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
6x GatedDeltaNet (fla) + 5x standard attention, flat sequential.
Parallel Muon on MLP banks, AdamW on DeltaNet weights.
No U-Net skips, no SmearGate, no DTG (incompatible with DeltaNet).
Full n-gram eval + XSA-5 + BigramHash + Trigram on top.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ments default)

Fixes OOM on GDN backward — allocator fragmentation caused 20MiB alloc to
fail even with 12GiB reserved-but-unallocated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Octavian and others added 29 commits March 31, 2026 18:03
Model has SDPA fallback built in. Step time will be slightly slower
but results are valid. Abort was blocking fresh pods without FA3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…oadmap

Full leaderboard analysis (2026-03-31): we hold best legal open PR (openai#1120
at 1.10987). Only PR openai#1089 (1.1091) beats us — by 0.00077 BPB.

Stack audit of Rascal II: LeakyReLU²/LN-scale/XSA-all already present.
GPTQ code exists but SKIP_GPTQ=1. Warmdown 3500 vs leaders' 4000.
BigramHash 2048 vs leaders' 3072. zstd-22 vs Brotli-11.

Adds 4 research threads with prioritized hypothesis queue:
1. Rascal_III_GPTQ (biggest gap, code already in script)
2. Rascal_III_ARcal (self-gen calibration after GPTQ confirmed)
3. Rascal_III_Bigram3072 (vocab coverage, +~50KB)
4. Rascal_III_Warmdown4k + Brotli/minify

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…t_{early,late}, swa_dense)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
gptq (SKIP_GPTQ=0): biggest competition gap, code already written
bigram_3072: exact competition target from PR openai#1019
warmdown_4k: WARMDOWN_ITERS 3500→4000, matches leaders

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…KPOINT)

Adds SKIP_TRAIN/LOAD_CHECKPOINT support to train_gpt.py so post-only cases
skip the training loop and run GPTQ on an already-trained checkpoint.
Saves ~570s per post_only case. gptq is the only post_only case for now.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Corrects SCIENCE.md: "SLOT ruled illegal" was wrong — a community
member (msisovic, author_association:NONE), not an organizer. No
official ruling from 0hq/valerio-oai/xuandong-openai on any SLOT PR.
All SLOT PRs remain open as of 2026-03-31.

Full legality analysis of our SLOT implementation:
- Delta optimized on y_batch including 64 not-yet-scored new tokens (3.1%)
- First window: 100% new tokens used for delta optimization
- Technically non-compliant with "already evaluated" rule
- Context-only SLOT (adapt on scored prefix only) is unambiguously legal

Adds SLOT thread with proxy result (-0.0085 sw_bpb, 1200 steps),
legal status table, and context-only SLOT fix design.

Strategy: wait for official organizer ruling before including SLOT
in any submission.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CRAWLER_TAP_DIM=32 CRAWLER_TAP_LOOP_SPECIFIC=0 on BW5 baseline.
Gate evidence: −0.00352 int6_sw vs control (BW7 MegaGate, 2000 steps).
Target: beat 1.18672385 BPB (BW5 champion).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Optimizes hidden-state delta only on positions 0..wlen-stride-1 (context).
New tokens wlen-stride..wlen-1 are scored under the delta but never used
for optimization. Window 0 uses base model (no prior context). Causally safe.

Prior ambiguous SLOT showed -0.0085 proxy delta — this tests legal signal.
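The causal split described above can be stated as a tiny helper. This is an illustration of the position bookkeeping only (function name hypothetical), not the optimization loop itself:

```python
def slot_position_split(wlen, stride):
    """Context-only SLOT position split for one sliding window.

    The hidden-state delta is optimized only on already-scored context
    positions 0..wlen-stride-1; the trailing stride of new tokens is
    scored under the adapted delta but never used for optimization.
    """
    context = range(0, wlen - stride)   # delta optimized here
    new = range(wlen - stride, wlen)    # scored only, never optimized
    return context, new
```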

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BW9: tests ANCHOR_DIM=32 on BW8 (TAP) baseline.
SCIENCE.md: records all 8 MegaGate arms, TAP-03 promoted to working
baseline, ANC-05 queued as next gate, FLAT/SMEAR closed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Port gptq_calibrate_loop_aware + mixed_quantize_int6_gptq from ClownCar
into BW8 (TAP) base script. GPTQ runs post-training on uncompiled
base_model — compatible with COMPILE_FULLGRAPH=1 training graph.

Arms: BWGQ-00 (naive int6) vs BWGQ-01 (LOOP_AWARE_GPTQ=1).
Historical delta: −0.0062 BPB in CL2 lineage.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Complete sweep on 4×H100 pod. Key findings:
- baseline: 1.154747 sliding_bpb, quant_gap=+0.0217
- All arch/sched variants flat or regressing vs baseline
- warmdown_4k: +0.0034 (HURTS — time-based schedule causes earlier QAT)
- GPTQ (post-train): 0 delta — calibration bug, 0 layers quantized
- Legal SLOT gate passed separately (-0.0057 proxy)

RESULTS.md: full data table + verdicts + GPTQ bug analysis
SCIENCE.md: updated threads, priority order, dead cases marked

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
One variable: SLOT_ENABLED=1 (0→1). Training unchanged, eval-only.
Legal context-only SLOT: optimize delta on scored context positions only,
score new stride-64 positions under adapted delta.

Gate signal (prior QK_Gain_SLOT_Legal run): −0.0057 proxy BPB.
Estimate: −0.0004 to −0.0011 at full 8×GPU run.

Changes vs vault:
- Hyperparameters: slot_enabled, slot_steps, slot_lr, slot_max_windows
- GPT: forward_hidden(), compute_logits_from_hidden()
- eval_val_sliding: SLOT-aware (context-only legal variant)
- Call site: slot args threaded through, slot_tag in log lines

SCIENCE.md: GPTQ thread updated with bug findings, all sweep verdicts added

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Gate: BWFF-00 (NUM_FLAT_LAYERS=4) vs BWFF-01 (NUM_FLAT_LAYERS=5)
ONE variable. 4×GPU SDPA, 2000 steps, seed=444.

BW proxy winner (1.54404) was abandoned for pyramid choke (failed +0.020).
Never validated at BW5/BW8 quality level. Closing the gap.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BWGQ-00 (naive): 1.28889 BPB, 9.24MB
BWGQ-01 (GPTQ):  1.28403 BPB, 10.23MB (+0.99MB)
GPTQ calibration: 38 layers in 5.3s, post-training only

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Arms (1-GPU, 120s proxy, seed=444):
  CTRL-00       : no eval-time adaptation
  SLOT-01       : legal context-only SLOT (8 steps, lr=0.005)
  SCALE-02      : Score-first Scale TTT (attn_scale+mlp_scale, lr=1e-4)
  SLOT+SCALE-03 : both combined

Scale TTT is a first test of adapting only the Adam-trained (non-Muon) scale
params (attn_scale + mlp_scale) per-chunk during sliding window eval.
These are the Rascal equivalent of LN scale adaptation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Beats BW5 (1.18672385) by −0.00380 BPB.
7893 steps, 76ms/step, 9.96MB, GPTQ 8.6s post-training.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace broken train_gpt.py (which duplicated the entire transformer body
with forward_hidden/compute_logits_from_hidden) with train_gpt_slot.py —
a clean vault copy with SLOT injected only in eval_val_sliding.

SLOT impl: forward hook on final_norm captures hidden once (body runs once),
AdamW optimizes delta on context positions using only the head inline (3 lines),
scores new positions from hidden + delta. Zero model class changes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Beats BW5 (1.18672385) by −0.01021 BPB.
Beats BW10_GPTQ (1.18292670) by −0.00641 BPB.
7074 steps, 85ms/step, 10.05MB. Gate inflation 1.42×.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@newjordan newjordan changed the title Nightcrawler — 1.17651313 val_bpb (seed 444) Nightcrawler — 1.176bpb 10mb Apr 1, 2026