
Memmap multi-shard data pipeline + GPU prefetch + LeakyReLU² + Legal TTT + Parallel Muon#726

Open
DeepReinforce wants to merge 1 commit into openai:main from DeepReinforce:main

Conversation

@DeepReinforce

Memmap multi-shard data pipeline + GPU prefetch + LeakyReLU² + Legal TTT + Parallel Muon

val_bpb: 1.1147 (3-seed mean, std 0.0005) | ~15.23 MB (mean int6+lzma + code) | 8×H100 80GB HBM3

Results (8×H100 80GB HBM3, PyTorch 2.10.0+cu126)

| Seed | step_avg | steps | Pre-TTT bpb | Post-TTT bpb | TTT gain | TTT time | Artifact (bytes) |
|------|----------|-------|-------------|--------------|----------|----------|------------------|
| 1337 | 83.1ms | 7,223 | 1.1171 | 1.1140 | -0.0031 | 385s | 15,977,541 |
| 1111 | 83.0ms | 7,227 | 1.1178 | 1.1149 | -0.0029 | 383s | 15,964,369 |
| 13 | 83.0ms | 7,226 | 1.1179 | 1.1151 | -0.0028 | 386s | 15,957,041 |
| Mean | 83.1ms | 7,225 | 1.1176 | 1.1147 (std 0.0005) | -0.0029 | ~384s | |

Pre-TTT bpb is final_int6_sliding_window_exact (GPTQ-lite int6 dequantized weights, sliding window, stride 64). Post-TTT bpb is legal_ttt_exact after score-first chunk adaptation. Artifact is the total submission size in bytes: int6+lzma compressed weights plus the UTF-8 train_gpt.py bytes.

Key changes

1. Training data pipeline

The old loader walked shards in fixed order and took contiguous spans per rank. The new DistributedTokenLoader:

  • Memory-maps each .bin shard (numpy.memmap) with a small header-token cache to avoid repeated header reads.
  • Samples global training windows across shards: per batch it draws multiple shards, with probability weighted by each shard's remaining usable blocks (the weighting exponent tightens as training progresses); walks the valid 2048-token blocks inside each shard with a coprime stride; and interleaves the resulting windows for batch diversity.
  • Merges nearby reads on the same shard into one slab copy to reduce mmap churn.
  • Overlaps I/O and compute: a daemon thread builds CPU (x, y) batches into a queue while the main thread trains; CUDA streams and events prefetch the next batch to GPU while the current step runs.
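The coprime-stride walk in the second bullet can be sketched as follows (a minimal illustration; the function and parameter names are hypothetical, not the actual DistributedTokenLoader API):

```python
import math
import random

def coprime_stride_blocks(num_blocks: int, seed: int):
    """Visit every valid block index exactly once, in a scattered-looking
    order, by stepping with a stride coprime to num_blocks."""
    rng = random.Random(seed)
    # A stride coprime to num_blocks makes the walk a full permutation.
    while True:
        stride = rng.randrange(1, num_blocks)
        if math.gcd(stride, num_blocks) == 1:
            break
    start = rng.randrange(num_blocks)
    for i in range(num_blocks):
        yield (start + i * stride) % num_blocks

order = list(coprime_stride_blocks(num_blocks=10, seed=0))
# Each of the 10 block indices appears exactly once, without shuffling
# the full index list in memory.
```

The appeal over a materialized shuffle is that the permutation is generated on the fly, which matters when a shard holds millions of candidate blocks.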

Together this replaces the old strictly sequential per-rank stream with a stratified, multi-shard mixture and async H2D overlap. Model forward, optimizer (Parallel Muon + AdamW), EMA/SWA, late QAT, and export paths are otherwise in the same family as the previous script.
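The producer/consumer half of that overlap can be sketched like this (names are illustrative, not the script's actual API; the GPU half would additionally copy the next batch on a side `torch.cuda.Stream` with `non_blocking=True` while the current step runs):

```python
import queue
import threading

def start_batch_producer(make_batch, depth: int = 4) -> "queue.Queue":
    """Fill a bounded queue with CPU batches on a daemon thread so batch
    construction overlaps the training step."""
    q = queue.Queue(maxsize=depth)   # bounded: producer blocks when full

    def worker():
        while True:
            q.put(make_batch())      # build the next (x, y) pair off the hot path

    threading.Thread(target=worker, daemon=True).start()
    return q

# Training-loop side: pull the next ready batch.
batches = start_batch_producer(lambda: ("x_tokens", "y_tokens"))
x, y = batches.get()
```

The bounded queue gives natural backpressure: if the trainer stalls, the producer blocks rather than buffering unbounded CPU memory.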

2. Defaults and TTT

  • TTT_ENABLED defaults to 1 in train_gpt.py (it was 0 in train_gpt_old.py), so legal score-first TTT runs at the end unless disabled.
  • TTT_FREEZE_BLOCKS defaults to 2 for the logged runs (freeze_blocks=2 in ttt_sliding:start in the training logs). The prior README’s command used TTT_FREEZE_BLOCKS=0 (all blocks adapted); this submission’s numbers were obtained with the two bottom blocks frozen during TTT, unless you override the env var.
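Concretely, the defaults above correspond to env-var reads of roughly this shape (a sketch; the exact parsing in train_gpt.py may differ):

```python
import os

# On by default now; set TTT_ENABLED=0 to skip legal TTT at the end of training.
TTT_ENABLED = int(os.environ.get("TTT_ENABLED", "1"))
# Bottom blocks left frozen during TTT; the prior README used 0 (adapt all blocks).
TTT_FREEZE_BLOCKS = int(os.environ.get("TTT_FREEZE_BLOCKS", "2"))
```

Overriding either variable in the launch environment reproduces the older behavior without touching the script.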

3. LeakyReLU² (unchanged from prior submission)

Same MLP nonlinearity as before:

```python
x = F.leaky_relu(F.linear(x, up_w.to(x.dtype)), negative_slope=0.5)
return F.linear(x.square(), down_w.to(x.dtype))
```

LeakyReLU(0.5) keeps gradients on negative pre-activations; squaring preserves the non-negative “relu²-style” bias.
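A scalar sanity check of that claim (plain-Python sketch, independent of the PyTorch code above):

```python
def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    """Scalar version of the MLP nonlinearity: LeakyReLU(0.5), then square."""
    h = x if x >= 0.0 else slope * x  # negatives are scaled, not zeroed
    return h * h                      # squaring makes the output non-negative

print(leaky_relu_squared(-2.0))  # 1.0: a plain ReLU² would give 0.0 here
print(leaky_relu_squared(3.0))   # 9.0: positive side matches ReLU²
```

So negative pre-activations still produce a signal (and a gradient through the 0.5 slope), while the output stays non-negative like ReLU².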

…+ Legal TTT + Parallel Muon (val_bpb: 1.1147 3-seed mean)
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Mar 29, 2026
3-seed mean val_bpb: 1.1123 (std 0.0005)
All artifacts under 16MB, all eval under 600s.

Key changes from PR openai#549:
- Coprime-stride multi-shard data pipeline (PR openai#726 style)
- Full Hessian GPTQ with Cholesky error compensation
- XSA on all 11 layers
- BigramHash(2816×112)
- No TTT (sliding-only outperforms on this stack)

Built on PR openai#549 by @abaybektursun.
barneywohl added a commit to barneywohl/parameter-golf that referenced this pull request Mar 30, 2026
…816 (val_bpb 1.1116)

3-seed mean: 1.1116 ± 0.0005
Seeds: 1337=1.1110, 42=1.1121, 2024=1.1118

Stack: LeakyReLU² fused Triton kernel + Full Hessian GPTQ (actorder+Cholesky)
+ coprime-stride multi-shard loader + XSA on all 11 layers + BigramHash(2816x112)
+ fullgraph=True torch.compile

Built on PR openai#549 scaffold with techniques from PRs openai#726, openai#634, openai#1019, openai#287.
