
Memmap multi-shard data pipeline + GPU prefetch + LeakyReLU² + Legal TTT + Parallel Muon#726

Open
DeepReinforce wants to merge 1 commit into openai:main from DeepReinforce:main

Conversation

@DeepReinforce

Memmap multi-shard data pipeline + GPU prefetch + LeakyReLU² + Legal TTT + Parallel Muon

val_bpb: 1.1147 (3-seed mean, std 0.0005) | ~15.23 MB (mean int6+lzma + code) | 8×H100 80GB HBM3

Results (8×H100 80GB HBM3, PyTorch 2.10.0+cu126)

| Seed | step_avg | steps | Pre-TTT bpb | Post-TTT bpb | TTT gain | TTT time | Artifact (bytes) |
|------|----------|-------|-------------|--------------|----------|----------|------------------|
| 1337 | 83.1ms | 7,223 | 1.1171 | 1.1140 | -0.0031 | 385s | 15,977,541 |
| 1111 | 83.0ms | 7,227 | 1.1178 | 1.1149 | -0.0029 | 383s | 15,964,369 |
| 13 | 83.0ms | 7,226 | 1.1179 | 1.1151 | -0.0028 | 386s | 15,957,041 |
| Mean | 83.1ms | 7,225 | 1.1176 | 1.1147 (std 0.0005) | -0.0029 | ~384s | |

Pre-TTT bpb is final_int6_sliding_window_exact (GPTQ-lite int6 dequantized weights, sliding window, stride 64). Post-TTT bpb is legal_ttt_exact after score-first chunk adaptation. Artifact is the total submission size in bytes: int6+lzma compressed weights plus the UTF-8 train_gpt.py bytes.

Key changes

1. Training data pipeline

The old loader walked shards in fixed order and took contiguous spans per rank. The new DistributedTokenLoader:

  • Memory-maps each .bin shard (numpy.memmap) with a small header-token cache to avoid repeated header reads.
  • Samples global training windows across shards: per batch it draws multiple shards, with probability weighted by each shard's remaining usable blocks (the weighting exponent tightens as training progresses); walks the valid 2048-token blocks inside each shard with a coprime stride; and interleaves the resulting windows for batch diversity.
  • Merges nearby reads on the same shard into one slab copy to reduce mmap churn.
  • Overlaps I/O and compute: a daemon thread builds CPU (x, y) batches into a queue while the main thread trains; CUDA streams and events prefetch the next batch to GPU while the current step runs.
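The coprime-stride walk in the second bullet can be sketched as follows (a minimal illustration; the function and parameter names are hypothetical, not the actual DistributedTokenLoader API):

```python
import math
import random

def coprime_stride_blocks(num_blocks: int, seed: int):
    """Visit every valid block index exactly once, in a scattered-looking
    order, by stepping with a stride coprime to num_blocks."""
    rng = random.Random(seed)
    # A stride coprime to num_blocks makes the walk a full permutation.
    while True:
        stride = rng.randrange(1, num_blocks)
        if math.gcd(stride, num_blocks) == 1:
            break
    start = rng.randrange(num_blocks)
    for i in range(num_blocks):
        yield (start + i * stride) % num_blocks

order = list(coprime_stride_blocks(num_blocks=10, seed=0))
# Each of the 10 block indices appears exactly once, without shuffling
# the full index list in memory.
```

The appeal over a materialized shuffle is that the permutation is generated on the fly, which matters when a shard holds millions of candidate blocks.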

Together this replaces the old strictly sequential per-rank stream with a stratified, multi-shard mixture and async H2D overlap. Model forward, optimizer (Parallel Muon + AdamW), EMA/SWA, late QAT, and export paths are otherwise in the same family as the previous script.
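The producer/consumer half of that overlap can be sketched like this (names are illustrative, not the script's actual API; the GPU half would additionally copy the next batch on a side `torch.cuda.Stream` with `non_blocking=True` while the current step runs):

```python
import queue
import threading

def start_batch_producer(make_batch, depth: int = 4) -> "queue.Queue":
    """Fill a bounded queue with CPU batches on a daemon thread so batch
    construction overlaps the training step."""
    q = queue.Queue(maxsize=depth)   # bounded: producer blocks when full

    def worker():
        while True:
            q.put(make_batch())      # build the next (x, y) pair off the hot path

    threading.Thread(target=worker, daemon=True).start()
    return q

# Training-loop side: pull the next ready batch.
batches = start_batch_producer(lambda: ("x_tokens", "y_tokens"))
x, y = batches.get()
```

The bounded queue gives natural backpressure: if the trainer stalls, the producer blocks rather than buffering unbounded CPU memory.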

2. Defaults and TTT

  • TTT_ENABLED defaults to 1 in train_gpt.py (it was 0 in train_gpt_old.py), so legal score-first TTT runs at the end unless disabled.
  • TTT_FREEZE_BLOCKS defaults to 2 for the logged runs (freeze_blocks=2 in ttt_sliding:start in the training logs). The prior README’s command used TTT_FREEZE_BLOCKS=0 (all blocks adapted); this submission’s numbers were obtained with the two bottom blocks frozen during TTT, unless you override the env var.
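Concretely, the defaults above correspond to env-var reads of roughly this shape (a sketch; the exact parsing in train_gpt.py may differ):

```python
import os

# On by default now; set TTT_ENABLED=0 to skip legal TTT at the end of training.
TTT_ENABLED = int(os.environ.get("TTT_ENABLED", "1"))
# Bottom blocks left frozen during TTT; the prior README used 0 (adapt all blocks).
TTT_FREEZE_BLOCKS = int(os.environ.get("TTT_FREEZE_BLOCKS", "2"))
```

Overriding either variable in the launch environment reproduces the older behavior without touching the script.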

3. LeakyReLU² (unchanged from prior submission)

Same MLP nonlinearity as before:

```python
x = F.leaky_relu(F.linear(x, up_w.to(x.dtype)), negative_slope=0.5)
return F.linear(x.square(), down_w.to(x.dtype))
```

LeakyReLU(0.5) keeps gradients on negative pre-activations; squaring preserves the non-negative “relu²-style” bias.
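A scalar sanity check of that claim (plain-Python sketch, independent of the PyTorch code above):

```python
def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    """Scalar version of the MLP nonlinearity: LeakyReLU(0.5), then square."""
    h = x if x >= 0.0 else slope * x  # negatives are scaled, not zeroed
    return h * h                      # squaring makes the output non-negative

print(leaky_relu_squared(-2.0))  # 1.0: a plain ReLU² would give 0.0 here
print(leaky_relu_squared(3.0))   # 9.0: positive side matches ReLU²
```

So negative pre-activations still produce a signal (and a gradient through the 0.5 slope), while the output stays non-negative like ReLU².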

…+ Legal TTT + Parallel Muon (val_bpb: 1.1147 3-seed mean)
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Mar 29, 2026
3-seed mean val_bpb: 1.1123 (std 0.0005)
All artifacts under 16MB, all eval under 600s.

Key changes from PR openai#549:
- Coprime-stride multi-shard data pipeline (PR openai#726 style)
- Full Hessian GPTQ with Cholesky error compensation
- XSA on all 11 layers
- BigramHash(2816×112)
- No TTT (sliding-only outperforms on this stack)

Built on PR openai#549 by @abaybektursun.
barneywohl added a commit to barneywohl/parameter-golf that referenced this pull request Mar 30, 2026
…816 (val_bpb 1.1116)

3-seed mean: 1.1116 ± 0.0005
Seeds: 1337=1.1110, 42=1.1121, 2024=1.1118

Stack: LeakyReLU² fused Triton kernel + Full Hessian GPTQ (actorder+Cholesky)
+ coprime-stride multi-shard loader + XSA on all 11 layers + BigramHash(2816x112)
+ fullgraph=True torch.compile

Built on PR openai#549 scaffold with techniques from PRs openai#726, openai#634, openai#1019, openai#287.
