Memmap multi-shard data pipeline + GPU prefetch + LeakyReLU² + Legal TTT + Parallel Muon#726
Open
DeepReinforce wants to merge 1 commit into openai:main from
Conversation
…+ Legal TTT + Parallel Muon (val_bpb: 1.1147 3-seed mean)
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request on Mar 29, 2026:

3-seed mean val_bpb: 1.1123 (std 0.0005). All artifacts under 16 MB, all eval under 600 s.

Key changes from PR openai#549:
- Coprime-stride multi-shard data pipeline (PR openai#726 style)
- Full Hessian GPTQ with Cholesky error compensation
- XSA on all 11 layers
- BigramHash(2816×112)
- No TTT (sliding-only outperforms on this stack)

Built on PR openai#549 by @abaybektursun.
5 tasks
barneywohl added a commit to barneywohl/parameter-golf that referenced this pull request on Mar 30, 2026:

…816 (val_bpb 1.1116). 3-seed mean: 1.1116 ± 0.0005. Seeds: 1337 = 1.1110, 42 = 1.1121, 2024 = 1.1118.

Stack: LeakyReLU² fused Triton kernel + Full Hessian GPTQ (actorder + Cholesky) + coprime-stride multi-shard loader + XSA on all 11 layers + BigramHash(2816×112) + fullgraph=True torch.compile.

Built on the PR openai#549 scaffold with techniques from PRs openai#726, openai#634, openai#1019, openai#287.
val_bpb: 1.1147 (3-seed mean, std 0.0005) | ~15.23 MB (mean int6+lzma + code) | 8×H100 80GB HBM3
Results (8×H100 80GB HBM3, PyTorch 2.10.0+cu126)
Pre-TTT bpb is final_int6_sliding_window_exact (GPTQ-lite int6 dequantized weights, sliding window, stride 64). Post-TTT bpb is legal_ttt_exact after score-first chunk adaptation. Artifact is "Total submission size int6+lzma" (compressed weights + UTF-8 train_gpt.py bytes).

Key changes
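As a side note on the artifact metric defined above (lzma-compressed weight payload plus the raw UTF-8 bytes of the training script), a minimal sketch of how such a size could be computed; the function name and `preset=9` are assumptions, not the PR's actual export code:

```python
import lzma


def submission_size_bytes(weight_bytes: bytes, script_path: str) -> int:
    """Hypothetical re-creation of the "int6+lzma" size metric:
    lzma-compressed weight payload plus the raw bytes of train_gpt.py."""
    compressed = lzma.compress(weight_bytes, preset=9)  # preset is an assumption
    with open(script_path, "rb") as f:
        code_bytes = f.read()  # UTF-8 source counts uncompressed
    return len(compressed) + len(code_bytes)
```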
1. Training data pipeline
The old loader walked shards in fixed order and took contiguous spans per rank. The new DistributedTokenLoader:

- opens each .bin shard as a numpy.memmap, with a small header-token cache to avoid repeated header reads;
- stages (x, y) batches into a queue from a background thread while the main thread trains; CUDA streams and events prefetch the next batch to GPU while the current step runs.

Together this replaces the old strictly sequential per-rank stream with a stratified, multi-shard mixture and async H2D overlap. Model forward, optimizer (Parallel Muon + AdamW), EMA/SWA, late QAT, and export paths are otherwise in the same family as the previous script.
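A CPU-only sketch of the producer/consumer shape described above, assuming headerless uint16 token shards and a hypothetical stride of 7 for the coprime shard walk; the real loader's CUDA-stream H2D prefetch and header-token cache are omitted so the sketch stays self-contained:

```python
import queue
import threading

import numpy as np


def batch_producer(shard_paths, batch_size, seq_len, out_queue, steps, stride=7):
    """Background producer: walk memmapped shards with a stride coprime to
    the shard count (stride=7 is an assumption) and stage (x, y) batches
    into a bounded queue while the main thread trains."""
    shards = [np.memmap(p, dtype=np.uint16, mode="r") for p in shard_paths]
    pos = 0
    for step in range(steps):
        shard = shards[(step * stride) % len(shards)]  # coprime-stride shard walk
        need = batch_size * seq_len + 1  # +1 so targets can be shifted by one
        if pos + need > len(shard):
            pos = 0  # wrap around within the shard
        buf = np.asarray(shard[pos : pos + need], dtype=np.int64)
        x = buf[:-1].reshape(batch_size, seq_len)
        y = buf[1:].reshape(batch_size, seq_len)  # next-token targets
        out_queue.put((x, y))
        pos += batch_size * seq_len
    out_queue.put(None)  # sentinel: no more batches


def train_loop(shard_paths, batch_size=2, seq_len=8, steps=4):
    """Consumer side: batches are prepared concurrently with 'training'.
    The real pipeline would additionally copy the next batch to GPU on a
    side CUDA stream; here the step is just a shape check."""
    q = queue.Queue(maxsize=2)
    t = threading.Thread(
        target=batch_producer,
        args=(shard_paths, batch_size, seq_len, q, steps),
        daemon=True,
    )
    t.start()
    seen = 0
    while (item := q.get()) is not None:
        x, y = item
        assert x.shape == (batch_size, seq_len) and y.shape == x.shape
        seen += 1
    t.join()
    return seen
```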
2. Defaults and TTT
TTT_ENABLED defaults to 1 in train_gpt.py (it was 0 in train_gpt_old.py), so legal score-first TTT runs at the end unless disabled. TTT_FREEZE_BLOCKS defaults to 2 for the logged runs (freeze_blocks=2 in ttt_sliding:start in the training logs). The prior README's command used TTT_FREEZE_BLOCKS=0 (all blocks); this submission's numbers are with two frozen bottom blocks during TTT unless you override the env var.

3. LeakyReLU² (unchanged from prior submission)
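The env var names and defaults above come from the PR; how they are parsed inside train_gpt.py is not shown, so the following is only an assumed sketch of reading those defaults:

```python
import os


def ttt_config(env=os.environ):
    """Assumed sketch of the env-driven TTT defaults: TTT_ENABLED=1 and
    TTT_FREEZE_BLOCKS=2 unless overridden (parsing code is hypothetical,
    not copied from train_gpt.py)."""
    enabled = int(env.get("TTT_ENABLED", "1")) != 0          # default: TTT on
    freeze_blocks = int(env.get("TTT_FREEZE_BLOCKS", "2"))   # default: freeze 2 bottom blocks
    return enabled, freeze_blocks
```

Overriding would then look like `TTT_ENABLED=0 python train_gpt.py` or `TTT_FREEZE_BLOCKS=0 ...` to adapt all blocks, matching the prior README's command.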
Same MLP nonlinearity as before:
LeakyReLU(0.5) keeps gradients flowing on negative pre-activations; squaring preserves the non-negative "relu²-style" bias.
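The math of the nonlinearity in a numpy sketch (the PR ships it as a fused Triton kernel; this shows only the element-wise computation):

```python
import numpy as np


def leaky_relu_sq(x, negative_slope=0.5):
    """LeakyReLU(negative_slope) followed by squaring. For x < 0 the
    leaky branch keeps a nonzero slope, and squaring makes the output
    non-negative, preserving the relu²-style bias."""
    leaky = np.where(x >= 0, x, negative_slope * x)
    return np.square(leaky)
```

For example, x = -2 maps to (-1)² = 1 rather than the 0 that plain ReLU² would give, so negative pre-activations still contribute signal and gradient.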