
review: Rerun of PR #1089 #1126

Open

AnirudhRahul wants to merge 9 commits into openai:main from AnirudhRahul:rerun

Conversation

@AnirudhRahul

Summary

  • rerun the latest fetched #1089 head (215193e) via the executable submission wrapper train_gpt.py, and commit the raw seed-42 rerun log plus a comparison note
  • document the rerun environment: 8x H100 80GB, driver 565.57.01, Python 3.12.13, torch 2.11.0+cu126, torch.version.cuda == 12.6
  • show that the rerun is materially slower than the bundled published seed 42 run: final step_avg 105.06ms vs 93.26ms (about 12.65% slower), with 5579 steps vs 6284
  • report the rerun seed-42 metrics: final_int6_sliding_window_exact val_bpb: 1.11228538 and final_int6_roundtrip_exact val_bpb: 1.13606522
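The slowdown figure in the summary can be cross-checked with a few lines of Python (values copied from the bullets above; this snippet is illustrative and not part of the submission):

```python
# Sanity-check the reported slowdown of the rerun vs the published seed-42 run.
published_step_avg_ms = 93.26   # bundled published run
rerun_step_avg_ms = 105.06      # this rerun

slowdown = (rerun_step_avg_ms - published_step_avg_ms) / published_step_avg_ms
print(f"rerun is {slowdown:.2%} slower per step")
```

This reproduces the "about 12.65% slower" figure quoted above.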

Notes

  • reference published seed 42 values from this record: 1.10859491 sliding-window exact, 1.13238846 roundtrip exact, and 93.26ms final step_avg
  • the latest fetched PR head updates README.md and train_gpt_human.py, but the runnable compressed train_gpt.py wrapper still logs `gptq:reserving 14000ms`, so the executable code path does not reflect the README-only change to a 9000ms reserve
  • rerun artifacts added in this branch:
    • records/track_10min_16mb/2026-03-28_TurboMuon_EngramLite_ParamBanking/RERUN_NOTES.md
    • records/track_10min_16mb/2026-03-28_TurboMuon_EngramLite_ParamBanking/logs/repro_pr1089_latest_seed42_20260330_072712.txt

Test plan

  • Run `SEED=42 MAX_WALLCLOCK_SECONDS=600 torchrun --standalone --nproc_per_node=8 train_gpt.py`
  • Compare the resulting log against the bundled `train_seed42.log`
  • Record the environment and exact rerun outputs in-repo

Made with Cursor

mikeapedia and others added 9 commits March 29, 2026 09:41
…xed-Precision

11L/512d GPT with Turbo-Muon (AOL+Polar Express+row_col), EngramLite hash
embeddings, U-Net skip connections, Parameter Banking, GPTQ mixed-precision
int6/int7 with Hessian sensitivity, brotli compression.

Dev-run benchmark: 1.1119 val_bpb (sliding window, 1xH100).
Awaiting 3-seed validation on 8xH100 before opening PR.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Syncs version requirements with pyproject.toml. torch>=2.11 is needed for
torch.compile fullgraph improvements and CUDA 13.0 support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Seeds: 42, 1337, 2025
Mean val_bpb (SW s64): 1.1086
Max artifact bytes: 15997089

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… budget

Removed periodic eval_val during training loop (fired at step 0 and 4000)
and the diagnostic post-EMA eval. These burned ~10-15s of wallclock on
evals that don't affect the final score — the real evaluation happens in
the post-quantization sweep. Reclaimed time yields ~100-150 extra training
steps at steady-state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
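The commit's "~100-150 extra training steps" claim follows from simple arithmetic on the numbers it states; a quick check (values taken from the message above, not from the codebase):

```python
# ~10-15s of reclaimed wallclock at the steady-state ~93 ms/step
# corresponds to roughly 100-160 extra training steps.
step_ms = 93
for reclaimed_s in (10, 15):
    extra_steps = reclaimed_s * 1000 // step_ms
    print(f"{reclaimed_s}s reclaimed -> ~{extra_steps} extra steps")
```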
AST dead-code removal + pyminify + LZMA/base85 self-extracting wrapper.
Reduces train_gpt.py code_bytes, freeing artifact budget for model weights.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
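The commit describes an LZMA + base85 self-extracting wrapper for train_gpt.py. A minimal sketch of that general pattern (this is an assumption about the technique, not the actual packer from the PR; the `pack` function name is hypothetical):

```python
import base64
import lzma

def pack(source_path: str, out_path: str) -> None:
    """Compress a Python source file into a small self-extracting wrapper.

    The wrapper embeds the LZMA-compressed, base85-encoded source and
    exec()s it on startup, trading startup decompression for code bytes.
    """
    with open(source_path, "rb") as f:
        payload = base64.b85encode(lzma.compress(f.read())).decode("ascii")
    stub = (
        "import base64, lzma\n"
        f"_SRC = {payload!r}\n"
        "exec(lzma.decompress(base64.b85decode(_SRC)).decode('utf-8'))\n"
    )
    with open(out_path, "w") as f:
        f.write(stub)
```

The real wrapper additionally applies AST dead-code removal and pyminify before compression, which the sketch omits.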
…d architecture details

- Added XSA (Cross-Sequence Attention) all 11 layers as key innovation
- Fixed quantization description: int5 baseline with selective promotion to int6/int7
- Clarified GPTQ Hessian collection runs within training budget (14s reserved)
- Added architecture details: Partial RoPE, LN Scale, logit softcap, GQA, tied embeddings, QK gain

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Changes from previous run:
- Removed mid-training and diagnostic eval_val calls (reclaims ~10-15s for training)
- Shrunk train_gpt.py via AST pruning + pyminify + LZMA (125KB -> 24KB, frees ~99KB code budget)
- Human-readable source preserved as train_gpt_human.py

3-seed mean val_bpb (SW s64): 1.1091
Max artifact bytes: 15993904

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Observed GPTQ times across 3 seeds: 7.13s, 7.16s, 7.25s (max 7.25s).
9s reserve gives 1.75s safety margin (24% headroom) while freeing ~5s
of training budget (~53 extra steps at 93ms/step).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
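The reserve-budget arithmetic in this commit can be verified directly (all inputs copied from the message above; the snippet is a cross-check, not project code):

```python
# Check the 9s GPTQ reserve arithmetic: safety margin, headroom, freed steps.
observed_gptq_s = [7.13, 7.16, 7.25]   # GPTQ times across the 3 seeds
new_reserve_s, old_reserve_s = 9.0, 14.0
step_ms = 93

margin_s = new_reserve_s - max(observed_gptq_s)   # 1.75s safety margin
headroom = margin_s / max(observed_gptq_s)        # ~24% over worst observed
extra_steps = int((old_reserve_s - new_reserve_s) * 1000 / step_ms)  # ~53
print(f"margin={margin_s:.2f}s headroom={headroom:.0%} extra_steps~{extra_steps}")
```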
Add the raw rerun log and a comparison note so the slower step time, reduced training steps, and resulting BPB gap are recorded against the published seed-42 run.

Made-with: Cursor
@AnirudhRahul changed the title from "Rerun of PR #1089 on latest head" to "review: Rerun of PR #1089" on Mar 30, 2026