
review: Rerun of PR #1089 #1126

Open

AnirudhRahul wants to merge 9 commits into openai:main from AnirudhRahul:rerun

Conversation

@AnirudhRahul

Summary

  • rerun the latest fetched #1089 head (215193e) via the executable submission wrapper train_gpt.py, and commit the raw seed-42 rerun log plus a comparison note
  • document the rerun environment: 8x H100 80GB, driver 565.57.01, Python 3.12.13, torch 2.11.0+cu126, torch.version.cuda == 12.6
  • show that the rerun is materially slower than the bundled published seed 42 run: final step_avg 105.06ms vs 93.26ms (about 12.65% slower), with 5579 steps vs 6284
  • report the rerun seed-42 metrics: final_int6_sliding_window_exact val_bpb: 1.11228538 and final_int6_roundtrip_exact val_bpb: 1.13606522
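The slowdown figure in the summary can be cross-checked with a few lines of Python (values copied from the bullets above; this snippet is illustrative and not part of the submission):

```python
# Sanity-check the reported slowdown of the rerun vs the published seed-42 run.
published_step_avg_ms = 93.26   # bundled published run
rerun_step_avg_ms = 105.06      # this rerun

slowdown = (rerun_step_avg_ms - published_step_avg_ms) / published_step_avg_ms
print(f"rerun is {slowdown:.2%} slower per step")
```

This reproduces the "about 12.65% slower" figure quoted above.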

Notes

  • reference published seed 42 values from this record: 1.10859491 sliding-window exact, 1.13238846 roundtrip exact, and 93.26ms final step_avg
  • the latest fetched PR head updates README.md and train_gpt_human.py, but the runnable compressed train_gpt.py wrapper still logs `gptq:reserving 14000ms`, so the executable code path does not reflect the README-only change to a 9000ms reserve
  • rerun artifacts added in this branch:
    • records/track_10min_16mb/2026-03-28_TurboMuon_EngramLite_ParamBanking/RERUN_NOTES.md
    • records/track_10min_16mb/2026-03-28_TurboMuon_EngramLite_ParamBanking/logs/repro_pr1089_latest_seed42_20260330_072712.txt

Test plan

  • Run `SEED=42 MAX_WALLCLOCK_SECONDS=600 torchrun --standalone --nproc_per_node=8 train_gpt.py`
  • Compare the resulting log against the bundled `train_seed42.log`
  • Record the environment and exact rerun outputs in-repo

Made with Cursor

mikeapedia and others added 9 commits March 29, 2026 09:41
…xed-Precision

11L/512d GPT with Turbo-Muon (AOL+Polar Express+row_col), EngramLite hash
embeddings, U-Net skip connections, Parameter Banking, GPTQ mixed-precision
int6/int7 with Hessian sensitivity, brotli compression.

Dev-run benchmark: 1.1119 val_bpb (sliding window, 1xH100).
Awaiting 3-seed validation on 8xH100 before opening PR.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Syncs version requirements with pyproject.toml. torch>=2.11 is needed for
torch.compile fullgraph improvements and CUDA 13.0 support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Seeds: 42, 1337, 2025
Mean val_bpb (SW s64): 1.1086
Max artifact bytes: 15997089

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… budget

Removed periodic eval_val during training loop (fired at step 0 and 4000)
and the diagnostic post-EMA eval. These burned ~10-15s of wallclock on
evals that don't affect the final score — the real evaluation happens in
the post-quantization sweep. Reclaimed time yields ~100-150 extra training
steps at steady-state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
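The commit's "~100-150 extra training steps" claim follows from simple arithmetic on the numbers it states; a quick check (values taken from the message above, not from the codebase):

```python
# ~10-15s of reclaimed wallclock at the steady-state ~93 ms/step
# corresponds to roughly 100-160 extra training steps.
step_ms = 93
for reclaimed_s in (10, 15):
    extra_steps = reclaimed_s * 1000 // step_ms
    print(f"{reclaimed_s}s reclaimed -> ~{extra_steps} extra steps")
```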
AST dead-code removal + pyminify + LZMA/base85 self-extracting wrapper.
Reduces train_gpt.py code_bytes, freeing artifact budget for model weights.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
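The commit describes an LZMA + base85 self-extracting wrapper for train_gpt.py. A minimal sketch of that general pattern (this is an assumption about the technique, not the actual packer from the PR; the `pack` function name is hypothetical):

```python
import base64
import lzma

def pack(source_path: str, out_path: str) -> None:
    """Compress a Python source file into a small self-extracting wrapper.

    The wrapper embeds the LZMA-compressed, base85-encoded source and
    exec()s it on startup, trading startup decompression for code bytes.
    """
    with open(source_path, "rb") as f:
        payload = base64.b85encode(lzma.compress(f.read())).decode("ascii")
    stub = (
        "import base64, lzma\n"
        f"_SRC = {payload!r}\n"
        "exec(lzma.decompress(base64.b85decode(_SRC)).decode('utf-8'))\n"
    )
    with open(out_path, "w") as f:
        f.write(stub)
```

The real wrapper additionally applies AST dead-code removal and pyminify before compression, which the sketch omits.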
…d architecture details

- Added XSA (Cross-Sequence Attention) all 11 layers as key innovation
- Fixed quantization description: int5 baseline with selective promotion to int6/int7
- Clarified GPTQ Hessian collection runs within training budget (14s reserved)
- Added architecture details: Partial RoPE, LN Scale, logit softcap, GQA, tied embeddings, QK gain

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Changes from previous run:
- Removed mid-training and diagnostic eval_val calls (reclaims ~10-15s for training)
- Shrunk train_gpt.py via AST pruning + pyminify + LZMA (125KB -> 24KB, frees ~99KB code budget)
- Human-readable source preserved as train_gpt_human.py

3-seed mean val_bpb (SW s64): 1.1091
Max artifact bytes: 15993904

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Observed GPTQ times across 3 seeds: 7.13s, 7.16s, 7.25s (max 7.25s).
9s reserve gives 1.75s safety margin (24% headroom) while freeing ~5s
of training budget (~53 extra steps at 93ms/step).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
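The reserve-budget arithmetic in this commit can be verified directly (all inputs copied from the message above; the snippet is a cross-check, not project code):

```python
# Check the 9s GPTQ reserve arithmetic: safety margin, headroom, freed steps.
observed_gptq_s = [7.13, 7.16, 7.25]   # GPTQ times across the 3 seeds
new_reserve_s, old_reserve_s = 9.0, 14.0
step_ms = 93

margin_s = new_reserve_s - max(observed_gptq_s)   # 1.75s safety margin
headroom = margin_s / max(observed_gptq_s)        # ~24% over worst observed
extra_steps = int((old_reserve_s - new_reserve_s) * 1000 / step_ms)  # ~53
print(f"margin={margin_s:.2f}s headroom={headroom:.0%} extra_steps~{extra_steps}")
```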
Add the raw rerun log and a comparison note so the slower step time, reduced training steps, and resulting BPB gap are recorded against the published seed-42 run.

Made-with: Cursor
@AnirudhRahul changed the title from "Rerun of PR #1089 on latest head" to "review: Rerun of PR #1089" on Mar 30, 2026