Mixed-Precision 11L/512d GPT with Turbo-Muon (AOL+Polar Express+row_col), EngramLite hash embeddings, U-Net skip connections, Parameter Banking, GPTQ mixed-precision int6/int7 with Hessian sensitivity, and brotli compression.

Dev-run benchmark: 1.1119 val_bpb (sliding window, 1xH100). Awaiting 3-seed validation on 8xH100 before opening the PR.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Syncs version requirements with pyproject.toml. torch>=2.11 is needed for torch.compile fullgraph improvements and CUDA 13.0 support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Seeds: 42, 1337, 2025
Mean val_bpb (SW s64): 1.1086
Max artifact bytes: 15997089

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… budget

Removed the periodic eval_val calls in the training loop (they fired at steps 0 and 4000) and the diagnostic post-EMA eval. These burned ~10-15s of wallclock on evals that don't affect the final score; the real evaluation happens in the post-quantization sweep. The reclaimed time yields ~100-150 extra training steps at steady-state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
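The step-count claim above is simple budget arithmetic. A minimal sketch, assuming a ~100ms steady-state step time (the record's measured step_avg is ~93ms, so 100ms is a conservative round number):

```python
# Back-of-envelope check on the reclaimed eval budget: ~10-15s of wallclock
# at ~100ms/step is roughly 100-150 extra training steps.
step_ms = 100                       # assumed steady-state step time, ms
extra_low = 10_000 // step_ms       # 10s reclaimed -> 100 steps
extra_high = 15_000 // step_ms      # 15s reclaimed -> 150 steps
```

Integer milliseconds keep the arithmetic exact; at the measured 93ms/step the range would be slightly higher (~107-161 steps).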
AST dead-code removal + pyminify + LZMA/base85 self-extracting wrapper. Reduces train_gpt.py code_bytes, freeing artifact budget for model weights.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
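A minimal sketch of the LZMA/base85 self-extracting idea (the AST pruning and pyminify passes are omitted; the payload string here is a hypothetical stand-in for the minified train_gpt.py source):

```python
import base64
import lzma

# Hypothetical payload standing in for the minified train_gpt.py source.
source = "print('hello from the compressed payload')\n"

# Pack: LZMA-compress the source, then base85-encode so the blob is
# printable ASCII that can be embedded directly in a .py file.
blob = base64.b85encode(lzma.compress(source.encode())).decode()

# Emit a self-extracting wrapper: when run, it decodes, decompresses,
# and exec()s the original module source.
wrapper = (
    "import base64, lzma\n"
    f"exec(lzma.decompress(base64.b85decode({blob!r})).decode())\n"
)

# Running the wrapper executes the embedded source.
exec(wrapper)
```

LZMA is competitive for Python source at this size, and base85 adds only ~25% encoding overhead versus ~33% for base64, which matters when code_bytes counts against the artifact budget.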
…d architecture details

- Added XSA (Cross-Sequence Attention) in all 11 layers as the key innovation
- Fixed the quantization description: int5 baseline with selective promotion to int6/int7
- Clarified that GPTQ Hessian collection runs within the training budget (14s reserved)
- Added architecture details: Partial RoPE, LN Scale, logit softcap, GQA, tied embeddings, QK gain

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
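The "int5 baseline with selective promotion" scheme can be sketched as a simple ranking step. This is an illustrative toy, not the record's implementation: the layer names, scores, and promotion counts are made up, and in the real pipeline the sensitivity scores would come from GPTQ's Hessian-weighted quantization error rather than hard-coded values.

```python
# Toy per-layer sensitivity scores (stand-ins for Hessian-derived
# quantization sensitivity collected during calibration).
sensitivity = {"attn.0": 0.9, "mlp.0": 0.2, "attn.5": 0.7,
               "mlp.5": 0.1, "attn.10": 0.4}

def assign_bits(sens, base_bits=5, promote7=1, promote6=2):
    """Every layer starts at the int5 baseline; the most sensitive
    layer is promoted to int7, the next tier to int6."""
    ranked = sorted(sens, key=sens.get, reverse=True)
    bits = {name: base_bits for name in sens}
    for name in ranked[:promote7]:
        bits[name] = 7
    for name in ranked[promote7:promote7 + promote6]:
        bits[name] = 6
    return bits

bits = assign_bits(sensitivity)
# Most sensitive layer gets int7, the next two int6, the rest stay int5.
```

The promotion budget (how many layers get 6 or 7 bits) would in practice be chosen to keep the compressed artifact under the 16MB cap.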
Changes from previous run:
- Removed mid-training and diagnostic eval_val calls (reclaims ~10-15s for training)
- Shrunk train_gpt.py via AST pruning + pyminify + LZMA (125KB -> 24KB, frees ~99KB of code budget)
- Human-readable source preserved as train_gpt_human.py

3-seed mean val_bpb (SW s64): 1.1091
Max artifact bytes: 15993904

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Observed GPTQ times across 3 seeds: 7.13s, 7.16s, 7.25s (max 7.25s). A 9s reserve gives a 1.75s safety margin (24% headroom) while freeing ~5s of training budget (~53 extra steps at 93ms/step).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
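The margin and step-count figures above follow directly from the quoted timings; a quick check of the arithmetic:

```python
# Sanity-check the reserve arithmetic from the observed GPTQ timings.
observed = [7.13, 7.16, 7.25]           # GPTQ wallclock, seconds, 3 seeds
old_reserve, new_reserve = 14.0, 9.0    # seconds
step_time = 0.093                        # seconds per training step

margin = new_reserve - max(observed)            # 1.75s of safety
headroom = margin / max(observed)               # ~0.24 -> 24% headroom
extra_steps = (old_reserve - new_reserve) / step_time  # ~53 steps freed
```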
Add the raw rerun log and a comparison note so the slower step time, reduced training steps, and resulting BPB gap are recorded against the published seed-42 run.

Made-with: Cursor
Summary
- Rerun the #1089 head (215193e) from the executable submission wrapper `train_gpt.py`, and commit the raw seed-42 rerun log plus a comparison note.
- Environment: driver 565.57.01, Python 3.12.13, torch 2.11.0+cu126, `torch.version.cuda == 12.6`.
- Seed 42 run: final `step_avg` 105.06ms vs 93.26ms (about 12.65% slower), with 5579 steps vs 6284.
- `final_int6_sliding_window_exact val_bpb: 1.11228538` and `final_int6_roundtrip_exact val_bpb: 1.13606522`.

Notes
- Published seed-42 values from this record: 1.10859491 sliding-window exact, 1.13238846 roundtrip exact, and 93.26ms final `step_avg`.
- The 9000ms GPTQ reserve appears in `README.md` and `train_gpt_human.py`, but the runnable compressed `train_gpt.py` wrapper still logs `gptq: reserving 14000ms`, so the executable code path does not reflect the README-only 9000ms change.
- Files added: `records/track_10min_16mb/2026-03-28_TurboMuon_EngramLite_ParamBanking/RERUN_NOTES.md` and `records/track_10min_16mb/2026-03-28_TurboMuon_EngramLite_ParamBanking/logs/repro_pr1089_latest_seed42_20260330_072712.txt`.

Test plan
- Ran `SEED=42 MAX_WALLCLOCK_SECONDS=600 torchrun --standalone --nproc_per_node=8 train_gpt.py`, capturing output to `train_seed42.log`.

Made with Cursor