Non-record: 11L PR315 Backout + Native FA3 RunPod (val_bpb=1.1247) #394
greqone wants to merge 1 commit into openai:main from
Conversation
Pull request overview
Adds a new non-record 10-minute / 16MB artifact-cap submission folder under records/track_non_record_16mb, packaging a self-contained train_gpt.py snapshot plus run artifacts for an 8xH100 SXM (RunPod) run using native FlashAttention (FA3) and torch.compile.
Changes:
- Add a self-contained training script (train_gpt.py) with inlined FlashAttention interface logic, Backout residual, and sliding-window evaluation.
- Include exact run artifacts (train.log) and metadata (submission.json) for the reported val_bpb=1.12467423.
- Add reproducibility notes (README.md) and a minimal dependency list (requirements.txt).
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| records/track_non_record_16mb/2026-03-22_11L_PR315_Backout_FA3_RunPod/train_gpt.py | Self-contained training + export + int6 quant + sliding-window eval script for the submission run. |
| records/track_non_record_16mb/2026-03-22_11L_PR315_Backout_FA3_RunPod/train.log | Captured training/eval log for the submitted run. |
| records/track_non_record_16mb/2026-03-22_11L_PR315_Backout_FA3_RunPod/submission.json | Leaderboard-style metadata for the non-record entry. |
| records/track_non_record_16mb/2026-03-22_11L_PR315_Backout_FA3_RunPod/requirements.txt | Dependencies needed to reproduce locally (per repo guidance). |
| records/track_non_record_16mb/2026-03-22_11L_PR315_Backout_FA3_RunPod/README.md | Run description, artifact accounting, and reproduction command. |
```python
    (int(k.split(".")[1]) for k in state_dict if k.startswith("blocks.")),
    default=0,
) + 1
late_k_layers = set(range(num_layers_total - 2, num_layers_total))
```
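For context, the snippet's layer-count idiom parses the numeric block index out of each parameter name rather than sorting keys lexicographically (which would order "10" before "2"). A minimal standalone illustration, using a hypothetical state_dict layout:

```python
# Infer the total layer count from keys like "blocks.<i>...", mirroring the
# max-over-parsed-indices idiom in the quoted snippet. The key names below are
# illustrative, not the actual submission's parameter names.
state_dict = {
    "embed.weight": None,
    "blocks.0.attn.weight": None,
    "blocks.1.attn.weight": None,
    "blocks.10.mlp.weight": None,  # double-digit index: int() parse, not string sort
}

num_layers_total = max(
    (int(k.split(".")[1]) for k in state_dict if k.startswith("blocks.")),
    default=0,
) + 1

print(num_layers_total)  # → 11

# The last two layers, as selected by the late_k_layers expression above.
late_k_layers = set(range(num_layers_total - 2, num_layers_total))
print(sorted(late_k_layers))  # → [9, 10]
```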
late_k_layers is computed but never used, which makes the quantization logic harder to follow and suggests a partially removed feature. Consider deleting it or wiring it into the intended “late-K passthrough” behavior so the code matches the stated design.
Suggested change:
```diff
-late_k_layers = set(range(num_layers_total - 2, num_layers_total))
```
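If the intent is to wire late_k_layers in rather than delete it, a hedged sketch of what a "late-K passthrough" predicate could look like: skip int6 quantization for K-projection weights in the last two layers. The key layout and the k_proj name are assumptions for illustration, not confirmed by the snapshot:

```python
def should_quantize(key: str, late_k_layers: set[int]) -> bool:
    """Return False for keys that should bypass int6 quantization.

    Hypothetical sketch: the real submission may name its K projection
    differently or gate passthrough on other criteria.
    """
    parts = key.split(".")
    if len(parts) >= 3 and parts[0] == "blocks" and parts[1].isdigit():
        layer = int(parts[1])
        # Assumed key fragment for the K projection in attention blocks.
        if layer in late_k_layers and "k_proj" in key:
            return False
    return True


late_k_layers = {9, 10}
print(should_quantize("blocks.10.attn.k_proj.weight", late_k_layers))  # → False
print(should_quantize("blocks.3.attn.k_proj.weight", late_k_layers))   # → True
print(should_quantize("blocks.10.mlp.fc1.weight", late_k_layers))      # → True
```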
```python
train_gpt_submit.py — Submission v2: wider MLP + STE int6 QAT + MTP + seq2048 + NTK RoPE +
fp16 embed + late-K passthrough + sliding window eval.
```
The module docstring refers to train_gpt_submit.py and lists features (e.g., “fp16 embed”, “MTP”) that don’t clearly match the actual defaults/implementation in this file. This can be confusing when someone audits the submission; consider updating the docstring to reflect the actual filename and the concrete features enabled in this snapshot (or remove the feature list).
Suggested change:
```diff
-train_gpt_submit.py — Submission v2: wider MLP + STE int6 QAT + MTP + seq2048 + NTK RoPE +
-fp16 embed + late-K passthrough + sliding window eval.
+Training script for GPT models used in parameter-golf submissions.
+
+This module's behavior is defined by the hyperparameters and options below; refer to
+the code and configuration flags instead of this docstring for an authoritative list
+of enabled features.
```
```python
def _load_system_flash_attn_interface():
    for entry in sys.path:
        if not entry:
            continue
        try:
            resolved = Path(entry).resolve()
        except OSError:
            continue
        candidate = resolved / "flash_attn_interface.py"
        if not candidate.exists() or candidate.resolve() == here:
            continue
        if repo_root in candidate.resolve().parents:
            continue
        spec = importlib.util.spec_from_file_location("_system_flash_attn_interface", candidate)
        if spec is None or spec.loader is None:
            continue
        module = importlib.util.module_from_spec(spec)
        sys.modules[spec.name] = module
        spec.loader.exec_module(module)
        fn = getattr(module, "flash_attn_func", None)
        if callable(fn):
            return fn
    return None
```
_load_system_flash_attn_interface() dynamically locates and executes an arbitrary flash_attn_interface.py from sys.path. This is a code-execution footgun (and can make runs non-reproducible if sys.path differs). Consider removing this path-walk entirely, or gating it behind an explicit env var that points to a known file and validating it’s in an expected location (e.g., site-packages) before importing.
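A hedged sketch of the reviewer's suggestion: no implicit sys.path walk at all, and the import happens only when an explicit environment variable names the file, after validating it lives under site-packages. The variable name FLASH_ATTN_INTERFACE_PATH and the function name are assumptions for illustration, not part of the PR:

```python
import importlib.util
import os
import site
from pathlib import Path


def load_flash_attn_interface():
    """Import flash_attn_interface.py only from an explicitly named, validated path.

    Returns the module's flash_attn_func if present, else None. The env var name
    is a hypothetical choice, not the submission's actual interface.
    """
    override = os.environ.get("FLASH_ATTN_INTERFACE_PATH")
    if override is None:
        return None  # opt-in only: no import unless the user points at a file
    candidate = Path(override).resolve()
    if not candidate.is_file() or candidate.name != "flash_attn_interface.py":
        raise ValueError(f"refusing to import {candidate}: not a flash_attn_interface.py file")
    # Require the target to live under a known site-packages directory.
    allowed_roots = [Path(p).resolve() for p in site.getsitepackages()]
    if not any(root in candidate.parents for root in allowed_roots):
        raise ValueError(f"refusing to import {candidate}: outside site-packages")
    spec = importlib.util.spec_from_file_location("_system_flash_attn_interface", candidate)
    if spec is None or spec.loader is None:
        return None
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    fn = getattr(module, "flash_attn_func", None)
    return fn if callable(fn) else None
```

This keeps the happy path (a properly installed flash-attn wheel) working while refusing to execute arbitrary files that merely happen to sit on sys.path.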
```python
        except OSError:
            continue
        candidate = resolved / "flash_attn_interface.py"
        if not candidate.exists() or candidate.resolve() == here:
```
In _load_system_flash_attn_interface, the check candidate.resolve() == here will never be true because candidate is flash_attn_interface.py while here is train_gpt.py. If the intent is to avoid importing a repo-local helper, consider removing this condition (the subsequent repo_root parent check already covers it) or comparing against the actual helper path.
Suggested change:
```diff
-        if not candidate.exists() or candidate.resolve() == here:
+        if not candidate.exists():
```
Summary
- 8xH100 SXM PR315-style run plus Backout
- train_gpt.py, requirements.txt, submission.json, and README

Result
- val_bpb = 1.12467423
- 1.89896029
- 15,545,662
- 8xH100 SXM on RunPod with native Hopper FlashAttention and torch.compile

Notes
- flash_attn_interface.py: for this submission folder that helper is inlined into train_gpt.py, so the package is self-contained and closer to the repo guidance that counted code should live in train_gpt.py
- records/track_non_record_16mb/...