Closed
Conversation
Training-only improvements on the 9x512 baseline: - Quantization-aware training (STE) during last 30% of wallclock recovers post-quant BPB degradation - Gradient clipping at 1.0 stabilizes training - No architectural changes Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Adds a new 10-minute/16MB record submission that aims to improve post-quantization BPB by enabling quantization-aware training (STE) late in training, and stabilizes training by enabling gradient clipping by default.
Changes:
- Introduces QAT (STE-based) for
CastedLinearweights and tied embedding projection, activated after a configurable fraction of wallclock time. - Changes default
GRAD_CLIP_NORMfrom disabled (0.0) to 1.0. - Adds record metadata (
submission.json) and an explanatory README for the submission.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| records/track_10min_16mb/2026-03-18_QAT_GradClip_Conservative/train_gpt.py | Adds QAT activation logic + STE quantization in forward pass; changes grad clipping default. |
| records/track_10min_16mb/2026-03-18_QAT_GradClip_Conservative/submission.json | Adds submission metadata entry for the new record. |
| records/track_10min_16mb/2026-03-18_QAT_GradClip_Conservative/README.md | Documents the training-only changes and new hyperparameters. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+32
to
+35
| | Parameter | Default | Description | | ||
| |-----------|---------|-------------| | ||
| | `GRAD_CLIP_NORM` | 1.0 | Max gradient norm (0 = disabled) | | ||
| | `QAT_START_FRAC` | 0.70 | Fraction of wallclock after which QAT activates | |
| with torch.no_grad(): | ||
| row_max = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) | ||
| scale = row_max / 127.0 | ||
| w_q = (w / scale).round().clamp(-128, 127) * scale |
Comment on lines
+514
to
+518
| with torch.no_grad(): | ||
| row_max = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) | ||
| scale = row_max / 127.0 | ||
| w_q = (w / scale).round().clamp(-128, 127) * scale | ||
| return w + (w_q - w).detach() |
Comment on lines
+7
to
+10
| "val_loss": 0.0, | ||
| "val_bpb": 0.0, | ||
| "bytes_total": 0, | ||
| "bytes_code": 0 |
|
|
||
| The baseline loses ~0.0072 BPB when post-training quantization converts fp32 weights to int8. QAT uses a straight-through estimator to simulate int8 per-row quantization during training: the forward pass sees quantized weights, but gradients flow through as if quantization didn't happen. This teaches the model to place its weights in regions that survive int8 rounding, recovering most of the quantization gap. | ||
|
|
||
| QAT is applied to: |
gb250e
referenced
this pull request
in gb250e/parameter-golf
Mar 21, 2026
dhruvjatkar
pushed a commit
to dhruvjatkar/parameter-golf
that referenced
this pull request
Mar 25, 2026
PR openai#672 maxes TTT at 30 epochs (590s/600s eval budget), so all future improvements must be orthogonal to TTT. This update: - Sets 1.0781 BPB (PR openai#672) as the new target to beat - Reorders Top 8 directions: XSA-all confirmed at #1, Full GPTQ #2, SwiGLU #3, Muon-VS #4, aggressive quant openai#5, MASA openai#6, depth recurrence openai#7 with int6 risk warning, AdEMAMix openai#8 - Deprioritizes TTT-related directions already exploited by PR openai#672 - Collapses ~1000 lines of stale Round 0-3.9 session logs into a concise historical summary - Removes resolved blockers (flash_attn, SSH hangs, local runtime) - Adds fresh Round 1 section with 5 submitted experiments Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dhruvjatkar
pushed a commit
to dhruvjatkar/parameter-golf
that referenced
this pull request
Mar 25, 2026
PR openai#672 maxes TTT at 30 epochs (590s/600s eval budget), so all future improvements must be orthogonal to TTT. This update: - Sets 1.0781 BPB (PR openai#672) as the new target to beat - Reorders Top 8 directions: XSA-all confirmed at #1, Full GPTQ #2, SwiGLU #3, Muon-VS #4, aggressive quant #5, MASA openai#6, depth recurrence openai#7 with int6 risk warning, AdEMAMix openai#8 - Deprioritizes TTT-related directions already exploited by PR openai#672 - Collapses ~1000 lines of stale Round 0-3.9 session logs into a concise historical summary - Removes resolved blockers (flash_attn, SSH hangs, local runtime) - Adds fresh Round 1 section with 5 submitted experiments Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
deborahnelson8788726
pushed a commit
to deborahnelson8788726/parameter-golf
that referenced
this pull request
Apr 2, 2026
- Fix openai#1: ternary roundtrip eval on ALL ranks with dist.broadcast (was: only rank 0 loaded weights → invalid eval results) - Fix openai#2: pass pre-computed scales to export (avoids double-quantization) - Fix openai#3: keep scales as float32 (was: lossy float16 cast) - Fix openai#4: import returns float32 (was: lossy bfloat16 cast) - Fix openai#5: lower z_loss from 1e-4 to 1e-5 (prevents loss explosion) - Fix openai#6: add dist.broadcast after int8 roundtrip load too - Fix openai#7: add weights_only=False to suppress FutureWarning Ternary roundtrip is now LOSSLESS (max error = 0.0). The previous val_bpb=0.9650 was an artifact of bug openai#1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Opened by mistake by an automated agent. Please delete this PR and its branch. Apologies for the noise.